Longest Common Substring in R Finding Non-Contiguous Matches Between the Two Strings

How to calculate longest common substring anywhere in two strings

One approach might be to look at the transformation sequence produced by adist() and count the characters in the longest contiguous match:

trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))

[1] 3 8 1 3 5

R - Longest common substring

Check out the "Rlibstree" package on omegahat Github

This uses http://www.icir.org/christian/libstree/.

Identify a common pattern

Function LCS from qualV package (in Find common substrings between two character variables; not a possible duplicate) does something else than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.

What you have is the longest common substring problem, for which you could use this algorithm, and here is the code assuming that there is a unique (in terms of length) longest common substring:

a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"

A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]

L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
v <- ones[i, , drop = FALSE]
L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"

Finding Longest Common Substring with starting indexes

I am assuming if these are the two strings :
s1 = "abcdxyz"
s2 = "xyzabcd"
then since abcd is longest common substring so you need index of this substring in both s1 and s2 which is 0,3 respectively.

There are two solution for this :

Solution 1 :

Here, I have created an index array where I am storing the starting index of both the string with index 0 of index array storing for s1 and index 1 storing for s2.

public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
indexes[0]=(i+1) - ans.len;
indexes[1]=(j+1) - ans.len;
}
ans.i = i;
ans.j = j;

}

}
int[] c = a;
a = b;
b = c;
}
return ans;
}

Solution 2 :

I am not sure what your Answer object i and j values are doing but we can make them as well store these values with i storing for s1 string and j storing for s2 string instead of creating different index array as in Solution 1.

public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}

}

}
int[] c = a;
a = b;
b = c;
}
return ans;
}

Currently this doesn't calculate LCS right . Issue is you are not making array a as empty after running your second loop each time because of which if characters do not match in next run corresponding index of a stores previous value only but it should be 0.

Update code is :

 public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a;
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
a = new int[m];
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}

}

}
b = a;
}
return ans;
}

Find common substrings between two character variables

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
collapse = ""))

R implementation for Finding the longest common starting substrings in a set of strings

Taking inspiration from what you suggested, you can try this function :

comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}

UPDATE

Following @nicola suggestion, a simpler and more elegant variant for the function:

comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)


Related Topics



Leave a reply



Submit