Find the Longest Common Starting Substring in a Set of Strings

Find the longest common starting substring in a set of strings

It's a matter of taste, but this is a simple javascript version:
It sorts the array, and then looks just at the first and last items.

//longest common starting substring in an array

function sharedStart(array){
var A= array.concat().sort(),
a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
return a1.substring(0, i);
}

DEMOS

sharedStart(['interspecies', 'interstelar', 'interstate'])  //=> 'inters'
sharedStart(['throne', 'throne']) //=> 'throne'
sharedStart(['throne', 'dungeon']) //=> ''
sharedStart(['cheese']) //=> 'cheese'
sharedStart([]) //=> ''
sharedStart(['prefix', 'suffix']) //=> ''

R implementation for Finding the longest common starting substrings in a set of strings

Taking inspiration from what you suggested, you can try this function :

comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}

UPDATE

Following @nicola suggestion, a simpler and more elegant variant for the function:

comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)

Find the Longest Common starting substring of S2 in S1

Here is a solution that emits the faint aroma of a hack.

Suppose

s1 = 'snowballing'
s2 = 'baller'

Then form the string

s = s2 + '|' + s1
#=> 'baller|snowballing'

where the pipe ('|') can be any character that is not in either string. (If in doubt, one could use, say, "\x00".)

We may then match s against the regular expression

^(.*)(?=.*\|.*\1)

This will match the longest starting string in s2 that is present in s1, which in this example is 'ball'.

Demo

The regular expression can be broken down as follows.

^        # match beginning of string
( # begin capture group 1
.* # match zero or more characters, as many as possible
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more characters, as many as possible
\| # match '|'
.* # match zero or more characters, as many as possible
\1 # match the contents of capture group 1
) # end positive lookahead

Group strings by longest common starting substring

If you just want to simply group by longest common prefix (that means "apple ipad mini" will be chosen even though "apple ipad" would yield a larger group), then maybe something like this?

sort the list
i = 0
while i < end of list:
key = longest common prefix of list[i] & list[i + 1]
or list[i] if the common prefix is less than (1?) words or i is the last index
groups[key] = list[i++]
while key is prefix of list[i]:
add list[i++] to groups[key]

Any Framework functions helping to find the longest common starting substring of multiple strings?

If anyone is interested, here's what I came up with:

    public static string GetCommonStartingSubString(IList<string> strings)
{
if (strings.Count == 0)
return "";
if (strings.Count == 1)
return strings[0];
int charIdx = 0;
while (IsCommonChar(strings, charIdx))
++charIdx;
return strings[0].Substring(0, charIdx);
}
private static bool IsCommonChar(IList<string> strings, int charIdx)
{
if(strings[0].Length <= charIdx)
return false;
for (int strIdx = 1; strIdx < strings.Count; ++strIdx)
if (strings[strIdx].Length <= charIdx
|| strings[strIdx][charIdx] != strings[0][charIdx])
return false;
return true;
}

Finding Longest Common Substring with starting indexes

I am assuming if these are the two strings :
s1 = "abcdxyz"
s2 = "xyzabcd"
then since abcd is longest common substring so you need index of this substring in both s1 and s2 which is 0,3 respectively.

There are two solution for this :

Solution 1 :

Here, I have created an index array where I am storing the starting index of both the string with index 0 of index array storing for s1 and index 1 storing for s2.

public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
indexes[0]=(i+1) - ans.len;
indexes[1]=(j+1) - ans.len;
}
ans.i = i;
ans.j = j;

}

}
int[] c = a;
a = b;
b = c;
}
return ans;
}

Solution 2 :

I am not sure what your Answer object i and j values are doing but we can make them as well store these values with i storing for s1 string and j storing for s2 string instead of creating different index array as in Solution 1.

public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}

}

}
int[] c = a;
a = b;
b = c;
}
return ans;
}

Currently this doesn't calculate LCS right . Issue is you are not making array a as empty after running your second loop each time because of which if characters do not match in next run corresponding index of a stores previous value only but it should be 0.

Update code is :

 public Answer  find(String s1,String s2){

int n = s1.length();
int m = s2.length();

Answer ans = new Answer(0, 0, 0);
int[] a;
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
a = new int[m];
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}

}

}
b = a;
}
return ans;
}


Related Topics



Leave a reply



Submit