Find the longest common starting substring in a set of strings
It's a matter of taste, but this is a simple javascript version:
It sorts the array, and then looks just at the first and last items.
//longest common starting substring in an array
function sharedStart(array){
var A= array.concat().sort(),
a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
return a1.substring(0, i);
}
DEMOS
sharedStart(['interspecies', 'interstelar', 'interstate']) //=> 'inters'
sharedStart(['throne', 'throne']) //=> 'throne'
sharedStart(['throne', 'dungeon']) //=> ''
sharedStart(['cheese']) //=> 'cheese'
sharedStart([]) //=> ''
sharedStart(['prefix', 'suffix']) //=> ''
R implementation for Finding the longest common starting substrings in a set of strings
Taking inspiration from what you suggested, you can try this function :
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}
UPDATE
Following @nicola suggestion, a simpler and more elegant variant for the function:
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}
Examples:
With your data
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"
When there is no common starting substring
x<-c("abc","def")
> comsub(x)
# character(0)
Find the Longest Common starting substring of S2 in S1
Here is a solution that emits the faint aroma of a hack.
Suppose
s1 = 'snowballing'
s2 = 'baller'
Then form the string
s = s2 + '|' + s1
#=> 'baller|snowballing'
where the pipe ('|'
) can be any character that is not in either string. (If in doubt, one could use, say, "\x00"
.)
We may then match s
against the regular expression
^(.*)(?=.*\|.*\1)
This will match the longest starting string in s2
that is present in s1
, which in this example is 'ball'
.
Demo
The regular expression can be broken down as follows.
^ # match beginning of string
( # begin capture group 1
.* # match zero or more characters, as many as possible
) # end capture group 1
(?= # begin a positive lookahead
.* # match zero or more characters, as many as possible
\| # match '|'
.* # match zero or more characters, as many as possible
\1 # match the contents of capture group 1
) # end positive lookahead
Group strings by longest common starting substring
If you just want to simply group by longest common prefix (that means "apple ipad mini" will be chosen even though "apple ipad" would yield a larger group), then maybe something like this?
sort the list
i = 0
while i < end of list:
key = longest common prefix of list[i] & list[i + 1]
or list[i] if the common prefix is less than (1?) words or i is the last index
groups[key] = list[i++]
while key is prefix of list[i]:
add list[i++] to groups[key]
Any Framework functions helping to find the longest common starting substring of multiple strings?
If anyone is interested, here's what I came up with:
public static string GetCommonStartingSubString(IList<string> strings)
{
if (strings.Count == 0)
return "";
if (strings.Count == 1)
return strings[0];
int charIdx = 0;
while (IsCommonChar(strings, charIdx))
++charIdx;
return strings[0].Substring(0, charIdx);
}
private static bool IsCommonChar(IList<string> strings, int charIdx)
{
if(strings[0].Length <= charIdx)
return false;
for (int strIdx = 1; strIdx < strings.Count; ++strIdx)
if (strings[strIdx].Length <= charIdx
|| strings[strIdx][charIdx] != strings[0][charIdx])
return false;
return true;
}
Finding Longest Common Substring with starting indexes
I am assuming if these are the two strings :s1
= "abcdxyz"s2
= "xyzabcd"
then since abcd
is longest common substring so you need index of this substring in both s1 and s2 which is 0,3 respectively.
There are two solution for this :
Solution 1 :
Here, I have created an index
array where I am storing the starting index of both the string with index 0 of index array storing for s1 and index 1 storing for s2.
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
indexes[0]=(i+1) - ans.len;
indexes[1]=(j+1) - ans.len;
}
ans.i = i;
ans.j = j;
}
}
int[] c = a;
a = b;
b = c;
}
return ans;
}
Solution 2 :
I am not sure what your Answer
object i and j values are doing but we can make them as well store these values with i
storing for s1 string and j
storing for s2 string instead of creating different index
array as in Solution 1.
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}
}
}
int[] c = a;
a = b;
b = c;
}
return ans;
}
Currently this doesn't calculate LCS right . Issue is you are not making array a
as empty after running your second loop each time because of which if characters do not match in next run corresponding index of a
stores previous value only but it should be 0.
Update code is :
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a;
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
a = new int[m];
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}
}
}
b = a;
}
return ans;
}
Related Topics
Coordinates of Selected Text in Browser Page
Very Simple, Very Smooth, JavaScript Marquee
Programmatically Selecting Partial Text in an Input Field
Want to Add "Addeventlistener" on Multiple Elements With Same Class
React.Js: Set Innerhtml VS Dangerouslysetinnerhtml
Best Way to Detect That Html5 ≪Canvas≫ Is Not Supported
Html5 Local Storage Fallback Solutions
Will Html5 Allow Web Apps to Make Peer-To-Peer Http Connections
How to Connect HTML Divs With Lines
Dynamically Loading CSS Stylesheet Doesn't Work on Ie
Jquery Event Won't Fire After Ajax Call
Get Iframe'S Document, from JavaScript in Main Document
Convert HTML to Data:Text/Html Link Using JavaScript
How to Make a Jquery "$.Post" Request Synchronous
How to Disable Scroll Without Hiding It