How to calculate longest common substring anywhere in two strings
One approach might be to look at the transformation sequence produced by adist()
and count the characters in the longest contiguous match:
trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))
[1] 3 8 1 3 5
R - Longest common substring
Check out the "Rlibstree" package on omegahat Github
This uses http://www.icir.org/christian/libstree/.
Identify a common pattern
Function LCS
from qualV
package (in Find common substrings between two character variables; not a possible duplicate) does something else than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.
What you have is the longest common substring problem, for which you could use this algorithm, and here is the code assuming that there is a unique (in terms of length) longest common substring:
a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"
A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]
L <- matrix(0, length(A), length(B))
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), ]
for(i in 1:nrow(ones)) {
v <- ones[i, , drop = FALSE]
L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")
# [1] "DUISBURG"
Finding Longest Common Substring with starting indexes
I am assuming if these are the two strings :s1
= "abcdxyz"s2
= "xyzabcd"
then since abcd
is longest common substring so you need index of this substring in both s1 and s2 which is 0,3 respectively.
There are two solution for this :
Solution 1 :
Here, I have created an index
array where I am storing the starting index of both the string with index 0 of index array storing for s1 and index 1 storing for s2.
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
indexes[0]=(i+1) - ans.len;
indexes[1]=(j+1) - ans.len;
}
ans.i = i;
ans.j = j;
}
}
int[] c = a;
a = b;
b = c;
}
return ans;
}
Solution 2 :
I am not sure what your Answer
object i and j values are doing but we can make them as well store these values with i
storing for s1 string and j
storing for s2 string instead of creating different index
array as in Solution 1.
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a = new int[m];
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}
}
}
int[] c = a;
a = b;
b = c;
}
return ans;
}
Currently this doesn't calculate LCS right . Issue is you are not making array a
as empty after running your second loop each time because of which if characters do not match in next run corresponding index of a
stores previous value only but it should be 0.
Update code is :
public Answer find(String s1,String s2){
int n = s1.length();
int m = s2.length();
Answer ans = new Answer(0, 0, 0);
int[] a;
int b[] = new int[m];
int indexes[] = new int[2];
for(int i = 0;i<n;i++){
a = new int[m];
for(int j = 0;j<m;j++){
if(s1.charAt(i)==s2.charAt(j)){
if(i==0 || j==0 )a[j] = 1;
else{
a[j] = b[j-1] + 1;
}
if(a[j]>ans.len) {
ans.len = a[j];
ans.i=(i+1) - ans.len;
ans.j=(j+1) - ans.len;
}
}
}
b = a;
}
return ans;
}
Find common substrings between two character variables
Here's a CRAN package for that:
library(qualV)
sapply(seq_along(a), function(i)
paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
collapse = ""))
R implementation for Finding the longest common starting substrings in a set of strings
Taking inspiration from what you suggested, you can try this function :
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# compute the cumulative sum of common elements
cs_x<-cumsum(d_x[[1]]==d_x[[2]])
# check if there is at least one common element
if(cs_x[1]!=0) {
# see when it stops incrementing and get the position of last common element
der_com<-which(diff(cs_x)==0)[1]
# return the common part
return(substr(x[1],1,der_com))
} else { # else, return an empty vector
return(character(0))
}
}
UPDATE
Following @nicola suggestion, a simpler and more elegant variant for the function:
comsub<-function(x) {
# sort the vector
x<-sort(x)
# split the first and last element by character
d_x<-strsplit(x[c(1,length(x))],"")
# search for the first not common element and so, get the last matching one
der_com<-match(FALSE,do.call("==",d_x))-1
# if there is no matching element, return an empty vector, else return the common part
ifelse(der_com==0,return(character(0)),return(substr(x[1],1,der_com)))
}
Examples:
With your data
x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7",
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"
When there is no common starting substring
x<-c("abc","def")
> comsub(x)
# character(0)
Related Topics
Convert Data.Frame Column to a Vector
Add New Row to Dataframe, at Specific Row-Index, Not Appended
How to Search for "R" Materials
Use Trycatch Skip to Next Value of Loop Upon Error
Using R to List All Files with a Specified Extension
How to Connect Two Coordinates with a Line Using Leaflet in R
Plotting Multiple Time-Series in Ggplot
Why Is Apply() Method Slower Than a for Loop in R
How to Draw Stacked Bars in Ggplot2 That Show Percentages Based on Group
Passing Command Line Arguments to R Cmd Batch
How to Obtain an 'Unbalanced' Grid of Ggplots
Changing Font Size and Direction of Axes Text in Ggplot2
Append Value to Empty Vector in R
How to Make R Beep/Play a Sound at the End of a Script
Explain Ggplot2 Warning: "Removed K Rows Containing Missing Values"