R - Longest Common Substring

R - Longest common substring

Check out the "Rlibstree" package in the omegahat repositories on GitHub.

It uses libstree (http://www.icir.org/christian/libstree/), a suffix tree library.

How to calculate longest common substring anywhere in two strings

One approach might be to look at the transformation sequence produced by adist() and count the characters in the longest contiguous match:

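# string1 (a single string) and vec1 (a character vector) come from the question
trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")
# length of the longest run of matches ("M") in each transformation sequence
sapply(gregexpr("M+", trafos), function(x) max(0, attr(x, "match.length")))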

[1] 3 8 1 3 5
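
As a self-contained illustration of the adist() approach, here is a minimal sketch with made-up inputs (the values of string1 and vec1 below are assumptions, not the data from the question). Keep in mind that adist() aligns the strings so as to minimise the edit distance, so the longest run of "M" is a heuristic: it can be shorter than the true longest common substring.

```r
# Made-up inputs for illustration only
string1 <- "hello world"
vec1 <- c("hello", "yellow world", "xyz")

# One transformation sequence per element of vec1:
# M = match, I = insertion, D = deletion, S = substitution
trafos <- attr(adist(string1, vec1, counts = TRUE), "trafos")

# Longest contiguous run of matches in each sequence (0 if there is none)
run_lengths <- sapply(gregexpr("M+", trafos),
                      function(x) max(0, attr(x, "match.length")))
run_lengths
```

A string that shares no characters with string1, such as "xyz" here, cannot produce any "M", so its entry is 0.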

R implementation for Finding the longest common starting substrings in a set of strings

Taking inspiration from what you suggested, you can try this function:

comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compare only the positions that both strings have
  n <- min(lengths(d_x))
  # compute the cumulative sum of common elements
  cs_x <- cumsum(d_x[[1]][seq_len(n)] == d_x[[2]][seq_len(n)])
  # check if there is at least one common element
  if (n > 0 && cs_x[1] != 0) {
    # see when it stops incrementing and get the position of last common element
    der_com <- which(diff(cs_x) == 0)[1]
    # no mismatch found: the whole compared span is common
    if (is.na(der_com)) der_com <- n
    # return the common part
    return(substr(x[1], 1, der_com))
  } else { # else, return an empty vector
    return(character(0))
  }
}

UPDATE

Following @nicola's suggestion, here is a simpler and more elegant variant of the function:

comsub <- function(x) {
  # sort the vector
  x <- sort(x)
  # split the first and last element by character
  d_x <- strsplit(x[c(1, length(x))], "")
  # compare only the positions that both strings have
  n <- min(lengths(d_x))
  same <- d_x[[1]][seq_len(n)] == d_x[[2]][seq_len(n)]
  # search for the first not common element and so, get the last matching one
  # (if nothing differs, the whole compared span matches)
  der_com <- match(FALSE, same, nomatch = n + 1) - 1
  # if there is no matching element, return an empty vector, else return the common part
  if (der_com == 0) character(0) else substr(x[1], 1, der_com)
}

Examples:

With your data

x<-c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4430-1YKSZ-R7", 
"ADA4431-1YCPZ-R2", "ADA4432-1BCPZ-R7", "ADA4432-1BRJZ-R2")
> comsub(x)
#[1] "ADA44"

When there is no common starting substring

x<-c("abc","def")
> comsub(x)
# character(0)
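
One caveat with the sort-then-compare idea: sort() collates according to the current locale by default, so in many locales "ab" sorts before "Ac", and the first and last elements no longer bracket the set character by character. The sketch below is a hypothetical variant (the name common_prefix and the choice to return "" instead of character(0) are assumptions, not part of the answer above) that uses sort(method = "radix"), which does not respect the locale.

```r
# Hypothetical variant of comsub(): returns "" when there is no common start
common_prefix <- function(x) {
  if (length(x) == 0) return("")
  # radix sorting ignores the locale, so the order is plain lexicographic
  x <- sort(x, method = "radix")
  first <- strsplit(x[1], "")[[1]]
  last <- strsplit(x[length(x)], "")[[1]]
  # compare only the positions that both strings have
  n <- min(length(first), length(last))
  same <- first[seq_len(n)] == last[seq_len(n)]
  # number of leading positions that agree (n if nothing differs)
  len <- match(FALSE, same, nomatch = n + 1) - 1
  substr(x[1], 1, len)
}

common_prefix(c("ADA4417-3ARMZ-R7", "ADA4430-1YKSZ-R2", "ADA4432-1BRJZ-R2"))
common_prefix(c("ab", "Ac", "ad"))
```

With the mixed-case input, the radix order puts "Ac" first, so the differing first character is detected.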

R: find largest common substring starting at the beginning

This will work for an arbitrary vector of words:

words <- c('bestelling', 'bestelbon')
words.split <- strsplit(words, '')
words.split <- lapply(words.split, `length<-`, max(nchar(words)))
words.mat <- do.call(rbind, words.split)
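# TRUE for each column (character position) where the words differ
differs <- apply(words.mat, 2, function(col) length(unique(col)) != 1)
# the common start ends just before the first differing column
# (if no column differs, it spans every column)
common.substr.length <- match(TRUE, differs, nomatch = ncol(words.mat) + 1) - 1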
substr(words[1], 1, common.substr.length)
# [1] "bestel"

Identify a common pattern

The LCS function from the qualV package (suggested in "Find common substrings between two character variables", which is not a duplicate of this question) does something other than what you need. It solves the longest common subsequence problem, where subsequences are not required to occupy consecutive positions within the original sequences.

What you have is the longest common substring problem, for which you could use this algorithm. Here is the code, assuming that there is a unique (in terms of length) longest common substring:

a <- "WWDUISBURG-HAMBORNS"
b <- "QQQQQQDUISBURG (-31.7.29)S"

A <- strsplit(a, "")[[1]]
B <- strsplit(b, "")[[1]]

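# L[i, j] will hold the length of the common substring ending at A[i] and B[j]
L <- matrix(0, length(A), length(B))
# positions where the characters match, processed row by row
ones <- which(outer(A, B, "=="), arr.ind = TRUE)
ones <- ones[order(ones[, 1]), , drop = FALSE]
for (i in seq_len(nrow(ones))) {
  v <- ones[i, , drop = FALSE]
  # extend the match that ends at the previous diagonal cell, if there is one
  L[v] <- ifelse(any(v == 1), 1, L[v - 1] + 1)
}
# the longest match ends in the first row holding max(L); step back max(L) characters
paste0(A[(-max(L) + 1):0 + which(L == max(L), arr.ind = TRUE)[1]], collapse = "")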
# [1] "DUISBURG"
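
The same dynamic-programming idea can be wrapped in a reusable function. The following is only a sketch: the name longest_common_substring is hypothetical, it keeps just one row of the table at a time, and when several substrings tie for the longest it returns the one that ends first in a.

```r
# Hypothetical helper: longest common substring of two strings
longest_common_substring <- function(a, b) {
  A <- strsplit(a, "")[[1]]
  B <- strsplit(b, "")[[1]]
  if (length(A) == 0 || length(B) == 0) return("")
  # prev[j + 1] holds the match length ending at A[i - 1] and B[j];
  # prev[1] is a sentinel 0 for "no previous character"
  prev <- integer(length(B) + 1)
  best_len <- 0L
  best_end <- 0L
  for (i in seq_along(A)) {
    cur <- integer(length(B) + 1)
    # positions in B that equal the current character of A
    hits <- which(B == A[i])
    # each hit extends the match that ended one step back on the diagonal
    cur[hits + 1] <- prev[hits] + 1L
    if (length(hits) > 0 && max(cur) > best_len) {
      best_len <- max(cur)
      best_end <- i
    }
    prev <- cur
  }
  if (best_len == 0L) return("")
  paste(A[(best_end - best_len + 1):best_end], collapse = "")
}

longest_common_substring("WWDUISBURG-HAMBORNS", "QQQQQQDUISBURG (-31.7.29)S")
# [1] "DUISBURG"
```

Unlike the snippet above, this version returns "" when the two strings share no characters.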

Find the longest common starting substring in a set of strings

It's a matter of taste, but this is a simple JavaScript version.
It sorts a copy of the array and then looks only at the first and last items.

//longest common starting substring in an array

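function sharedStart(array){
    // an empty array has no common start
    if (array.length === 0) return '';
    // sort a copy so the original array is left untouched
    var A= array.concat().sort(),
        a1= A[0], a2= A[A.length-1], L= a1.length, i= 0;
    // advance while the first and last items agree
    while(i<L && a1.charAt(i)=== a2.charAt(i)) i++;
    return a1.substring(0, i);
}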

DEMOS

sharedStart(['interspecies', 'interstelar', 'interstate'])  //=> 'inters'
sharedStart(['throne', 'throne']) //=> 'throne'
sharedStart(['throne', 'dungeon']) //=> ''
sharedStart(['cheese']) //=> 'cheese'
sharedStart([]) //=> ''
sharedStart(['prefix', 'suffix']) //=> ''

