Extracting Common Character Strings from Multiple Vectors of Different Lengths


How to find common elements from multiple vectors?

There might be a cleverer way to go about this, but

intersect(intersect(a,b),c)

will do the job.

EDIT: More cleverly, and more conveniently if you have a lot of arguments:

Reduce(intersect, list(a,b,c))
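A minimal sketch of both forms, assuming three made-up character vectors of different lengths (the values of a, b and c are illustrative, not from the original question):

```r
# Hypothetical input vectors of different lengths (illustrative values only)
a <- c("apple", "banana", "cherry", "date")
b <- c("banana", "date", "fig")
c <- c("date", "banana", "grape", "kiwi", "lemon")

# Nested calls work, but get unwieldy as the number of vectors grows
common_nested <- intersect(intersect(a, b), c)

# Reduce folds intersect over the list, pairwise from left to right
common_reduce <- Reduce(intersect, list(a, b, c))

print(common_reduce)
# [1] "banana" "date"
```

The result keeps the order of the first vector, since intersect returns the elements of its first argument that also occur in the second.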

Extracting a number of a string of varying lengths

We can use str_extract from the stringr package with the pattern \\d+, which matches one or more digits. It can also be written as [0-9]+.

as.numeric(str_extract(testVector, "\\d+"))
#[1] 10 6 4 15

If there are multiple numbers in a string, we can use str_extract_all, which will return a list.

This can be also done with base R (no external packages used)

as.numeric(regmatches(testVector, regexpr("\\d+", testVector)))
#[1] 10 6 4 15

Or using gsub from base R

as.numeric(gsub("\\D+", "", testVector))
#[1] 10 6 4 15
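One caveat worth illustrating: stripping every non-digit with gsub runs the digits of separate numbers together. The sketch below uses a hypothetical vector x (testVector is not shown in this thread) and base R's gregexpr to return every number per string as a list instead:

```r
# Hypothetical input; the original testVector is not shown in this thread
x <- c("item10", "a6 and b12", "no digits")

# gregexpr finds all matches per string; regmatches extracts them as a list
all_nums <- lapply(regmatches(x, gregexpr("[0-9]+", x)), as.numeric)

# Stripping non-digits instead glues "6" and "12" together into 612,
# and a string without digits becomes NA
glued <- as.numeric(gsub("\\D+", "", x))
```

Here all_nums[[2]] holds both 6 and 12, while glued[2] is 612, so the list form is the safer choice when a string may contain more than one number.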

Incidentally, some package functions simply wrap gsub. For example, the body of tidyr's extract_numeric is:

function (x) 
as.numeric(gsub("[^0-9.-]+", "", as.character(x)))

So, if we need a function, we can create one (without using any external packages)

ext_num <- function(x) {
  as.numeric(gsub("\\D+", "", x))
}
ext_num(testVector)
#[1] 10 6 4 15

How can I extract matched part of multiple strings?

Split each string with strsplit, then intersect the pieces across all strings using Reduce. You can then piece the result back together with paste.

paste(Reduce(intersect, strsplit(data.dir, "\\\\")), collapse="\\")
#[1] "C:\\data\\files"

As @g-grothendieck notes, this will fail when a component matches again after an earlier mismatch, because intersect ignores position:

data.dir <- c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\") 

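An ugly hack might be to build every cumulative prefix of each path, intersect those, and keep the longest one, something like:

tail(Reduce(intersect, lapply(strsplit(data.dir, "\\\\"),
  function(x) sapply(seq_along(x), function(y) paste(x[1:y], collapse="\\")))), 1)

...which will deal with either case.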
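A position-aware alternative in base R is to compare path components level by level and stop at the first mismatch. This is only a sketch: the helper name common_dir is hypothetical, and it assumes Windows-style paths separated by backslashes.

```r
# Hypothetical helper: longest shared leading directory of Windows-style paths
common_dir <- function(paths) {
  # fixed = TRUE splits on a literal backslash, no regex escaping needed
  parts <- strsplit(paths, "\\", fixed = TRUE)
  n <- min(lengths(parts))
  if (n == 0) return("")
  # One column per path, truncated to the depth of the shortest path
  m <- matrix(vapply(parts, function(p) p[seq_len(n)], character(n)), nrow = n)
  # TRUE for each level at which every path has the same component
  same <- apply(m, 1, function(r) length(unique(r)) == 1)
  # Number of leading levels that agree across all paths
  k <- if (all(same)) n else which(!same)[1] - 1
  paste(m[seq_len(k), 1], collapse = "\\")
}

common_dir(c("C:\\a\\b\\c\\", "C:\\a\\X\\c\\"))
# [1] "C:\\a"
```

Because the comparison stops at the first level that differs, the trailing c component that both paths happen to share is not picked up.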

Alternatively, use dirname if you only ever have one extra directory level:

unique(dirname(data.dir))
#[1] "C:/data/files"

Extract character from string based on character in another vector in R

I think str_sub only works on single strings, but for the second string strsplit gives you a vector of 2 strings.

This would do the job in the case the separator only appears once in every string:

sapply(strsplit(a,split=b, fixed=FALSE), function(l) str_sub(l[[1]],-1,-1))

Find common substrings between two character variables

Here's a CRAN package for that:

library(qualV)
sapply(seq_along(a), function(i)
paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
collapse = ""))

R intersecting strings

This works, but I'm not sure how robust it is given your question is a little vague.

Reduce(intersect, strsplit(basketball," "))
#[1] "MISS" "Pullup" "Jump" "Shot"

How do I select the longest string from each vector of strings in a list of vectors?

You can try:

lapply(lst, function(x) x[which.max(nchar(x))])

[1] "The quick brown fox"

[1] "And forever in peace may she wave"
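If a plain character vector is more convenient than a list, vapply can be swapped in for lapply. A small sketch, assuming a made-up lst (the original list is not shown here):

```r
# Hypothetical list of character vectors (illustrative values only)
lst <- list(c("a", "abc", "ab"), c("hello", "hi"), c("xx", "yy"))

# vapply returns a character vector and checks that each result is one string;
# which.max picks the first element when several share the maximum length
longest <- vapply(lst, function(x) x[which.max(nchar(x))], character(1))

print(longest)
# [1] "abc"   "hello" "xx"
```

Note that this version would raise an error for an empty vector inside the list, since there is no string to return for it.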

How can I find the same pattern among strings?

We can split the strings at each character and use intersect to get the common ones.

intersect(strsplit(a, "")[[1]], strsplit(b, "")[[1]])
#[1] "a" "b" "c"

To get the exact output as requested we can paste them together.

paste(intersect(strsplit(a, "")[[1]], strsplit(b, "")[[1]]), collapse = "")
#[1] "abc"

If there are multiple strings we can use Reduce:

a <- "abczzzzz"
b <- "rrrrabckkk"
c <- "dsaqwabc"

paste(Reduce(intersect, strsplit(c(a, b, c), "")), collapse = "")
#[1] "abc"
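Keep in mind that intersecting single characters returns the set of shared characters, which only coincides with a shared substring when those characters happen to be adjacent in every string. If a contiguous match is required, a brute-force base R sketch could look like the following. The helper name longest_common_substring is hypothetical, and the approach is meant for short strings only.

```r
# Hypothetical helper: longest contiguous substring shared by all strings
longest_common_substring <- function(strings) {
  # Every common substring must occur in the shortest string
  s <- strings[which.min(nchar(strings))]
  n <- nchar(s)
  if (n == 0) return("")
  for (len in n:1) {                       # try longer candidates first
    starts <- seq_len(n - len + 1)
    cands <- unique(substring(s, starts, starts + len - 1))
    for (cand in cands) {
      # fixed = TRUE treats the candidate as literal text, not a regex
      if (all(grepl(cand, strings, fixed = TRUE))) return(cand)
    }
  }
  ""
}

longest_common_substring(c("abczzzzz", "rrrrabckkk", "dsaqwabc"))
# [1] "abc"

# The character intersect can overstate the match when order differs
paste(intersect(strsplit("abcxd", "")[[1]], strsplit("dxcba", "")[[1]]), collapse = "")
# [1] "abcxd"
longest_common_substring(c("abcxd", "dxcba"))
# [1] "a"
```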

