Difference of Two Character Vectors with Substring

Compare two character vectors in R

Here are some basics to try out:

> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE  TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog"   "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"

Similarly, you could get counts simply as:

> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2

How to compare characters of two strings at each index?

apply(do.call(rbind, strsplit(c(string1, string2), "")), 2, function(x){
    length(unique(x[!x %in% "_"])) == 1
})
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

You could also slightly modify Rich's deleted answer:

Reduce(f = function(s1, s2){
    s1 == s2 | s1 == "_" | s2 == "_"
},
x = strsplit(c(string1, string2), ""))
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Note that the first approach also allows comparison of more than two strings.
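
To make this reproducible, here is a minimal sketch with made-up inputs (s1, s2 and s3 are my own example strings, not the ones from the question), applying the first approach to three strings at once:

```r
# Made-up example strings of equal length; "_" marks a wildcard position
s1 <- "AB_DEFGH"
s2 <- "ABCDEFXH"
s3 <- "ABC_EFGH"

# One row per string, one column per character position
chars <- do.call(rbind, strsplit(c(s1, s2, s3), ""))

# A position agrees when, ignoring "_", exactly one distinct character remains
same <- apply(chars, 2, function(x) length(unique(x[x != "_"])) == 1)
same
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
```

Two things to keep in mind: the strings need to have the same length for rbind to line them up, and a position where every string has "_" comes out as FALSE, because no character is left to compare.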

Truncation and merging values in two character vectors

Consider the following approach, which uses the vectorized functions ifelse, substr, and regexpr (i.e., no apply loops):

newV1 <- ifelse(substr(V1, 24, 24) == ",",         # CONDITIONALLY CHECK 24TH CHARACTER
                substr(V1, 1, regexpr(",", V1)-1), # EXTRACT UNTIL 24TH CHARACTER
                substr(V1, 1, 
                       regexpr(" (?=[^ ]+$)", 
                               substr(V1, 1, 24), 
                               perl=TRUE)-1)     # EXTRACT UNTIL LAST SPACE BEFORE 24TH CHAR
                )
newV1
# [1] "377 Peninsula St. Ogden" "8532 West Lyme St."     
# [3] "43 E. Hilltop Street"    "95 Newcastle St."       
# [5] "7276 Rose St."        

newV2 <- paste(ifelse(substr(V1, 24, 24) == ",",   # CONDITIONALLY CHECK 24TH CHARACTER
               substr(V1, regexpr(",", V1)+1, 
                      nchar(V1)),                  # EXTRACT AFTER 24TH CHARACTER
               substr(V1, 
                      regexpr(" (?=[^ ]+$)", 
                              substr(V1, 1, 24), 
                              perl=TRUE)+1, 
                      nchar(V1))),               # EXTRACT AFTER LAST SPACE BEFORE 24TH CHAR
               V2)                               # PASTE V2 VECTOR ELEMENTWISE
newV2
# [1] "UT 84404"                "Chesterfield, VA 23832" 
# [3] "Hilliard,OH 43026"       "Hendersonville,NC 28792"
# [5] "Greenville,NC 27834"
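
The piece doing most of the work above is the lookahead pattern " (?=[^ ]+$)", which matches the last space in a string. Here is a small sketch on a made-up address (x and n are assumptions for illustration, with n playing the role of the 24-character limit):

```r
x <- "12 Main Road Springfield"   # made-up input
n <- 15                           # made-up width limit

head_part <- substr(x, 1, n)      # "12 Main Road Sp"

# A space followed only by non-space characters up to the end, i.e. the last space
cut <- as.integer(regexpr(" (?=[^ ]+$)", head_part, perl = TRUE))
cut
#[1] 13

left  <- substr(x, 1, cut - 1)          # everything before that space
right <- substr(x, cut + 1, nchar(x))   # everything after it
left
#[1] "12 Main Road"
right
#[1] "Springfield"
```

If the first n characters contain no space at all, regexpr returns -1 and left comes back as an empty string, so that case may need separate handling.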

Comparing string vectors and quantifying differences

I've arrived at a fairly easy answer to my own question: the Levenshtein distance, which is adist() in R.

Long story short:

df$c <- 1 - diag(adist(df$a, df$b, fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This does the trick.

> df
           a          b   c
1 NEWYORK001 NEWYORK001 1.0
2 ORLANDO002    ORLANDO 0.7
3  BOSTON003  BOSTON003 1.0
4 CHICAGO004 CHICAGO005 0.9
5 ATLANTA005 005ATLANTA 0.7

Update:

Running the function on one of my data sets returns a cute result (that made my inner nerd chuckle a bit):

Error: cannot allocate vector of size 1650.7 Gb

So I guess it's another apply() loop for adist(). Computing the whole distance matrix just to take its diagonal is... well, fairly inefficient.

df$c <- 1 - apply(cbind(df$a, df$b),1, function(x) adist(x[1], x[2], fixed = F)) / apply(cbind(nchar(df$a), nchar(df$b)), 1, max)

This modification yields very satisfying results.
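
The same row-by-row idea can also be written with mapply and pmax. This is only a sketch of an alternative spelling, using a few rows from the example above; I have not benchmarked it against the apply() version:

```r
# A few rows from the example above
df <- data.frame(
    a = c("NEWYORK001", "ORLANDO002", "CHICAGO004"),
    b = c("NEWYORK001", "ORLANDO", "CHICAGO005"),
    stringsAsFactors = FALSE
)

# One adist() call per row, so no n-by-n matrix is ever allocated
d <- mapply(function(x, y) adist(x, y, fixed = FALSE), df$a, df$b,
            USE.NAMES = FALSE)

# pmax() is the vectorized row-wise maximum of the two string lengths
df$c <- 1 - d / pmax(nchar(df$a), nchar(df$b))
df
```

pmax(nchar(df$a), nchar(df$b)) does the same job as apply(cbind(...), 1, max) without building an intermediate matrix.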

R: Filter vectors by 'two-way' partial match

I originally thought ?pmatch might be handy, but your edit clarifies you don't just want to match the start of items. Here's a function that should work:

remover <- function(x,y) {
    pmx <- sapply(x, grep, x=y)
    pmy <- sapply(y, grep, x=x)

    hit <- unlist(c(pmx,pmy))

    list(
        x[!(seq_along(x) %in% hit)],
        y[!(seq_along(y) %in% hit)]
    )
}

remover(x,y)
#[[1]]
#character(0)
#
#[[2]]
#[1] "nomatch"

It correctly does nothing when no match is found (thanks @Frank for picking up the earlier error):

remover("yo","nomatch")
#[[1]]
#[1] "yo"
# 
#[[2]]
#[1] "nomatch"
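
One thing to watch: hit pools positions from both vectors, but pmx holds positions in y while pmy holds positions in x, so a hit on one side can end up dropping an unrelated element on the other. Below is a sketch of a variant that keeps the two index sets apart. remover2 is a made-up name, and fixed = TRUE (literal rather than regex matching) is my own assumption about what is wanted:

```r
remover2 <- function(x, y) {
    # For each x[i]: positions in y that contain x[i]
    x_in_y <- lapply(x, grep, x = y, fixed = TRUE)
    # For each y[j]: positions in x that contain y[j]
    y_in_x <- lapply(y, grep, x = x, fixed = TRUE)

    # Drop an element if it was found inside the other vector,
    # or if something from the other vector was found inside it
    drop_x <- c(which(lengths(x_in_y) > 0), unlist(y_in_x))
    drop_y <- c(which(lengths(y_in_x) > 0), unlist(x_in_y))

    list(
        x[!(seq_along(x) %in% drop_x)],
        y[!(seq_along(y) %in% drop_y)]
    )
}

res <- remover2(c("a", "zzz"), c("q", "xa"))
res
#[[1]]
#[1] "zzz"
#
#[[2]]
#[1] "q"
```

With these made-up inputs, "a" is contained in "xa", so both of those go and the unrelated "zzz" and "q" stay.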

Find common substrings between two character variables

Here's a CRAN package for that:

library(qualV)

sapply(seq_along(a), function(i)
    paste(LCS(strsplit(a[i], '')[[1]], strsplit(b[i], '')[[1]])$LCS,
          collapse = ""))
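
One caveat, as far as I can tell from the qualV documentation: LCS() computes the longest common subsequence, whose characters need not be adjacent in either string. If you need a contiguous common substring instead, here is a base R sketch using the usual dynamic-programming table, one row at a time (longest_common_substring is a name I made up):

```r
longest_common_substring <- function(s, t) {
    a <- strsplit(s, "")[[1]]
    b <- strsplit(t, "")[[1]]
    if (length(a) == 0 || length(b) == 0) return("")

    best_len <- 0L   # length of the longest run found so far
    best_end <- 0L   # position in s where that run ends
    prev <- integer(length(b) + 1L)   # previous table row, shifted by one column

    for (i in seq_along(a)) {
        cur <- integer(length(b) + 1L)
        for (j in seq_along(b)) {
            if (a[i] == b[j]) {
                # Extend the run that ended at a[i - 1], b[j - 1]
                cur[j + 1L] <- prev[j] + 1L
                if (cur[j + 1L] > best_len) {
                    best_len <- cur[j + 1L]
                    best_end <- i
                }
            }
        }
        prev <- cur
    }

    if (best_len == 0L) "" else substr(s, best_end - best_len + 1L, best_end)
}

longest_common_substring("xabcdy", "zabcdq")
#[1] "abcd"
```

To apply it pairwise over two vectors a and b, something like mapply(longest_common_substring, a, b) should do. When several substrings tie for the longest, this returns the one that ends first in s.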

Substring to character comparison counterintuitive results in Julia 1.0

sq[1] returns a Char. Use sq[1:1] to get a String.

You can check what sq[1] returns in REPL:

julia> sq[1]
'a': ASCII/Unicode U+0061 (category Ll: Letter, lowercase)

so you have:

julia> sq[1] == 'a'
true

as this compares Char to Char.

On the other hand, with sq[1:1] you have:

julia> sq[1:1]
"a"

julia> sq[1:1] == "a"
true

The reason for this behavior is that strings are treated as collections of characters. Similarly, if you have an array x = [1,2,3], you do not expect that x[1] == [1] but rather that x[1] == 1.

Remove entries from string vector containing specific characters in R

We can use grep to find which values in y match any of the patterns in x, and exclude them using ! and %in%:

y[!y %in% grep(paste0(x, collapse = "|"), y, value = TRUE)]

#[1] "kot" "kk"  "y"

Or, even better, use grepl, as it returns a logical vector directly:

y[!grepl(paste0(x, collapse = "|"), y)]

A concise version with grep, using the invert and value parameters:

grep(paste0(x, collapse = "|"), y, invert = TRUE, value = TRUE)
#[1] "kot" "kk"  "y"
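
Since paste0(x, collapse = "|") builds a regular expression, characters such as "." or "(" in x would be interpreted as regex syntax. Here is a self-contained sketch with made-up x and y (not the data from the question) that also shows a literal-matching alternative:

```r
x <- c("a", "b")                         # made-up characters to exclude
y <- c("kot", "kk", "y", "abc", "bob")   # made-up candidates

pattern <- paste0(x, collapse = "|")     # "a|b", interpreted as a regex

keep_grepl <- y[!grepl(pattern, y)]
keep_grep  <- grep(pattern, y, invert = TRUE, value = TRUE)

# Literal matching: one grepl() per element of x, combined with OR
hit <- Reduce(`|`, lapply(x, grepl, x = y, fixed = TRUE))
keep_fixed <- y[!hit]

keep_fixed
#[1] "kot" "kk"  "y"
```

All three give the same result here; they would only differ if x contained regex metacharacters.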

Match substring of two vectors and create a new vector combining them

A possible solution:

a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))

df <- merge(stack(a), stack(b), by = 'ind')
paste0(substr(df$values.x, 1, 1), df$values.y)

which gives:

[1] "1234" "1238" "2234" "2238" "4325" "4326" "2342"

A second alternative:

a <- setNames(a, substr(a, 2, 3))
b <- setNames(b, substr(b, 1, 2))

l <- lapply(names(a), function(x) b[x == names(b)])
paste0(substr(rep(a, lengths(l)), 1, 1), unlist(l))

which gives the same result and is considerably faster (see the benchmark).
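
For anyone who wants to step through the second alternative, here it is on a small made-up input (these a and b are my own example values, chosen so that characters 2-3 of a overlap with characters 1-2 of b):

```r
a <- c("123", "223", "432")   # made-up
b <- c("234", "238", "325")   # made-up

# Name each element by the part that has to overlap
a <- setNames(a, substr(a, 2, 3))   # names: "23" "23" "32"
b <- setNames(b, substr(b, 1, 2))   # names: "23" "23" "32"

# For every element of a, collect the elements of b with the same key
l <- lapply(names(a), function(x) b[x == names(b)])

# Repeat each a once per match, keep its first character, append the match
res <- paste0(substr(rep(a, lengths(l)), 1, 1), unlist(l))
res
#[1] "1234" "1238" "2234" "2238" "4325"
```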

R: How to use setdiff on two string vectors by only comparing the first 3 tab delimited items in each string?

For extracting the first three columns (not sure why you need this as a long string rather than a dataframe...), I would use beg2char() from the qdap library. (Although, if they are all the same length base substr() will work fine.)

beg2char(list1, '\t', 3) # Will extract from the beginning up to the third tab delimiter

Then rather than setdiff I would simply use %in% to check if the substring of the element in list2 matches any of the elements in list1.

beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3) # will give you TRUE/FALSE
list2[!(beg2char(list2, '\t', 3) %in% beg2char(list1, '\t', 3))]

This gives the full elements of list2 whose substrings do not exist in list1.
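
If you would rather avoid the qdap dependency, a base R sketch along the same lines could look like this. first3 is a made-up helper and list1/list2 are made-up data; it is meant as an approximation of the beg2char call above, not a copy of how qdap does it:

```r
# Keep only the first three tab-delimited fields of each string
first3 <- function(s) {
    vapply(strsplit(s, "\t", fixed = TRUE),
           function(p) paste(head(p, 3), collapse = "\t"),
           character(1))
}

list1 <- c("a\tb\tc\tX", "d\te\tf\tY")   # made-up
list2 <- c("a\tb\tc\tZ", "g\th\ti\tW")   # made-up

# Elements of list2 whose first three fields are absent from list1
only_in_2 <- list2[!(first3(list2) %in% first3(list1))]
only_in_2
#[1] "g\th\ti\tW"
```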