Compare Two Character Vectors in R

Compare two character vectors in R

Here are some basics to try out:

> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE  TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog"   "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"

Similarly, wrap any of these in length() to get counts:

> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2

Comparing character vectors in R to find unique and/or missing values

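setdiff(x, y)

will do the job for you: it returns the unique elements of x that do not appear in y. It is not symmetric, so use setdiff(y, x) for the values found only in y.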
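Because setdiff() is asymmetric, finding the values unique to either side takes two calls. A minimal sketch, reusing the A and B vectors from the first example; the names only_in_A, only_in_B and sym_diff are introduced here for illustration only:

```r
A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")

only_in_A <- setdiff(A, B)   # in A but missing from B
only_in_B <- setdiff(B, A)   # in B but missing from A

# Symmetric difference: everything that appears in exactly one vector
sym_diff <- union(only_in_A, only_in_B)
sym_diff
# [1] "Dog"   "Mouse" "Tiger" "Lion"
```

Note that the set functions drop duplicates, so repeated values in the inputs are reported once.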

How to compare two character vectors in R

You can use match() to look up each element of vec in df$name; elements with no match come back as NA:

data.frame(id = df$id[match(vec, df$name)], vec)

#    id    vec
#1   NA    age
#2 1232    gpa
#3 1988  class
#4   NA sports

Or merge the data frames with a left join:

merge(data.frame(name = vec), df, all.x = TRUE, by = "name")
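The original df and vec are not shown, so here is a self-contained sketch with hypothetical data chosen to reproduce the ids in the output above (the "height" row and its id are made up). It runs both approaches and checks that they agree:

```r
# Hypothetical inputs; only "gpa" and "class" exist in df$name
df  <- data.frame(id = c(1232, 1988, 4071),
                  name = c("gpa", "class", "height"))
vec <- c("age", "gpa", "class", "sports")

# match() gives the position of each vec element in df$name (NA if absent)
idx      <- match(vec, df$name)
by_match <- data.frame(id = df$id[idx], vec)

# Left join; merge() sorts on the key by default, so row order may differ from vec
by_merge <- merge(data.frame(name = vec), df, all.x = TRUE, by = "name")

# Realign the merged rows to vec before comparing the two results
merged_ids <- by_merge$id[match(vec, as.character(by_merge$name))]
identical(by_match$id, merged_ids)
```

The match() version keeps the order of vec, which is usually what you want when vec drives the lookup.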

Compare the similarity of character vectors by position

Using Levenshtein (edit) distance, or rather 1 minus the distance divided by the string length, so that identical strings score 1:

> 1-adist(df$sequence)/4

     [,1] [,2] [,3] [,4]
[1,] 1.00 0.75 0.25 0.25
[2,] 0.75 1.00 0.00 0.25
[3,] 0.25 0.00 1.00 0.50
[4,] 0.25 0.25 0.50 1.00

(The division by 4 assumes every string has length 4.)
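If the strings are not all the same length, one option is to divide each distance by the longer length of the pair instead of a fixed 4. This is a sketch of that idea, not part of the original answer; seqs stands in for df$sequence and max_len is a name introduced here:

```r
seqs <- c("ACAC", "AGAC", "CCTT", "CGCT")   # stand-in for df$sequence

# Longer length of each pair; edit distance never exceeds it,
# so the similarity stays within [0, 1]
max_len <- outer(nchar(seqs), nchar(seqs), pmax)

sim <- 1 - utils::adist(seqs) / max_len
dimnames(sim) <- list(seqs, seqs)
sim
```

With equal-length inputs this reduces to the 1 - adist(...)/4 form used above.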

Edit: I misunderstood your problem. Levenshtein distance finds the best alignment between two strings, shifting characters through insertions and deletions where that helps. You want an exact position-by-position match; in that case:

sapply(df$sequence,function(x){
  sapply(df$sequence,function(y){
    sum(strsplit(x,"")[[1]]==strsplit(y,"")[[1]])
  })
})/4

     ACAC AGAC CCTT CGCT
ACAC 1.00 0.75 0.25 0.00
AGAC 0.75 1.00 0.00 0.25
CCTT 0.25 0.00 1.00 0.50
CGCT 0.00 0.25 0.50 1.00

Or, running the same code on the other vector provided in the comments:

sapply(df$sequence,function(x){
  sapply(df$sequence,function(y){
    sum(strsplit(x,"")[[1]]==strsplit(y,"")[[1]])
  })
})/4

     GACC AAAC ACAC GCCA
GACC 1.00 0.50 0.25 0.50
AAAC 0.50 1.00 0.75 0.00
ACAC 0.25 0.75 1.00 0.25
GCCA 0.50 0.00 0.25 1.00
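The nested sapply above re-splits every string on each comparison and hard-codes the length 4. A small variant, sketched here under the assumption that all strings share one length, splits each string once and uses mean() so no length needs to be hard-coded; pos_sim is a hypothetical helper name:

```r
# Fraction of positions at which each pair of equal-length strings agrees
pos_sim <- function(seqs) {
  stopifnot(length(unique(nchar(seqs))) == 1)   # assumes equal lengths
  chars <- strsplit(seqs, "")                    # split each string once
  sim <- sapply(chars, function(x) {
    sapply(chars, function(y) mean(x == y))
  })
  dimnames(sim) <- list(seqs, seqs)
  sim
}

m <- pos_sim(c("ACAC", "AGAC", "CCTT", "CGCT"))
m
```

For the first example vector this gives the same matrix as shown above, for instance 0.75 for ACAC vs AGAC and 0 for ACAC vs CGCT.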

How to compare characters of two strings at each index?

apply(do.call(rbind, strsplit(c(string1, string2), "")), 2, function(x){
    length(unique(x[!x %in% "_"])) == 1
})
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

You could also slightly modify Rich's deleted answer:

Reduce(f = function(s1, s2){
    s1 == s2 | s1 == "_" | s2 == "_"
},
x = strsplit(c(string1, string2), ""))
#[1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE

Note that the first approach also works for more than two strings, whereas the Reduce version, as written, only makes sense for a pair.
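The inputs are not shown in the answer, so the following sketch uses hypothetical strings (string1, string2, string3 are made up) in which "_" acts as a wildcard, and applies the first approach to two and then three strings:

```r
# Hypothetical inputs; "_" marks a position that matches anything
string1 <- "AB_DEFGH"
string2 <- "ABC_EFXH"
string3 <- "A_CDEFXH"

# TRUE where all non-wildcard characters in a column are the same
agree_at <- function(strings) {
  chars <- do.call(rbind, strsplit(strings, ""))
  apply(chars, 2, function(x) length(unique(x[!x %in% "_"])) == 1)
}

two   <- agree_at(c(string1, string2))
three <- agree_at(c(string1, string2, string3))
two
# [1]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE FALSE  TRUE
```

One difference worth knowing: a position where every string has "_" yields FALSE here (no characters remain after filtering), while the Reduce version returns TRUE for it.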

Vectorised comparison of strings to single value in Rcpp

Following up on our quick discussion, here is a very simple solution, since the problem (as posed) is simple: no regular expressions, nothing fancy. Just loop over all elements and return true as soon as a match is found, else bail with false.

Code

#include <Rcpp.h>

// [[Rcpp::export]]
bool contains(std::vector<std::string> sv, std::string txt) {
    for (const auto& s : sv) {   // iterate by reference to avoid copying each string
        if (s == txt) return true;
    }
    return false;
}

/*** R
sv <- c("a", "b", "c")
contains(sv, "foo")
sv[2] <- "foo"
contains(sv, "foo")
*/

Demo

> Rcpp::sourceCpp("~/git/stackoverflow/66895973/answer.cpp")

> sv <- c("a", "b", "c")

> contains(sv, "foo")
[1] FALSE

> sv[2] <- "foo"

> contains(sv, "foo")
[1] TRUE

And that is really just shooting from the hip before looking for either what we may already have in the (roughly) 100k lines of Rcpp, or what the STL may have...

The same will work for your earlier example of named attributes, as you can of course do the same with a CharacterVector, and/or use the conversion from it to std::vector<std::string> that we used here, or... If you have an older compiler, switch the for loop from C++11 style to K&R style.
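To sanity-check the compiled function, it can help to have a plain R reference to compare against. A quick sketch in base R; contains_r is a hypothetical name used only here and is not part of Rcpp:

```r
# Base R reference: TRUE if any element of sv equals txt exactly
contains_r <- function(sv, txt) {
  any(sv == txt, na.rm = TRUE)   # na.rm so missing values cannot turn the answer into NA
}

sv <- c("a", "b", "c")
before <- contains_r(sv, "foo")   # FALSE
sv[2] <- "foo"
after <- contains_r(sv, "foo")    # TRUE

# The %in% operator expresses the same test
"foo" %in% sv
```

Unlike the C++ loop, any(sv == txt) compares every element before answering, which only matters for very long vectors.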

Compare two vectors within a data frame with %in% with R

Another way to achieve this (keeping your original strsplit approach) is to work rowwise() and sum the logical test, keeping rows where at least one element matches:

library(dplyr)

T1 %>%
  rowwise() %>% 
  filter(sum(unlist(strsplit(Col2,",")) %in% c("a","e","g")) >= 1)
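The same row filter can be sketched in base R, which avoids the rowwise() step. T1 is not shown in the question, so the data frame below is hypothetical, as are the names targets, parts and keep:

```r
# Hypothetical data: Col2 holds comma-separated values
T1 <- data.frame(Col1 = 1:3,
                 Col2 = c("a,b,c", "b,d", "e,f,g"))
targets <- c("a", "e", "g")

# Split every row once, then test each piece vector against the targets
parts <- strsplit(as.character(T1$Col2), ",")
keep  <- vapply(parts, function(p) any(p %in% targets), logical(1))

result <- T1[keep, ]
result
```

Here any() plays the role of sum(...) >= 1 in the dplyr version.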

How to fuzzy match two character vectors in R

With stringr, use str_detect, or str_count if you want a real count:

library(stringr)
library(dplyr)
df %>% 
  mutate(fruits_in_list = +(str_detect(fruits_eat, paste0(fruits_list, collapse = "|"))),
         count = str_count(fruits_eat, paste0(fruits_list, collapse = "|")))

     id                                  fruits_eat fruits_in_list count
1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
3 Biden                        pitaya,cherry,banana              0     0
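For readers without stringr, roughly the same result can be had in base R. In this sketch fruits_eat is copied from the output above, but fruits_list is not shown in the answer, so the vector below is a guess chosen only because it reproduces the counts:

```r
fruits_eat <- c("XXappleYYY,lemon,orange,pitaya",
                "Navel orange,Blood orange,watermelon,cherry",
                "pitaya,cherry,banana")
fruits_list <- c("apple", "lemon", "orange", "watermelon")  # hypothetical

# One alternation pattern, as in the stringr version
pattern <- paste0(fruits_list, collapse = "|")

# grepl() ~ str_detect(); counting gregexpr() matches ~ str_count()
fruits_in_list <- as.integer(grepl(pattern, fruits_eat))
count <- lengths(regmatches(fruits_eat, gregexpr(pattern, fruits_eat)))

data.frame(fruits_eat, fruits_in_list, count)
```

The pattern is treated as a regular expression in both versions, so list entries containing characters such as "." or "(" would need escaping first.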
