Compare two character vectors in R
Here are some basics to try out:
> A = c("Dog", "Cat", "Mouse")
> B = c("Tiger","Lion","Cat")
> A %in% B
[1] FALSE TRUE FALSE
> intersect(A,B)
[1] "Cat"
> setdiff(A,B)
[1] "Dog" "Mouse"
> setdiff(B,A)
[1] "Tiger" "Lion"
Similarly, you could get counts simply as:
> length(intersect(A,B))
[1] 1
> length(setdiff(A,B))
[1] 2
> length(setdiff(B,A))
[1] 2
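The calls above can be bundled into one summary. This is a minimal sketch, assuming a hypothetical helper named compare_sets (it is not a base R function):

```r
# Hypothetical helper (not part of base R): summarise how two vectors overlap.
compare_sets <- function(a, b) {
  list(
    common    = intersect(a, b),  # values present in both
    only_in_a = setdiff(a, b),    # values in a that are absent from b
    only_in_b = setdiff(b, a),    # values in b that are absent from a
    same_set  = setequal(a, b)    # TRUE if both hold the same distinct values
  )
}

A <- c("Dog", "Cat", "Mouse")
B <- c("Tiger", "Lion", "Cat")
res <- compare_sets(A, B)
str(res)
```

Keep in mind that these set functions discard duplicated values, so length() on their results counts distinct values only.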
Comparing character vectors in R to find unique and/or missing values
setdiff(x, y) will do the job for you: it returns the values of x that do not appear in y.
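Because setdiff is asymmetric, finding values missing on either side takes two calls. A small sketch with made-up vectors x and y (hypothetical data, not from the question):

```r
# Hypothetical example data
x <- c("a", "b", "c", "d")
y <- c("c", "d", "e")

missing_from_y <- setdiff(x, y)  # in x but not in y
missing_from_x <- setdiff(y, x)  # in y but not in x

# Values that appear in exactly one of the two vectors
either_side <- union(missing_from_y, missing_from_x)
either_side
```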
How to compare two character vectors in R
You may use match:
data.frame(id = df$id[match(vec, df$name)], vec)
#    id    vec
#1   NA    age
#2 1232    gpa
#3 1988  class
#4   NA sports
Or merge the data frames using a left join:
merge(data.frame(name = vec), df, all.x = TRUE, by = "name")
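The answer does not show df and vec, so here is a self-contained sketch with hypothetical data chosen to be consistent with the output above. Both approaches should attach the same id to each name:

```r
# Hypothetical inputs, reconstructed to be consistent with the output shown above
df  <- data.frame(id = c(1232, 1988), name = c("gpa", "class"),
                  stringsAsFactors = FALSE)
vec <- c("age", "gpa", "class", "sports")

# match() keeps the order of vec and yields NA where a name is absent from df
by_match <- data.frame(id = df$id[match(vec, df$name)], vec)

# merge() with all.x = TRUE is a left join; by default it sorts on the key,
# so the row order may differ from vec
by_merge <- merge(data.frame(name = vec), df, all.x = TRUE, by = "name")

by_match
by_merge
```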
Compare the similarity of character vectors by position
Using Levenshtein (edit) distance, or rather 1 minus the distance scaled by string length:
> 1-adist(df$sequence)/4
     [,1] [,2] [,3] [,4]
[1,] 1.00 0.75 0.25 0.25
[2,] 0.75 1.00 0.00 0.25
[3,] 0.25 0.00 1.00 0.50
[4,] 0.25 0.25 0.50 1.00
(assuming all strings have length 4).
Edit: I misunderstood your problem. Levenshtein distance finds the best alignment, allowing insertions and deletions to shift characters where necessary. You want an exact position-by-position match; in that case...
sapply(df$sequence, function(x) {
  sapply(df$sequence, function(y) {
    sum(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
  })
})/4
     ACAC AGAC CCTT CGCT
ACAC 1.00 0.75 0.25 0.00
AGAC 0.75 1.00 0.00 0.25
CCTT 0.25 0.00 1.00 0.50
CGCT 0.00 0.25 0.50 1.00
or for the other vector provided in the comments:
sapply(df$sequence, function(x) {
  sapply(df$sequence, function(y) {
    sum(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
  })
})/4
     GACC AAAC ACAC GCCA
GACC 1.00 0.50 0.25 0.50
AAAC 0.50 1.00 0.75 0.00
ACAC 0.25 0.75 1.00 0.25
GCCA 0.50 0.00 0.25 1.00
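To make the difference between the two measures concrete, here is a self-contained sketch. The helper name positional_similarity is hypothetical, and the sequences are the four from the first table, entered by hand:

```r
# Hypothetical helper: fraction of positions at which two equal-length
# strings hold the same character.
positional_similarity <- function(x, y) {
  stopifnot(nchar(x) == nchar(y))
  mean(strsplit(x, "")[[1]] == strsplit(y, "")[[1]])
}

seqs <- c("ACAC", "AGAC", "CCTT", "CGCT")

# Pairwise matrix; Vectorize() lets outer() call the scalar helper pair by pair
sim <- outer(seqs, seqs, Vectorize(positional_similarity))
dimnames(sim) <- list(seqs, seqs)
sim

# A shifted string: edit distance can realign it, position-wise comparison cannot
edit_dist <- adist("ABCD", "BCDA")[1, 1]
pos_sim   <- positional_similarity("ABCD", "BCDA")
```

Here edit_dist is 2 (delete the leading "A", append it at the end), while pos_sim is 0 because no position matches.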
How to compare characters of two string at each index?
apply(do.call(rbind, strsplit(c(string1, string2), "")), 2, function(x) {
  length(unique(x[!x %in% "_"])) == 1
})
#[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
You could also slightly modify Rich's deleted answer:
Reduce(f = function(s1, s2) {
  s1 == s2 | s1 == "_" | s2 == "_"
},
x = strsplit(c(string1, string2), ""))
#[1] TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE
Note that the first approach also allows comparing more than two strings.
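The original string1 and string2 are not shown, so the sketch below uses hypothetical inputs of the same length (8 characters) in which "_" acts as a wildcard. It spells out the two-string comparison step by step and also reports which positions disagree:

```r
# Hypothetical inputs; "_" is treated as a wildcard that matches anything
string1 <- "AB_DEFGH"
string2 <- "ABCDE_XH"

chars1 <- strsplit(string1, "")[[1]]
chars2 <- strsplit(string2, "")[[1]]
stopifnot(length(chars1) == length(chars2))  # comparison is position by position

# TRUE where the characters agree or either side is the wildcard
agree <- chars1 == chars2 | chars1 == "_" | chars2 == "_"
agree

# Indices of the positions that genuinely differ
mismatches <- which(!agree)
mismatches
```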
Vectorised comparison of strings to a single value in Rcpp
Following up on our quick discussion, here is a very simple solution, as the problem (as posed) is simple: no regular expressions, no fanciness. Just loop over all elements and return as soon as a match is found, else bail with false.
Code
#include <Rcpp.h>

// Return true as soon as any element of sv equals txt
// [[Rcpp::export]]
bool contains(std::vector<std::string> sv, std::string txt) {
    for (const auto& s : sv) {
        if (s == txt) return true;
    }
    return false;
}

/*** R
sv <- c("a", "b", "c")
contains(sv, "foo")
sv[2] <- "foo"
contains(sv, "foo")
*/
Demo
> Rcpp::sourceCpp("~/git/stackoverflow/66895973/answer.cpp")
> sv <- c("a", "b", "c")
> contains(sv, "foo")
[1] FALSE
> sv[2] <- "foo"
> contains(sv, "foo")
[1] TRUE
>
And that is really just shooting from the hip before looking for either what we may already have in the (roughly) 100k lines of Rcpp, or what the STL may have...
The same will work for your earlier example of named attributes, as you can of course do the same with a CharacterVector, and/or use the conversion from it to std::vector<std::string> we used here, or... If you have an older compiler, switch the for loop from C++11 range-based style to K&R style (a classic indexed loop).
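For comparison when the check is done from the R side rather than inside C++, base R offers the same test without compiled code. A brief sketch using the same data as the demo:

```r
sv <- c("a", "b", "c")
found_before <- "foo" %in% sv   # is "foo" among the elements?

sv[2] <- "foo"
found_after <- "foo" %in% sv

# any(sv == "foo") behaves alike for vectors without missing values;
# %in% never returns NA, which makes it the safer default
found_any <- any(sv == "foo")

c(found_before, found_after, found_any)
```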
Compare two vectors within a data frame with %in% with R
Another way you could achieve this (using your original approach with strsplit) is to do it rowwise() and sum the logical test:
T1 %>%
  rowwise() %>%
  filter(sum(unlist(strsplit(Col2, ",")) %in% c("a", "e", "g")) >= 1)
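If dplyr is not available, the same row filter can be sketched in base R. T1 is not shown in the answer, so the data frame below is hypothetical; only the column name Col2 and the target letters come from the code above:

```r
# Hypothetical data frame; Col2 holds comma-separated values
T1 <- data.frame(Col1 = 1:3,
                 Col2 = c("a,b,c", "d,f", "g,h"),
                 stringsAsFactors = FALSE)
targets <- c("a", "e", "g")

# One logical per row: does any comma-separated piece appear in targets?
keep <- vapply(strsplit(T1$Col2, ","),
               function(parts) any(parts %in% targets),
               logical(1))

T1_filtered <- T1[keep, ]
T1_filtered
```

Using any() here is equivalent to the sum(...) >= 1 test in the dplyr version.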
How to fuzzy match two character vectors in r
With stringr, use str_detect, or str_count if you want a real count:
library(stringr)
library(dplyr)
df %>%
  mutate(fruits_in_list = +(str_detect(fruits_eat, paste0(fruits_list, collapse = "|"))),
         count = str_count(fruits_eat, paste0(fruits_list, collapse = "|")))
     id                                  fruits_eat fruits_in_list count
1  Jack              XXappleYYY,lemon,orange,pitaya              1     3
2  Rose Navel orange,Blood orange,watermelon,cherry              1     3
3 Biden                        pitaya,cherry,banana              0     0
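The same substring matching can be sketched in base R with grepl and gregexpr. The fruits_eat values are copied from the output above; fruits_list is not shown in the answer, so the one below is an assumption chosen to be consistent with the counts:

```r
df <- data.frame(
  id = c("Jack", "Rose", "Biden"),
  fruits_eat = c("XXappleYYY,lemon,orange,pitaya",
                 "Navel orange,Blood orange,watermelon,cherry",
                 "pitaya,cherry,banana"),
  stringsAsFactors = FALSE
)
# Assumed list (not given in the answer)
fruits_list <- c("apple", "lemon", "orange", "watermelon")

# One alternation pattern: matches any listed fruit anywhere in the string
pattern <- paste0(fruits_list, collapse = "|")

# 1/0 flag: does the string contain at least one listed fruit?
df$fruits_in_list <- as.integer(grepl(pattern, df$fruits_eat))

# Number of matches per string
df$count <- lengths(regmatches(df$fruits_eat, gregexpr(pattern, df$fruits_eat)))
df
```

Because the pattern is an unanchored regular expression, "XXappleYYY" counts as a match for "apple"; fruit names containing regex metacharacters would need escaping first.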