Finding elements that do not overlap between two vectors
Yes, there is a way:
setdiff(list.a, list.b)
# [1] "Mary" "Jack" "Michelle"
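Note that setdiff is one-directional: it returns the elements of its first argument that are missing from the second. A small sketch with made-up vectors (the values below are illustrative, not the asker's data) shows both directions:

```r
# Hypothetical data, chosen only for illustration
list.a <- c("Mary", "Jack", "Michelle", "Tom")
list.b <- c("Tom", "Anna")

only_in_a <- setdiff(list.a, list.b)  # elements of list.a absent from list.b
only_in_b <- setdiff(list.b, list.a)  # elements of list.b absent from list.a

print(only_in_a)
print(only_in_b)
```

If you need the elements unique to either vector, both calls are required.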
How to find common elements from multiple vectors?
There might be a cleverer way to go about this, but
intersect(intersect(a,b),c)
will do the job.
EDIT: More cleverly, and more conveniently if you have a lot of arguments:
Reduce(intersect, list(a,b,c))
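As a quick illustration of the Reduce form, here is a sketch with three invented vectors (the data and the names vecs and common are assumptions for the example). Reduce applies intersect pairwise from left to right, so it is equivalent to the nested call above:

```r
# Hypothetical vectors for illustration
vecs <- list(c(1, 3, 5, 7, 9),
             c(3, 6, 8, 9, 10),
             c(2, 3, 4, 5, 7, 9))

# Equivalent to intersect(intersect(vecs[[1]], vecs[[2]]), vecs[[3]])
common <- Reduce(intersect, vecs)
print(common)
```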
How to find elements common in at least 2 vectors?
It is much simpler than a lot of people are making it look. This should be very efficient.
Put everything into a vector:
x <- unlist(list(a, b, c, d, e))
Look for duplicates
unique(x[duplicated(x)])
# [1] 2 3 1 4 8
and sort if needed.
Note: If there can be duplicates within a list element (which your example does not seem to imply), then replace x with x <- unlist(lapply(list(a, b, c, d, e), unique))
Edit: as the OP has expressed interest in a more general solution where n >= 2, I would do:
which(tabulate(x) >= n)
if the data consists only of positive integers (1, 2, etc.) as in the example. If not:
f <- table(x)
names(f)[f >= n]
This is now not too far from James's solution, but it avoids the costly-ish sort. And it is miles faster than computing all possible combinations.
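The steps above can be tried on a small case. This is a sketch with invented input (vecs and the result names are assumptions, not the asker's data). Note that the table variant returns the values as character strings, because they come from names:

```r
# Hypothetical input vectors
vecs <- list(c(1L, 2L, 3L), c(2L, 3L, 4L), c(3L, 4L, 5L))

# De-duplicate within each vector first, then pool everything
x <- unlist(lapply(vecs, unique))

# Values present in at least 2 vectors
in_two_or_more <- unique(x[duplicated(x)])

# General case: values present in at least n vectors
n <- 3
f <- table(x)
in_at_least_n <- names(f)[f >= n]      # character, not integer

# Positive-integer shortcut with tabulate
via_tabulate <- which(tabulate(x) >= 2)

print(in_two_or_more)
print(in_at_least_n)
print(via_tabulate)
```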
Eliminating partially overlapping parts of 2 vectors in R
We create a function that takes the formula and the vector ('fmla' and 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in the vector that are not among the formula variables (setdiff), builds a pattern by pasting the formula variables together with "|", replaces the first match in each of those values with blank ("") using sub, and returns the updated 'vec'.
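fun1 <- function(fmla, vec) {
  v1 <- all.vars(fmla)                          # variables named in the formula
  v2 <- setdiff(vec, v1)                        # values of vec that are not formula variables
  v3 <- sub(paste(v1, collapse = "|"), "", v2)  # strip the first formula-variable match from each
  vec[vec %in% v2] <- v3                        # write the stripped values back in place
  vec
}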
-checking
> identical(fun1(f1, n1), desired_output)
[1] TRUE
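Since f1, n1 and desired_output come from the question and are not shown here, the following sketch runs the same function on invented input (the formula and the vector are assumptions made for illustration) to make its behaviour visible:

```r
# Same function as above, repeated so the example is self-contained
fun1 <- function(fmla, vec) {
  v1 <- all.vars(fmla)
  v2 <- setdiff(vec, v1)
  v3 <- sub(paste(v1, collapse = "|"), "", v2)
  vec[vec %in% v2] <- v3
  vec
}

# Hypothetical inputs
fmla <- y ~ age + sex
vec  <- c("age", "sex", "age_bmi", "smoker_sex")

# "age" and "sex" are formula variables and stay untouched;
# the other two lose the part that overlaps with a formula variable
res <- fun1(fmla, vec)
print(res)
```

Because sub replaces only the first match, a value containing two formula variables would keep the second one; gsub would be the choice if every occurrence should go.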
Find elements NOT in the intersection of two lists
In Python, I usually prefer a shortcut, the symmetric-difference operator on sets:
set(a) ^ set(b)
{2, 4, 6}
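The snippet above is Python. Base R has no single operator for this, but a rough equivalent can be sketched by combining the two one-sided differences (the vectors a and b and the name sym_diff are assumptions for the example):

```r
# Hypothetical vectors for illustration
a <- c(1, 2, 3, 4)
b <- c(1, 3, 6)

# Elements that are in exactly one of the two vectors
sym_diff <- union(setdiff(a, b), setdiff(b, a))
print(sym_diff)
```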
R: finding intersection between two vectors
This is a floating-point error in R. See the Floating Point Guide for more information.
You can confirm this is the cause, because rounding both vectors first returns what you're looking for:
v1 = c(2, 2.01, 2.02, 2.03, 2.04, 2.05, 2.06, 2.07, 2.08, 2.09, 2.1,
2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.21,
2.22, 2.23, 2.24, 2.25, 2.26, 2.27, 2.28, 2.29, 2.3, 2.31, 2.32,
2.33, 2.34, 2.35, 2.36, 2.37, 2.38, 2.39, 2.4, 2.41, 2.42, 2.43,
2.44, 2.45, 2.46, 2.47, 2.48, 2.49, 2.5, 2.51, 2.52, 2.53, 2.54,
2.55, 2.56, 2.57, 2.58, 2.59, 2.6, 2.61, 2.62, 2.63, 2.64, 2.65,
2.66, 2.67, 2.68, 2.69, 2.7, 2.71, 2.72, 2.73, 2.74, 2.75, 2.76,
2.77, 2.78, 2.79, 2.8, 2.81, 2.82, 2.83, 2.84, 2.85, 2.86, 2.87,
2.88, 2.89, 2.9, 2.91, 2.92, 2.93, 2.94, 2.95, 2.96, 2.97, 2.98,
2.99)
v2 <- seq(2, 2.99, 0.01)
v1 <- round(v1,2) #rounds to 2 decimal places
v2 <- round(v2,2)
intersect(v1,v2) #returns v1
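A smaller case shows the same effect. This sketch uses the well-known 0.3 example rather than the vectors above; intersect compares doubles exactly, so a value produced by seq can fail to match the literal you typed:

```r
s <- seq(0, 1, by = 0.1)

# The fourth element is computed as 3 * 0.1, which is not exactly 0.3
exact_equal <- s[4] == 0.3
near_equal  <- isTRUE(all.equal(s[4], 0.3))   # comparison with a tolerance

# Exact matching misses the value; rounding both sides first recovers it
raw_hits     <- intersect(s, 0.3)
rounded_hits <- intersect(round(s, 1), round(0.3, 1))

print(exact_equal)
print(near_equal)
print(length(raw_hits))
print(length(rounded_hits))
```

Rounding works when you know how many decimal places are meaningful; all.equal is the usual tool when you only need a tolerant comparison of two values.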
Return TRUE/FALSE if common elements/no common elements between vectors
Another option:
vector1 <- c(1,2,3)
vector2 <- c(4,5,6,1)
length(Reduce(intersect, list(vector1, vector2))) > 0  # TRUE when at least one element is shared
Output:
[1] TRUE
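A sketch of two direct ways to get the TRUE/FALSE answer, plus a case showing why passing the intersection itself to any() is fragile: any() coerces numbers to logical (with a warning), so a shared 0 counts as FALSE, and character input cannot be coerced at all. The extra vectors are invented for illustration:

```r
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6, 1)

# Option 1: is any element of vector1 found in vector2?
has_common <- any(vector1 %in% vector2)

# Option 2: is the intersection non-empty? Also works for character vectors
chr_common <- length(intersect(c("a", "b"), c("c", "d"))) > 0

# Pitfall: the only shared element is 0, which any() coerces to FALSE
zero_case_wrong <- suppressWarnings(any(intersect(c(0, 5), c(0, 7))))
zero_case_right <- any(c(0, 5) %in% c(0, 7))

print(has_common)
print(chr_common)
print(zero_case_wrong)
print(zero_case_right)
```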
How to get the proportion of elements that match between two vectors?
Here are some options with timings, also using @Sotos and @Henrik's suggestion from comments for comparison purposes.
library(microbenchmark)
library(data.table)
microbenchmark(a1 = table(vec2 %in% vec1)[[2]] / length(vec2),
               a2 = sum(vec2 %in% vec1) / length(vec2),
               a3 = sum(!is.na(match(vec2, vec1))) / length(vec2),
               a4 = length(intersect(vec2, vec1)) / length(vec2),
               a5 = sum(vec2 %chin% vec1) / length(vec2))
#Unit: milliseconds
# expr min lq mean median uq max neval
# a1 1269.84 1340.468 1667.251 1410.252 2191.750 2535.723 100
# a2 1022.26 1086.938 1284.692 1124.565 1152.516 2286.028 100
# a3 1023.59 1125.517 1387.592 1148.337 1852.645 3849.555 100
# a4 1022.84 1088.056 1291.582 1122.846 1173.768 2277.901 100
# a5 449.19 453.146 462.781 454.365 458.178 620.996 100
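Clearly, Henrik's solution (a5, using data.table's %chin%) is the fastest: its median is about 0.45 s, against 1.1 s or more for the others. Note that %chin% only works on character vectors.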
data
set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = TRUE), "_",
               sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = TRUE), "_",
               sample(1:1000000000, 1000000))
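To see what these expressions compute without generating millions of strings, here is a sketch on tiny invented vectors (small1 and small2 are assumptions for the example). Taking the mean of a logical vector gives the proportion of TRUE values, so it is a compact equivalent of the sum(...)/length(...) form:

```r
# Hypothetical small vectors
small1 <- c("a", "b", "c", "d")
small2 <- c("a", "x", "b", "y", "z")

# Proportion of elements of small2 that also occur in small1
prop_sum  <- sum(small2 %in% small1) / length(small2)
prop_mean <- mean(small2 %in% small1)

print(prop_sum)
print(prop_mean)
```

The intersect variant (a4) counts distinct shared values, so it can give a different answer when the second vector contains repeated elements.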
Partial intersection of elements across vectors in two lists
There are some really simple/good answers, but they all seem to rely on unlist. I'm assuming that you need to preserve the grouping within matchlist, so unlisting them does not make sense. Here's a solution that works without that, using a double-lapply loop as you started to do:
out <- lapply(mylist, function(this) {
  mtch <- lapply(matchlist, intersect, this)
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
It always returns the vector with the most matches. If several candidates tie, which.max returns the index of the first maximum, so the first of the tied matches (in matchlist order) is returned.
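The question's mylist and matchlist are not reproduced here, so the following sketch uses small invented stand-ins (an assumption for illustration) to show the same loop, including the tie and the no-match cases:

```r
# Hypothetical stand-ins for the question's data
matchlist <- list("OMITTED",
                  "INVALID",
                  c("NOT REACHED", "INVALID", "OMITTED"))
mylist <- list(A = c("yes", "no", "OMITTED"),
               B = c("x", "NOT REACHED", "INVALID", "OMITTED"),
               C = c("p", "q"))

out <- lapply(mylist, function(this) {
  # Overlap of `this` with every candidate vector in matchlist
  mtch <- lapply(matchlist, intersect, this)
  # Keep the largest overlap; ties resolve to the first candidate
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})

str(out)
```

For A, the first and third candidates both overlap in one element, and the first one wins. For C, every overlap is empty, so the result is character(0).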
UPDATE
The constraint was added that not only the presence and order of the matchlist vectors is required, but also that there are no interloping words. For instance, if, as suggested in the comments, mylist$RD1 has "BLAH", then it will no longer match matchlist[[5]].
Checking for a perfectly-ordered subset of one vector to another is a bit more problematic (and therefore not a code-golf champion), and often scales poorly because we don't have easy subset determination. With that caveat, this implementation does some nested *apply functions ...
(NB: it was suggested in a comment that $RD1 should return character(0), but it does have "INVALID", which matches one of the single-length components of matchlist, so it should match, just not the longer one.)
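out <- lapply(mylist, function(this) {
  # positions in `this` where each matchlist vector could start
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # score: length of the matchlist vector if it occurs as a contiguous run, else 0
  perfectmatches <- mapply(function(ml, allis, this) {
    length(ml) * any(vapply(allis, function(i) {
      # isTRUE() treats a run that falls off the end of `this` (NA) as no match
      isTRUE(all(ml == this[i + seq_along(ml) - 1]))
    }, logical(1)))
  }, matchlist, ind, MoreArgs = list(this = this))
  if (any(perfectmatches > 0)) {
    wh <- which.max(perfectmatches)
    return(matchlist[[wh]])
  } else return(character(0))
})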
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr "INVALID"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
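The core test in that loop, whether one vector occurs inside another as an uninterrupted run, can be isolated in a small helper. This is only a sketch: contains_run and its argument names are hypothetical, not part of the answer's code or of any package.

```r
# Hypothetical helper: does `needle` occur in `haystack` as a contiguous run?
contains_run <- function(haystack, needle) {
  # candidate start positions: wherever the first element of needle appears
  starts <- which(haystack == needle[1])
  # drop starts where the run would extend past the end of haystack
  starts <- starts[starts + length(needle) - 1 <= length(haystack)]
  any(vapply(starts, function(i) {
    all(haystack[i + seq_along(needle) - 1] == needle)
  }, logical(1)))
}

target <- c("NOT REACHED", "INVALID", "OMITTED")

clean     <- contains_run(c("a", "NOT REACHED", "INVALID", "OMITTED"), target)
interlope <- contains_run(c("NOT REACHED", "BLAH", "INVALID", "OMITTED"), target)
single    <- contains_run(c("NOT REACHED", "BLAH", "INVALID", "OMITTED"), "INVALID")

print(clean)
print(interlope)
print(single)
```

The interloping "BLAH" breaks the three-word run, while the single-element "INVALID" still matches, which mirrors the $RD1 result above.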