Finding Elements That Do Not Overlap Between Two Vectors

Finding elements that do not overlap between two vectors

Yes, there is a way:

setdiff(list.a, list.b)
# [1] "Mary" "Jack" "Michelle"
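
Note that setdiff() is one-directional: it returns what is in the first vector but not in the second. If elements unique to either vector are wanted, a minimal sketch could combine both directions. The contents of list.a and list.b below are made up for illustration and are not the asker's data:

```r
# Hypothetical example data (not from the original question)
list.a <- c("Mary", "Jack", "Ann")
list.b <- c("Ann", "Tom")

only_a <- setdiff(list.a, list.b)   # in list.a but not in list.b
only_b <- setdiff(list.b, list.a)   # in list.b but not in list.a

# Symmetric difference: elements found in exactly one of the two vectors
sym_diff <- union(only_a, only_b)
sym_diff
# [1] "Mary" "Jack" "Tom"
```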

How to find common elements from multiple vectors?

There might be a cleverer way to go about this, but

intersect(intersect(a,b),c)

will do the job.

EDIT: More cleverly, and more conveniently if you have a lot of arguments:

Reduce(intersect, list(a,b,c))
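
As a quick illustration, here is a small sketch with made-up vectors (the values are assumptions, chosen only to show that the nested call and the Reduce() call agree):

```r
# Hypothetical example vectors, kept in a named list
vecs <- list(
  a = c(1, 3, 5, 7, 9),
  b = c(3, 6, 9, 12),
  c = c(2, 3, 4, 9)
)

# Nested form: intersect the first two, then the result with the third
nested <- intersect(intersect(vecs$a, vecs$b), vecs$c)

# Reduce() applies intersect pairwise across the whole list
common <- Reduce(intersect, vecs)
common
# [1] 3 9
```

The Reduce() form does not change when more vectors are added to the list.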

How to find elements common in at least 2 vectors?

It is much simpler than a lot of people are making it look. This should be very efficient.

  1. Put everything into a vector:

    x <- unlist(list(a, b, c, d, e))

  2. Look for duplicates:

    unique(x[duplicated(x)])
    # [1] 2 3 1 4 8

and sort if needed.

Note: If there can be duplicates within a list element (which your example does not seem to imply), build x with x <- unlist(lapply(list(a, b, c, d, e), unique)) instead.


Edit: as the OP has expressed interest in a more general solution where n >= 2, I would do:

which(tabulate(x) >= n)

if the data is only made of natural integers (1, 2, etc.) as in the example. If not:

f <- table(x)
names(f)[f >= n]

This is now not too far from James's solution, but it avoids the somewhat costly sort. It is also far faster than computing all possible combinations.
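
To make the "at least n vectors" variant concrete, here is a small sketch on made-up data (the vectors and the value of n are assumptions). It de-duplicates within each vector first, then applies both the table() and the tabulate() approach:

```r
# Hypothetical example data
vecs <- list(a = c(1, 2, 3), b = c(2, 3, 4), c = c(3, 4, 5), d = c(4, 6))
n <- 3

# De-duplicate within each vector so a value counts at most once per vector
x <- unlist(lapply(vecs, unique))

# General approach: count occurrences, keep values seen in >= n vectors
f <- table(x)
res_table <- as.numeric(names(f)[f >= n])
res_table
# [1] 3 4

# Shortcut that only works for positive integers
res_tab <- which(tabulate(x) >= n)
res_tab
# [1] 3 4
```

table() returns its names as character strings, hence the as.numeric() conversion in this numeric example.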

Eliminating partially overlapping parts of 2 vectors in R

We create a function that takes the formula and the vector ('fmla', 'vec'). It extracts the variables from 'fmla' (all.vars), finds the values in the vector that are not among the formula variables (setdiff), builds a pattern by pasting the formula variables together with "|", removes the first match of that pattern from each of those values using sub, writes the results back into 'vec', and returns the updated vector.

fun1 <- function(fmla, vec) {
  v1 <- all.vars(fmla)
  v2 <- setdiff(vec, v1)
  v3 <- sub(paste(v1, collapse = "|"), "", v2)
  vec[vec %in% v2] <- v3
  vec
}

Checking:

> identical(fun1(f1, n1), desired_output)
[1] TRUE
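
The objects f1, n1 and desired_output come from the question and are not reproduced here. As a rough illustration only, the following sketch runs the same function on made-up inputs (the formula and the vector are assumptions):

```r
fun1 <- function(fmla, vec) {
  v1 <- all.vars(fmla)                          # variable names in the formula
  v2 <- setdiff(vec, v1)                        # values that are not exact variable names
  v3 <- sub(paste(v1, collapse = "|"), "", v2)  # strip the first overlapping part
  vec[vec %in% v2] <- v3
  vec
}

# Hypothetical inputs
fmla <- out ~ age + sex
vec <- c("age", "sexMale", "height")

res <- fun1(fmla, vec)
res
# [1] "age"    "Male"   "height"
```

Because setdiff() drops duplicates, this sketch assumes the elements of vec are unique.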

Find elements NOT in the intersection of two lists

I usually prefer a shortcut, the symmetric-difference operator on Python sets:

set(a) ^ set(b)
{2, 4, 6}

R: finding intersection between two vectors

This is a floating-point error in R. See the Floating Point Guide for more information.

You can confirm that this is the cause, because rounding both vectors first returns what you're looking for:

v1 = c(2, 2.01, 2.02, 2.03, 2.04, 2.05, 2.06, 2.07, 2.08, 2.09, 2.1, 
2.11, 2.12, 2.13, 2.14, 2.15, 2.16, 2.17, 2.18, 2.19, 2.2, 2.21,
2.22, 2.23, 2.24, 2.25, 2.26, 2.27, 2.28, 2.29, 2.3, 2.31, 2.32,
2.33, 2.34, 2.35, 2.36, 2.37, 2.38, 2.39, 2.4, 2.41, 2.42, 2.43,
2.44, 2.45, 2.46, 2.47, 2.48, 2.49, 2.5, 2.51, 2.52, 2.53, 2.54,
2.55, 2.56, 2.57, 2.58, 2.59, 2.6, 2.61, 2.62, 2.63, 2.64, 2.65,
2.66, 2.67, 2.68, 2.69, 2.7, 2.71, 2.72, 2.73, 2.74, 2.75, 2.76,
2.77, 2.78, 2.79, 2.8, 2.81, 2.82, 2.83, 2.84, 2.85, 2.86, 2.87,
2.88, 2.89, 2.9, 2.91, 2.92, 2.93, 2.94, 2.95, 2.96, 2.97, 2.98,
2.99)

v2 <- seq(2, 2.99, 0.01)

v1 <- round(v1,2) #rounds to 2 decimal places
v2 <- round(v2,2)

intersect(v1,v2) #returns v1
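
The underlying effect can be reproduced on a much smaller case. The sketch below is an illustration rather than part of the original answer; near_intersect is a hypothetical helper name and the tolerance of 1e-8 is an arbitrary assumption:

```r
x <- c(0.1 + 0.2, 0.5)
y <- c(0.3, 0.5)

# Exact comparison: 0.1 + 0.2 is not stored as exactly 0.3
exact <- intersect(x, y)
exact
# [1] 0.5

# Hypothetical helper: keep elements of x that lie within 'tol' of some element of y
near_intersect <- function(x, y, tol = 1e-8) {
  keep <- vapply(x, function(xi) any(abs(xi - y) < tol), logical(1))
  unique(x[keep])
}

approx <- near_intersect(x, y)
length(approx)
# [1] 2
```

Rounding, as in the answer above, is simpler when the number of meaningful decimal places is known; a tolerance is an alternative when it is not.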

Return TRUE/FALSE if common elements/no common elements between vectors

Another option:

vector1 <- c(1,2,3)
vector2 <- c(4,5,6,1)

# TRUE if the intersection is non-empty
length(Reduce(intersect, list(vector1, vector2))) > 0

Output:

[1] TRUE
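
For just two vectors, a shorter check is possible with %in%, which also works for character vectors. This is a minimal sketch; the second pair of vectors is made up for illustration:

```r
vector1 <- c(1, 2, 3)
vector2 <- c(4, 5, 6, 1)

# %in% yields one logical per element of vector1; any() collapses them
has_common <- any(vector1 %in% vector2)
has_common
# [1] TRUE

# Hypothetical character example with no shared elements
no_common <- any(c("a", "b") %in% c("c", "d"))
no_common
# [1] FALSE
```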

How to get the proportion of elements that match between two vectors?

Here are some options with timings, also using @Sotos and @Henrik's suggestion from comments for comparison purposes.

library(microbenchmark)
library(data.table)

microbenchmark(
  a1 = table(vec2 %in% vec1)[[2]] / length(vec2),
  a2 = sum(vec2 %in% vec1) / length(vec2),
  a3 = sum(!is.na(match(vec2, vec1))) / length(vec2),
  a4 = length(intersect(vec2, vec1)) / length(vec2),
  a5 = sum(vec2 %chin% vec1) / length(vec2)
)

#Unit: milliseconds
# expr min lq mean median uq max neval
# a1 1269.84 1340.468 1667.251 1410.252 2191.750 2535.723 100
# a2 1022.26 1086.938 1284.692 1124.565 1152.516 2286.028 100
# a3 1023.59 1125.517 1387.592 1148.337 1852.645 3849.555 100
# a4 1022.84 1088.056 1291.582 1122.846 1173.768 2277.901 100
# a5 449.19 453.146 462.781 454.365 458.178 620.996 100

Clearly, Henrik's solution (a5, using %chin% from data.table) is the fastest.

data

set.seed(17)
vec1 <- paste0(sample(1:10, 10000000, replace = TRUE), "_",
               sample(1:1000000000, 10000000))
vec2 <- paste0(sample(1:10, 1000000, replace = TRUE), "_",
               sample(1:1000000000, 1000000))
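
On a tiny made-up example (the vectors below are assumptions, unrelated to the benchmark data), it is easier to see what these expressions compute. Note that the intersect() variant counts distinct values, so it can give a different proportion when vec2 contains duplicates:

```r
# Hypothetical small vectors
vec1 <- c("a", "b", "c")
vec2 <- c("a", "d", "a", "e")

p_in <- sum(vec2 %in% vec1) / length(vec2)        # counts every matching element
p_mean <- mean(vec2 %in% vec1)                    # equivalent shorthand
p_intersect <- length(intersect(vec2, vec1)) / length(vec2)  # counts distinct matches

p_in
# [1] 0.5
p_intersect
# [1] 0.25
```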

Partial intersection of elements across vectors in two lists

There are some really simple/good answers, but they all seem to rely on unlist. I'm assuming that you need to preserve the grouping within matchlist, so unlisting them does not make sense. Here's a solution that works without that, using a double-lapply loop as you started to do:

out <- lapply(mylist, function(this) {
  mtch <- lapply(matchlist, intersect, this)
  wh <- which.max(lengths(mtch))
  if (length(wh)) mtch[[wh]] else character(0)
})
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr [1:3] "NOT REACHED" "INVALID" "OMITTED"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"

It always returns the vector with the largest number of matches. If several are (somehow) tied for the most, it returns the first of them in the natural order of matchlist, because which.max returns the index of the first maximum.
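
The tie behaviour of which.max can be checked directly. This is a small sketch on made-up lengths, not on the question's data:

```r
# Hypothetical match counts with a tie for the maximum
lens <- c(first = 1, second = 3, third = 3)

# which.max reports the position of the first maximum
idx <- unname(which.max(lens))
idx
# [1] 2
```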

UPDATE

A constraint was later added: not only are the presence and order of the matchlist vectors required, but there must also be no interloping words. For instance, if, as suggested in the comments, mylist$RD1 contains "BLAH", then it will no longer match matchlist[[5]].

Checking for a perfectly-ordered subset of one vector to another is a bit more problematic (and therefore not a code-golf champion), and often scales poorly because we don't have easy subset determination. With that caveat, this implementation does some nested *apply functions ...

(NB: it was suggested in a comment that $RD1 should return character(0), but it does have "INVALID" which matches one of the single-length components of matchlist, so it should match, just not the longer one.)

out <- lapply(mylist, function(this) {
  # positions in 'this' where each matchlist vector could start
  ind <- lapply(matchlist, function(a) which(this == a[1]))
  # per matchlist vector: its length if it occurs as an uninterrupted run, else 0
  perfectmatches <- mapply(function(ml, allis, this) {
    length(ml) * any(sapply(allis, function(i) all(ml == this[ i + seq_along(ml) - 1 ])))
  }, matchlist, ind, MoreArgs = list(this=this))
  if (any(perfectmatches > 0, na.rm = TRUE)) {
    wh <- which.max(perfectmatches)
    return(matchlist[[wh]])
  } else return(character(0))
})
str(out)
# List of 9
# $ PP : chr "OMITTED"
# $ IN01: chr "OMITTED"
# $ RD1 : chr "INVALID"
# $ LOS : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ COM : chr(0)
# $ VR1 : chr "OMITTED"
# $ INF : chr [1:2] "NOT APPLICABLE" "OMITTED"
# $ IST : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
# $ CMP : chr [1:2] "LOGICALLY NOT APPLICABLE" "OMITTED"
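
The core test in the update, whether one vector occurs inside another as an uninterrupted, in-order run, can be isolated in a small helper. The following is only a sketch: contains_run is a hypothetical name, not part of the answer's code or of base R, and the test vectors are made up:

```r
# Hypothetical helper: TRUE if 'needle' appears in 'haystack' as a contiguous run
contains_run <- function(needle, haystack) {
  n <- length(needle)
  if (n == 0L) return(TRUE)
  if (n > length(haystack)) return(FALSE)
  # every position where a run of length n could start
  starts <- seq_len(length(haystack) - n + 1L)
  any(vapply(starts, function(i) {
    all(haystack[i:(i + n - 1L)] == needle)
  }, logical(1)))
}

words <- c("NOT REACHED", "INVALID", "BLAH", "OMITTED")

contains_run(c("NOT REACHED", "INVALID"), words)   # adjacent, in order
# [1] TRUE
contains_run(c("INVALID", "OMITTED"), words)       # interrupted by "BLAH"
# [1] FALSE
```

Restricting the candidate start positions up front avoids indexing past the end of the vector, which would otherwise introduce NA comparisons.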

