Find Elements Not in Smaller Character Vector List But in Big List

Look at help("%in%") - there's an example all the way at the bottom of that page that addresses this situation.

A <- c("A", "B", "C", "D")
B <- c("A", "B", "C")
(new <- A[which(!A %in% B)])

# [1] "D"

EDIT:

As Tyler points out, I should take my own advice and read the support documents. which() is unnecessary when using %in% for this example. So,

(new <- A[!A %in% B])

# [1] "D"
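One detail worth knowing when choosing between logical indexing and a set operation: `A[!A %in% B]` keeps duplicates and the original order, while `setdiff()` returns unique values only. A minimal sketch, assuming a hypothetical variant of `A` in which "D" occurs twice:

```r
# Hypothetical data: "D" appears twice in the larger vector.
A <- c("A", "B", "C", "D", "D")
B <- c("A", "B", "C")

# Logical indexing keeps every occurrence, in the original order.
kept_all <- A[!A %in% B]

# setdiff() treats its inputs as sets, so duplicates are dropped.
kept_unique <- setdiff(A, B)

print(kept_all)     # both "D" entries survive
print(kept_unique)  # a single "D"
```

Which one is appropriate depends on whether repeated values carry meaning in your data.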

Selecting elements bigger than a particular number in a list

You can use Filter:

Listsubset <- Filter(function(x) x$n > 10, BigList)

Or an alternative with sapply:

Listsubset <- BigList[sapply(BigList, `[[`, 'n') > 10]
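The question's `BigList` is not shown, so here is a small runnable sketch with a hypothetical list of records, each carrying a numeric field `n`. It also uses `vapply()` in place of `sapply()`, which is a substitution of mine rather than part of the original answer; it fails loudly if some element does not yield exactly one number.

```r
# Hypothetical input: a list of records, each with a numeric field `n`.
BigList <- list(
  list(id = "a", n = 5),
  list(id = "b", n = 12),
  list(id = "c", n = 30)
)

# Filter() keeps the elements for which the predicate returns TRUE.
by_filter <- Filter(function(x) x$n > 10, BigList)

# vapply() is a stricter sapply(): it requires one numeric value per element.
n_values <- vapply(BigList, function(x) x$n, numeric(1))
by_index <- BigList[n_values > 10]

# Both approaches select the records "b" and "c".
identical(by_filter, by_index)
```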

Finding elements that do not overlap between two vectors

Yes, there is a way:

setdiff(list.a, list.b)
# [1] "Mary" "Jack" "Michelle"
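Note that `setdiff()` is one-directional: it returns what is in the first argument but not in the second. If "do not overlap" is meant in both directions, the two differences can be combined. A sketch with hypothetical contents for `list.a` and `list.b` (the original vectors are not shown):

```r
# Hypothetical vectors; only "John" is shared.
list.a <- c("Mary", "Jack", "Michelle", "John")
list.b <- c("John", "Anna")

only_in_a <- setdiff(list.a, list.b)  # in list.a but not in list.b
only_in_b <- setdiff(list.b, list.a)  # in list.b but not in list.a

# Symmetric difference: everything that appears in exactly one of the two.
non_overlapping <- union(only_in_a, only_in_b)
print(non_overlapping)
```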

In R, find elements of a vector in a list using vectorization

We can do this, which seems to be the fastest by far:

v1 <- c(1, 200, 4000)
L1 <- list(1:4, 1:4*100, 1:4*1000)

sequence(lengths(L1))[match(v1, unlist(L1))]
# [1] 1 2 4
sequence(lengths(L1))[which(unlist(L1) %in% v1)]
# [1] 1 2 4
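To see why this works, it may help to look at the intermediate vectors. The sketch below reuses `v1` and `L1` from above; the names `flat`, `within_pos`, `list_id` and `hit` are mine and purely illustrative.

```r
v1 <- c(1, 200, 4000)
L1 <- list(1:4, 1:4 * 100, 1:4 * 1000)

# Flatten the list into one vector of 12 values.
flat <- unlist(L1)

# For each flattened value, its position inside its own list element:
# 1 2 3 4 1 2 3 4 1 2 3 4
within_pos <- sequence(lengths(L1))

# For each flattened value, the index of the list element it came from:
# 1 1 1 1 2 2 2 2 3 3 3 3
list_id <- rep(seq_along(L1), times = lengths(L1))

# match() gives the position of the first occurrence of each v1 value in flat.
hit <- match(v1, flat)

within_pos[hit]  # position within the list element
list_id[hit]     # which list element holds the value
```

Because `match()` reports only the first occurrence and returns NA when there is none, this relies on each value of `v1` appearing once in the flattened list.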

library(microbenchmark)
library(tidyverse)

microbenchmark(
  akrun_sapply = {sapply(L1, function(x) which(x %in% v1))},
  akrun_Vectorize = {Vectorize(function(x) which(x %in% v1))(L1)},
  akrun_mapply = {mapply(function(x, y) which(x %in% y), L1, v1)},
  akrun_mapply_match = {mapply(match, v1, L1)},
  akrun_map2 = {purrr::map2_int(L1, v1, ~ .x %in% .y %>% which)},
  CPak = {setNames(rep(1:length(L1), times=lengths(L1)), unlist(L1))[as.character(v1)]},
  zacdav = {sequence(lengths(L1))[match(v1, unlist(L1))]},
  zacdav_which = {sequence(lengths(L1))[which(unlist(L1) %in% v1)]},
  times = 10000
)

Unit: microseconds
               expr     min       lq      mean   median       uq        max neval
       akrun_sapply  18.187  22.7555  27.17026  24.6140  27.8845   2428.194 10000
    akrun_Vectorize  60.119  76.1510  88.82623  83.4445  89.9680   2717.420 10000
       akrun_mapply  19.006  24.2100  29.78381  26.2120  29.9255   2911.252 10000
 akrun_mapply_match  14.136  18.4380  35.45528  20.0275  23.6560 127960.324 10000
         akrun_map2 217.209 264.7350 303.64609 277.5545 298.0455   9204.243 10000
               CPak  15.741  19.7525  27.31918  24.7150  29.0340    235.245 10000
             zacdav   6.649   9.3210  11.30229  10.4240  11.5540   2399.686 10000
       zacdav_which   7.364  10.2395  12.22632  11.2985  12.4515   2492.789 10000

Using R, How to use a character vector to search for matches in a very large character vector

grep and family only allow a single pattern= in their call, but one can use Vectorize to help with this:

out <- Vectorize(grepl, vectorize.args = "pattern")(Cities, Locations)
rownames(out) <- Locations
out
#                New York San Francisco Austin
# San Antonio/TX    FALSE         FALSE  FALSE
# Austin/TX         FALSE         FALSE   TRUE
# Boston/MA         FALSE         FALSE  FALSE

(I added rownames(.) purely to identify columns/rows from the source data.)

With this, if you want to know which index points where, then you can do

apply(out, 1, function(z) which(z)[1])
# San Antonio/TX      Austin/TX      Boston/MA
#             NA              3             NA
apply(out, 2, function(z) which(z)[1])
#      New York San Francisco        Austin
#            NA            NA             2

The first gives, for each location, the index within Cities that matches it. The second gives, for each city, the index within Locations that matches it. Both assume at most a one-to-one match; if a row or column has more than one match, which(z)[1] silently hides the second and subsequent ones, which is likely not what you want.
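If more than one match per city or location is possible, one option is to list every TRUE cell of the matrix instead of taking only the first. The sketch below is an assumption-laden illustration: the `Locations` vector is hypothetical (it contains two entries matching "Austin"), and `which(..., arr.ind = TRUE)` is my substitution for the `apply()` calls above.

```r
Cities <- c("New York", "San Francisco", "Austin")
# Hypothetical locations: two of them contain "Austin".
Locations <- c("San Antonio/TX", "Austin/TX", "North Austin/TX")

# Rows correspond to Locations, columns to Cities.
out <- Vectorize(grepl, vectorize.args = "pattern")(Cities, Locations)
rownames(out) <- Locations

# One row per TRUE cell, with columns "row" and "col".
hits <- which(out, arr.ind = TRUE)

# Translate the indices back into readable location/city pairs.
matches <- data.frame(
  location = Locations[hits[, "row"]],
  city = Cities[hits[, "col"]],
  stringsAsFactors = FALSE
)
print(matches)
```

This returns a long table of pairs, so nothing is dropped when a city matches several locations.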

How to tell what is in one vector and not another?

You can use the setdiff() (set difference) function:

> setdiff(x, y)
[1] 1

See which vector in a list is contained within a vector from another list (finding people's name matches)

Since you are dealing with lists, it is easier to collapse each element into a single string so that regular expressions can be applied. Sort the words within each name first, so that word order does not matter. After that, matching is straightforward:

lst  <- sapply(first_last_names_list, function(x) paste0(sort(x), collapse = " "))
lst1 <- gsub("\\s|$", ".*", lst)
lst2 <- sapply(full_names_list, function(x) paste(sort(x), collapse = " "))
(lst3 <- Vectorize(grep)(lst1, list(lst2), value = TRUE, ignore.case = TRUE))
               boy.*boy.*             bob.*orengo.*        kalonzo.*musyoka.*         anami.*lisamula.*
           "boy boy juma"        "bob james orengo" "kalonzo musyoka stephen" "anami lisamula silverse"

Now, if you want to link first_last_names_list to full_names_list:

setNames(
  full_names_list[match(lst3, lst2)],
  sapply(
    first_last_names_list[grep(paste0(names(lst3), collapse = "|"), lst1)],
    paste, collapse = " "
  )
)
$`boy boy`
[1] "boy" "juma" "boy"

$`bob orengo`
[1] "james" "bob" "orengo"

$`kalonzo musyoka`
[1] "stephen" "kalonzo" "musyoka"

$`anami lisamula`
[1] "lisamula" "silverse" "anami"

Here the names come from first_last_names_list and the elements from full_names_list. In general, this kind of matching is easier with character vectors than with lists.
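As an alternative to regular expressions, the same link can be sketched with a plain subset test: a short name matches a full name if all of its words occur in it. This is a different technique from the answer above, not a restatement of it. The two input lists are reconstructed from the printed output and are therefore assumptions, and `is_subset()` is a hypothetical helper.

```r
# Inputs reconstructed from the output shown above (assumed, not original).
first_last_names_list <- list(
  c("boy", "boy"), c("bob", "orengo"),
  c("kalonzo", "musyoka"), c("anami", "lisamula")
)
full_names_list <- list(
  c("boy", "juma", "boy"), c("james", "bob", "orengo"),
  c("stephen", "kalonzo", "musyoka"), c("lisamula", "silverse", "anami")
)

# Hypothetical helper: TRUE if every word of `short` occurs in `full`.
is_subset <- function(short, full) all(tolower(short) %in% tolower(full))

# For each short name, the index of the first full name that contains it.
match_idx <- vapply(
  first_last_names_list,
  function(short) {
    hits <- which(vapply(full_names_list,
                         function(full) is_subset(short, full),
                         logical(1)))
    if (length(hits) > 0) hits[1] else NA_integer_
  },
  integer(1)
)

linked <- setNames(
  full_names_list[match_idx],
  vapply(first_last_names_list, paste, character(1), collapse = " ")
)
print(linked)
```

One limitation: `%in%` ignores how often a word occurs, so "boy boy" would also match a full name that contains "boy" only once.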

Combine a list of similar length vectors with NAs to one vector

Here's a vectorised version of your code:

dat <- do.call(cbind, x)
# Logical matrix: TRUE where a value is present
mat <- !is.na(dat)
# Number of non-NA values in each row
rs <- rowSums(mat)
# First non-NA value in each row
val <- dat[cbind(seq_len(nrow(dat)), max.col(mat, ties.method = 'first'))]
# More than one non-NA value
val[rs > 1] <- 'conflict'
# Only NA values
val[rs == 0] <- 'none'
val

#[1] "A" "A" "A" "A" "conflict" "B"
#[7] "B" "B" "B" "none"

EDIT: Updated to include a suggestion from @Henrik to avoid nested ifelse, which should make the solution faster.
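The input `x` from the question is not shown, so here is a small end-to-end run on a hypothetical list of two equal-length vectors. It covers all three cases: a single value, a conflict, and a row with only NA.

```r
# Hypothetical input: NA marks positions where a source has no value.
x <- list(
  c("A", "A", NA, NA),
  c(NA, "B", "B", NA)
)

dat <- do.call(cbind, x)  # 4 x 2 character matrix, one column per vector
mat <- !is.na(dat)        # TRUE where a value is present
rs  <- rowSums(mat)       # number of non-NA values per row: 1 2 1 0

# Pick the first non-NA column in each row via matrix indexing.
val <- dat[cbind(seq_len(nrow(dat)), max.col(mat, ties.method = "first"))]
val[rs > 1]  <- "conflict"  # row 2 has both "A" and "B"
val[rs == 0] <- "none"      # row 4 has no value at all
val
# values: "A" "conflict" "B" "none"
```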

Conditionally removing elements from a vector without an if statement in R

Use setdiff:

setdiff(a1, "out")
#[1] "bagh" "bir"

setdiff(a2, "out")
#[1] "bagh" "bir"

%in% works as well, as long as we don't wrap it in which():

a1[!a1 %in% "out"]
a2[!a2 %in% "out"]
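The reason to avoid which() here is that negative indexing with an empty index drops everything rather than nothing. A minimal sketch, assuming hypothetical vectors where `a1` contains "out" and `a2` does not:

```r
a1 <- c("bagh", "out", "bir")  # hypothetical: contains "out"
a2 <- c("bagh", "bir")         # hypothetical: does not contain "out"

# Works when there is something to remove.
ok <- a1[-which(a1 == "out")]

# which() returns integer(0) here, and a2[-integer(0)] selects nothing,
# so the whole vector is lost.
broken <- a2[-which(a2 == "out")]

# Logical indexing behaves correctly in both cases.
safe1 <- a1[!a1 %in% "out"]
safe2 <- a2[!a2 %in% "out"]

print(broken)  # character(0)
print(safe2)   # both elements kept
```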

