Fastest Way to Filter a Data.Frame List Column Contents in R/Rcpp

Fastest way to filter a data.frame list column contents in R / Rcpp

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

library(data.table)
dt = as.data.table(df) # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
# id vars newcol
#1: 2 a,b,c b,c
#2: 3 b,c b,c

You can also delete the original column (with basically 0 cost), by adding [, vars := NULL] at the end). Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].


Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
# id vars
#1: 2 b,c
#2: 3 b,c

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
# id vars i.vars othercol
#1: 2 b,c a,b,c 6
#2: 3 b,c b,c 7

How to Filter Data Table Rows with condition on column of Type list() in R

You can use sapply function to check if any of the values in vals is in Product for each row:

vals = c("UG12210","UG10000-WISD")

dt[Period %chin% "2018-Q1" & sapply(Product, function(v) any(vals %chin% v))]

# Id Period Product
# 1: 1000797366 2018-Q1 UG10000-WISD
# 2: 1000797366 2018-Q1 NX11100,UG10000-WISD,UG12210
# 3: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 4: 1000797366 2018-Q1 UG10000-WISD,UG12210
# 5: 1000797366 2018-Q1 UG12210

Rcpp subsetting rows of DataFrame

cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
using namespace std;
StringVector sub = x["Species"];
std::string level = Rcpp::as<std::string>(level_of_species[0]);
Rcpp::LogicalVector ind(sub.size());
for (int i = 0; i < sub.size(); i++){
ind[i] = (sub[i] == level);
}

return(ind);
}')

xx=test(iris, "setosa")
> table(xx)
xx
FALSE TRUE
100 50

Subsetting done!!! (i myself learnt a lot from this question..thanks!)

cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
using namespace std;
StringVector sub = x["Species"];
std::string level = Rcpp::as<std::string>(level_of_species[0]);
Rcpp::LogicalVector ind(sub.size());
for (int i = 0; i < sub.size(); i++){
ind[i] = (sub[i] == level);
}

// extracting each column into a vector
Rcpp::NumericVector SepalLength = x["Sepal.Length"];
Rcpp::NumericVector SepalWidth = x["Sepal.Width"];
Rcpp::NumericVector PetalLength = x["Petal.Length"];
Rcpp::NumericVector PetalWidth = x["Petal.Width"];

return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length") = SepalLength[ind],
Rcpp::Named("Sepal.Width") = SepalWidth[ind],
Rcpp::Named("Petal.Length") = PetalLength[ind],
Rcpp::Named("Petal.Width") = PetalWidth[ind]
);}')

yy=test(iris, "setosa")
> str(yy)
'data.frame': 50 obs. of 4 variables:
$ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
$ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
$ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
$ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

Match On (of Filter By) Column Contents in List

Using just dplyr, there's

df %>% filter(mapply(`%in%`, ID, IDS_VERIFIED))

If you are learning "tidyverse"-flavored R, there's also

library(purrr)
library(magrittr)
df %>% filter(map2_lgl(ID, IDS_VERIFIED, is_in))

By the way, analysis on list columns can be slow. (An example here.)

Filtering out rows according to the result of a function of variables

rowwise is slow, you should avoid using it if you can. The operation you are doing can be vectorized:

v <- c(9,11,33,43,44)
x1 <- x %>%
mutate_at(1:5, funs(. %in% v)) %>%
filter(rowSums(select(., 1:5)) == 3 & V6 != 13)

It's pretty fast even with the large data set you are providing:

system.time(x1 <- x %>% mutate_at(1:5, funs(. %in% v)) %>% filter(rowSums(select(., 1:5)) == 3 & V6 != 13))

# user system elapsed
# 3.561 0.807 4.465

Cannot pass a list to dplyr filter using %in%

You need pull instead of as.list() because you want to filter on a vector instead of a list:

# get D_ID's associated with O_ID == "A1"
x_A1 <- x %>%
filter(O_ID == "A1") %>%
select(D_ID) %>%
pull()

# get table of coefficients for D_IDs
y_A1 <- y %>%
filter(ID %in% x_A1)


Related Topics



Leave a reply



Submit