Fastest Way to Filter a Data.Frame List Column Contents in R/Rcpp

Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:

library(data.table)
dt = as.data.table(df)  # or use setDT to convert in place

dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#   id  vars newcol
#1:  2 a,b,c    b,c
#2:  3   b,c    b,c

You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or, if you don't need the original values, you can simply overwrite the initial column: dt[, vars := lapply(vars, setdiff, 'a')].
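For comparison, here is a minimal base R sketch of the same filter. The input data is not shown in this answer, so the df below is an assumption reconstructed from the printed output, and lengths() stands in for sapply(newcol, length).

```r
# Assumed example data: reconstructed from the printed output, not taken from the question
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

# Drop 'a' from every element of the list column
df$newcol <- lapply(df$vars, setdiff, "a")

# Keep only the rows whose filtered element is non-empty
res <- df[lengths(df$newcol) > 0, ]
```

Unlike the data.table version, each assignment here may copy the data frame, which is the cost the answer is pointing at.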

Now as far as algorithmic improvements go, assuming your id values are unique for each vars (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:

dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#   id vars
#1:  2  b,c
#2:  3  b,c
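The idea behind this one-liner is to flatten the list column into long format, filter the flat vector once, and then regroup by id. A rough base R sketch of those three steps follows; the df is again an assumed reconstruction of the example data, and the name long is made up for illustration.

```r
# Assumed example data: reconstructed from the printed output
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

# 1. Flatten: one row per (id, value) pair
long <- data.frame(
  id = rep(df$id, lengths(df$vars)),
  V1 = unlist(df$vars),
  stringsAsFactors = FALSE
)

# 2. Filter the flat vector in a single vectorized step
long <- long[long$V1 != "a", ]

# 3. Regroup the surviving values by id; ids left with no values drop out
res <- split(long$V1, long$id)
```

The filtering happens once over a single character vector instead of once per list element, which is presumably where the speed-up comes from.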

To carry along the other columns, I think it's easiest to simply merge back:

dt[, othercol := 5:7]

# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#   id vars i.vars othercol
#1:  2  b,c  a,b,c        6
#2:  3  b,c    b,c        7

How to Filter Data Table Rows with condition on column of Type list() in R

You can use sapply to check, for each row, whether any of the values in vals appears in Product:

vals = c("UG12210","UG10000-WISD")

dt[Period %chin% "2018-Q1" & sapply(Product, function(v) any(vals %chin% v))]

#            Id  Period                      Product
# 1: 1000797366 2018-Q1                 UG10000-WISD
# 2: 1000797366 2018-Q1 NX11100,UG10000-WISD,UG12210
# 3: 1000797366 2018-Q1         UG10000-WISD,UG12210
# 4: 1000797366 2018-Q1         UG10000-WISD,UG12210
# 5: 1000797366 2018-Q1                      UG12210
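The same row-wise test can be sketched in base R. The data below is hypothetical and only mimics the shape of the table above; %in% replaces data.table's %chin%, and vapply is used so the result is guaranteed to be a logical vector.

```r
# Hypothetical data: values are made up, only the column names follow the answer
d <- data.frame(Period = c("2018-Q1", "2018-Q1", "2017-Q4"),
                stringsAsFactors = FALSE)
d$Product <- list("UG12210", c("NX11100", "NX22200"), "UG10000-WISD")

vals <- c("UG12210", "UG10000-WISD")

# One TRUE/FALSE per row: does this row's Product contain any of vals?
has_val <- vapply(d$Product, function(v) any(vals %in% v), logical(1))

res <- d[d$Period == "2018-Q1" & has_val, ]
```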

Rcpp subsetting rows of DataFrame

cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
  using namespace std;  
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
      ind[i] = (sub[i] == level);
  }

  return(ind);
}')

xx <- test(iris, "setosa")
table(xx)
# xx
# FALSE  TRUE 
#   100    50

The version below does the actual subsetting and returns the filtered data frame (I learnt a lot from this question myself, thanks!):

cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
  using namespace std;  
  StringVector sub = x["Species"];
  std::string level = Rcpp::as<std::string>(level_of_species[0]);
  Rcpp::LogicalVector ind(sub.size());
  for (int i = 0; i < sub.size(); i++){
    ind[i] = (sub[i] == level);
  }

 // extracting each column into a vector
 Rcpp::NumericVector   SepalLength = x["Sepal.Length"];
 Rcpp::NumericVector   SepalWidth = x["Sepal.Width"];
 Rcpp::NumericVector PetalLength = x["Petal.Length"];
 Rcpp::NumericVector   PetalWidth = x["Petal.Width"];

 return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length")  = SepalLength[ind],
                                Rcpp::Named("Sepal.Width")  = SepalWidth[ind],
                                Rcpp::Named("Petal.Length")  = PetalLength[ind],
                                Rcpp::Named("Petal.Width")  = PetalWidth[ind]
);}')

yy <- test(iris, "setosa")
str(yy)
# 'data.frame':  50 obs. of  4 variables:
#  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
#  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
#  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
#  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...

Match On (or Filter By) Column Contents in List

Using just dplyr, there's

df %>% filter(mapply(`%in%`, ID, IDS_VERIFIED))

If you are learning "tidyverse"-flavored R, there's also

library(purrr)
library(magrittr)
df %>% filter(map2_lgl(ID, IDS_VERIFIED, is_in))

By the way, analysis on list columns can be slow.
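To make the pairwise matching concrete: mapply walks ID and IDS_VERIFIED in parallel and applies %in% to each pair. A small base R sketch, with hypothetical values (only the column names come from the answer above):

```r
# Hypothetical data: ID is a plain column, IDS_VERIFIED is a list column
df <- data.frame(ID = c(1, 2, 3))
df$IDS_VERIFIED <- list(c(1, 5), c(7, 8), 3)

# Element i is TRUE when df$ID[i] appears in df$IDS_VERIFIED[[i]]
keep <- mapply(`%in%`, df$ID, df$IDS_VERIFIED)

res <- df[keep, ]
```

Each row costs one function call here, which is the reason list-column filters tend to be slower than fully vectorized ones.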

Filtering out rows according to the result of a function of variables

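rowwise is slow; avoid it if you can. The operation you are doing can be vectorized:

# Note: funs() has been deprecated since dplyr 0.8.0; with current dplyr
# the mutate step can be written as mutate(across(1:5, ~ .x %in% v)).
v <- c(9,11,33,43,44)
x1 <- x %>% 
           mutate_at(1:5, funs(. %in% v)) %>% 
           filter(rowSums(select(., 1:5)) == 3 & V6 != 13)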

It's pretty fast even with the large data set you are providing:

system.time(x1 <- x %>% mutate_at(1:5, funs(. %in% v)) %>% filter(rowSums(select(., 1:5)) == 3 & V6 != 13))

#   user  system elapsed 
#  3.561   0.807   4.465
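The same vectorized idea can be sketched without dplyr. The toy data below is hypothetical (the original large data set is not shown), and unlike the pipeline above this version leaves the first five columns unchanged instead of replacing them with logicals.

```r
v <- c(9, 11, 33, 43, 44)

# Hypothetical toy data with the same layout: five value columns plus V6
x <- data.frame(
  V1 = c(9, 1, 9),
  V2 = c(11, 11, 11),
  V3 = c(33, 2, 33),
  V4 = c(5, 43, 6),
  V5 = c(7, 44, 8),
  V6 = c(1, 2, 13)
)

# Count, per row, how many of the first five columns hold a value from v
hits <- Reduce(`+`, lapply(x[1:5], `%in%`, v))

# Keep rows with exactly three hits whose V6 is not 13
x1 <- x[hits == 3 & x$V6 != 13, ]
```

Every step operates on whole columns, so no per-row function call is needed.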

Cannot pass a list to dplyr filter using %in%

You need pull instead of as.list() because you want to filter on a vector instead of a list:

# get D_ID's associated with O_ID == "A1"
x_A1 <- x %>%
  filter(O_ID == "A1") %>%
  select(D_ID) %>%
  pull()

# get table of coefficients for D_IDs
y_A1 <- y %>%
  filter(ID %in% x_A1)
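A base R sketch of the difference, using hypothetical data (only the column names O_ID, D_ID and ID come from the answer; coef and all values are made up). Converting a one-column data frame with as.list() yields a list holding a single vector, so %in% compares each ID against that whole vector rather than against its elements.

```r
# Hypothetical data; only O_ID, D_ID and ID are names used in the answer
x <- data.frame(O_ID = c("A1", "A1", "B2"),
                D_ID = c("d1", "d2", "d3"),
                stringsAsFactors = FALSE)
y <- data.frame(ID = c("d1", "d2", "d3", "d4"),
                coef = c(0.1, 0.2, 0.3, 0.4),
                stringsAsFactors = FALSE)

# A list with one element (the whole D_ID vector): matching finds nothing
as_list <- as.list(x[x$O_ID == "A1", "D_ID", drop = FALSE])
matched_with_list <- y$ID %in% as_list

# A plain character vector, which is what pull() returns: matching works
as_vec <- x$D_ID[x$O_ID == "A1"]
y_A1 <- y[y$ID %in% as_vec, ]
```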
