Fastest way to filter a data.frame list column contents in R / Rcpp
Setting aside any algorithmic improvements, the analogous data.table solution is automatically going to be faster because you won't have to copy the entire thing just to add a column:
library(data.table)
dt = as.data.table(df) # or use setDT to convert in place
dt[, newcol := lapply(vars, setdiff, 'a')][sapply(newcol, length) != 0]
#    id  vars newcol
# 1:  2 a,b,c    b,c
# 2:  3   b,c    b,c
You can also delete the original column (at basically zero cost) by adding [, vars := NULL] at the end. Or you can simply overwrite the initial column if you don't need that info, i.e. dt[, vars := lapply(vars, setdiff, 'a')].
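The df from the question is not reproduced here. As a minimal sketch, assuming a three-row data.frame with an id column and a list column vars (values chosen to be consistent with the output shown above, but still an assumption), the whole pipeline can be run end to end like this:

```r
library(data.table)

# Hypothetical input: the original df is not shown, so these values are assumed.
df <- data.frame(id = 1:3)
df$vars <- list("a", c("a", "b", "c"), c("b", "c"))

dt <- as.data.table(df)  # setDT(df) would convert in place instead

# Add the filtered list column by reference, then drop rows left empty.
res <- dt[, newcol := lapply(vars, setdiff, "a")][lengths(newcol) != 0]
print(res)
```

lengths() is used here in place of sapply(newcol, length); both give the per-row element counts, but lengths() is a base R primitive written for exactly this case.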
Now as far as algorithmic improvements go, assuming your id values are unique for each vars entry (and if not, add a new unique identifier), I think this is much faster and automatically takes care of the filtering:
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), by = id]
#    id vars
# 1:  2  b,c
# 2:  3  b,c
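To see why the one-liner handles the filtering on its own, it may help to split it into its three steps. This is only a sketch on assumed data (the same hypothetical three rows as before, not the asker's real table):

```r
library(data.table)

# Hypothetical data, assumed for illustration.
dt <- data.table(id = 1:3,
                 vars = list("a", c("a", "b", "c"), c("b", "c")))

# Step 1: flatten the list column to one row per (id, element); the
# unnamed result column gets the default name V1.
long <- dt[, unlist(vars), by = id]

# Step 2: drop the unwanted elements with an ordinary vectorized filter.
kept <- long[!V1 %in% "a"]

# Step 3: collapse back to one list element per id.
res <- kept[, .(vars = list(V1)), by = id]
print(res)
```

An id whose elements are all removed in step 2 has no rows left to group in step 3, which is why it disappears from the result without an explicit length check.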
To carry along the other columns, I think it's easiest to simply merge back:
dt[, othercol := 5:7]
# notice the keyby
dt[, unlist(vars), by = id][!V1 %in% 'a', .(vars = list(V1)), keyby = id][dt, nomatch = 0]
#    id vars i.vars othercol
# 1:  2  b,c  a,b,c        6
# 2:  3  b,c    b,c        7
How to Filter Data Table Rows with condition on column of Type list() in R
You can use the sapply function to check if any of the values in vals is in Product for each row:
vals = c("UG12210","UG10000-WISD")
dt[Period %chin% "2018-Q1" & sapply(Product, function(v) any(vals %chin% v))]
#            Id  Period                      Product
# 1: 1000797366 2018-Q1                 UG10000-WISD
# 2: 1000797366 2018-Q1 NX11100,UG10000-WISD,UG12210
# 3: 1000797366 2018-Q1         UG10000-WISD,UG12210
# 4: 1000797366 2018-Q1         UG10000-WISD,UG12210
# 5: 1000797366 2018-Q1                      UG12210
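The asker's table is not included above, so here is a small self-contained sketch of the same row filter on made-up data (the Id values and products below are assumptions). It uses vapply rather than sapply, which is a substitution on my part: vapply guarantees a logical vector even when the table has zero rows.

```r
library(data.table)

# Hypothetical data: three rows with a list column of product codes.
dt <- data.table(
  Id      = c(1L, 2L, 3L),
  Period  = c("2018-Q1", "2018-Q1", "2018-Q2"),
  Product = list(c("NX11100", "UG12210"), "NX11100", "UG12210")
)

vals <- c("UG12210", "UG10000-WISD")

# One TRUE/FALSE per row: does this row's Product vector contain any of vals?
keep <- vapply(dt$Product, function(v) any(vals %chin% v), logical(1))

res <- dt[Period == "2018-Q1" & keep]
print(res)
```

Row 2 is dropped because its only product is not in vals, and row 3 is dropped because of the Period condition.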
Rcpp subsetting rows of DataFrame
cppFunction('LogicalVector test(DataFrame x, StringVector level_of_species) {
using namespace std;
StringVector sub = x["Species"];
std::string level = Rcpp::as<std::string>(level_of_species[0]);
Rcpp::LogicalVector ind(sub.size());
for (int i = 0; i < sub.size(); i++){
ind[i] = (sub[i] == level);
}
return(ind);
}')
xx <- test(iris, "setosa")
table(xx)
# xx
# FALSE  TRUE
#   100    50
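Subsetting done! (I learnt a lot from this question myself, thanks!) The function above only returns the logical index; the version below uses that index to subset each column and returns the filtered data frame: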
cppFunction('Rcpp::DataFrame test(DataFrame x, StringVector level_of_species) {
using namespace std;
StringVector sub = x["Species"];
std::string level = Rcpp::as<std::string>(level_of_species[0]);
Rcpp::LogicalVector ind(sub.size());
for (int i = 0; i < sub.size(); i++){
ind[i] = (sub[i] == level);
}
// extracting each column into a vector
Rcpp::NumericVector SepalLength = x["Sepal.Length"];
Rcpp::NumericVector SepalWidth = x["Sepal.Width"];
Rcpp::NumericVector PetalLength = x["Petal.Length"];
Rcpp::NumericVector PetalWidth = x["Petal.Width"];
return Rcpp::DataFrame::create(Rcpp::Named("Sepal.Length") = SepalLength[ind],
Rcpp::Named("Sepal.Width") = SepalWidth[ind],
Rcpp::Named("Petal.Length") = PetalLength[ind],
Rcpp::Named("Petal.Width") = PetalWidth[ind]
);}')
yy <- test(iris, "setosa")
str(yy)
# 'data.frame': 50 obs. of 4 variables:
# $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
# $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
# $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
# $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
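As a sanity check on the Rcpp version, the same subset can be built in plain R. This is only a reference sketch for comparison, not part of the original answer; the names ind and expected are mine.

```r
# Logical index over the built-in iris data, mirroring the loop in the C++ code.
ind <- iris$Species == "setosa"

# Same four numeric columns the Rcpp function extracts, subset by the index.
expected <- iris[ind, c("Sepal.Length", "Sepal.Width",
                        "Petal.Length", "Petal.Width")]
rownames(expected) <- NULL

print(table(ind))
str(expected)
```

The counts and the first values of each column can then be compared by eye against the table(xx) and str(yy) output shown above.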
Match On (or Filter By) Column Contents in List
Using just dplyr, there's
df %>% filter(mapply(`%in%`, ID, IDS_VERIFIED))
If you are learning "tidyverse"-flavored R, there's also
library(purrr)
library(magrittr)
df %>% filter(map2_lgl(ID, IDS_VERIFIED, is_in))
By the way, analysis on list columns can be slow.
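The question's df is not shown, so here is a minimal sketch on assumed data: a tibble with a scalar ID column and a list column IDS_VERIFIED. mapply walks the two columns in parallel and asks, row by row, whether that row's ID appears in that row's list element.

```r
library(dplyr)

# Hypothetical data: column names follow the answer, values are made up.
df <- tibble(
  ID           = c(1, 2, 3),
  IDS_VERIFIED = list(c(1, 5), c(4, 6), 3)
)

# Row i is kept when ID[i] is found in IDS_VERIFIED[[i]].
hit <- mapply(`%in%`, df$ID, df$IDS_VERIFIED)
res <- df %>% filter(mapply(`%in%`, ID, IDS_VERIFIED))
print(res)
```

Row 2 is dropped because 2 is not among c(4, 6).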
Filtering out rows according to the result of a function of variables
rowwise is slow; you should avoid using it if you can. The operation you are doing can be vectorized (funs() is deprecated in current dplyr, so the formula shorthand is used here instead):
v <- c(9,11,33,43,44)
x1 <- x %>%
mutate_at(1:5, ~ . %in% v) %>%
filter(rowSums(select(., 1:5)) == 3 & V6 != 13)
It's pretty fast even with the large data set you are providing:
system.time(x1 <- x %>% mutate_at(1:5, ~ . %in% v) %>% filter(rowSums(select(., 1:5)) == 3 & V6 != 13))
# user system elapsed
# 3.561 0.807 4.465
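The large data set from the question is not available here, so this is a sketch of the same vectorized idea on a tiny made-up table, written in base R so the intermediate logical matrix is visible. The column names V1 to V6 follow the answer; the values are assumptions.

```r
# Hypothetical data: four rows, five value columns plus V6.
x <- data.frame(
  V1 = c(9, 1, 9, 9),
  V2 = c(11, 11, 11, 11),
  V3 = c(33, 2, 33, 33),
  V4 = c(5, 43, 6, 43),
  V5 = c(7, 44, 8, 0),
  V6 = c(1, 1, 13, 2)
)
v <- c(9, 11, 33, 43, 44)

# One logical column per value column: is each cell one of the values in v?
in_v <- vapply(x[1:5], function(col) col %in% v, logical(nrow(x)))

# Count hits per row, then keep rows with exactly 3 hits and V6 != 13.
hits <- rowSums(in_v)
x1 <- x[hits == 3 & x$V6 != 13, ]
print(x1)
```

Row 3 has three hits but fails the V6 condition, and row 4 has four hits, so only the first two rows survive. Every step works on whole columns at once, which is what makes it faster than evaluating a function row by row.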
Cannot pass a list to dplyr filter using %in%
You need pull instead of as.list() because you want to filter on a vector instead of a list:
# get D_ID's associated with O_ID == "A1"
x_A1 <- x %>%
filter(O_ID == "A1") %>%
select(D_ID) %>%
pull()
# get table of coefficients for D_IDs
y_A1 <- y %>%
filter(ID %in% x_A1)
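The x and y tables from the question are not shown, so here is a small sketch on assumed data that runs the two steps above. The column names follow the answer; the coef column and all values are hypothetical.

```r
library(dplyr)

# Hypothetical lookup table and coefficient table.
x <- tibble(O_ID = c("A1", "A1", "B2"),
            D_ID = c("d1", "d2", "d3"))
y <- tibble(ID   = c("d1", "d2", "d3", "d4"),
            coef = c(0.1, 0.2, 0.3, 0.4))

# pull() returns a plain character vector, which is what %in% expects.
x_A1 <- x %>%
  filter(O_ID == "A1") %>%
  pull(D_ID)

y_A1 <- y %>%
  filter(ID %in% x_A1)
print(y_A1)
```

Calling as.list() on the one-column result would instead give a list holding the whole column as a single element, so %in% would compare each ID against that one element rather than against the individual values.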