R: fast (conditional) subsetting where feasible
I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):
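f = function(x, ..., verbose=FALSE){
  # capture the conditions unevaluated
  L = substitute(list(...))[-1]
  # log table: one row per condition, with a flag for skipped ones
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    # apply the i-th condition to the current subset
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d
    } else {
      # condition would leave no rows: skip it and record that
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}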
Usage
> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556
The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I see it.
From the question: "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."
If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x) to more easily keep track of what the query was and what happened. The verbose console output could also be captured and returned.
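For the non-interactive case, a minimal sketch of that variant might look like the following. It assumes the data.table package is installed; the name f2 and the toy tables are made up for illustration, and capturing the verbose output is left out.

```r
library(data.table)

# Hypothetical variant of f(): returns the skip log together with the result
f2 <- function(x, ...) {
  L   <- substitute(list(...))[-1]
  mon <- data.table(cond = as.character(L), skip = FALSE)
  for (i in seq_along(L)) {
    d <- eval(substitute(x[cond], list(cond = L[[i]])))
    if (nrow(d)) {
      x <- d
    } else {
      set(mon, i, "skip", TRUE)  # record that condition i was skipped
    }
  }
  list(mon = mon, x = x)
}

# Toy data, invented for this example
dts <- list(
  a = data.table(id = 1:5, x = c(100, 110, 120, 130, 140),
                 y = c(200, 210, 220, 230, 240)),
  b = data.table(id = 1:3, x = c(1, 2, 3), y = c(300, 400, 500))
)

# Apply the same conditions to every table in the list
res <- lapply(dts, function(d) f2(d, x > 119, y > 219, y > 1e6))
res$a$mon  # which conditions were skipped for table "a"
res$a$x    # the filtered table
```

Each element of res then carries both the filtered table and a record of which conditions were skipped for it.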
fast subsetting in R
One of the main issues is the matching of row names: the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package if you want. This is a minor modification with some speedup:
# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
user system elapsed
69.207 5.545 74.787
# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
user system elapsed
36.810 10.003 47.082
# fastmatch
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
user system elapsed
19.145 3.012 22.226
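To see the partial-matching behaviour of [.data.frame in isolation, here is a small sketch with a made-up data frame (the names df, partial and exact are only for illustration):

```r
# Toy data frame with character row names
df <- data.frame(v = 1:3, row.names = c("alpha", "beta", "gamma"))

# [.data.frame partially matches row names: "al" picks up "alpha"
partial <- df["al", , drop = FALSE]

# match() is exact: "al" is not a row name, so the index is NA
exact <- df[match("al", rownames(df)), , drop = FALSE]

partial$v  # value from the "alpha" row
exact$v    # NA, because nothing matched exactly
```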
You can get a further speedup by not using [ (it is slow for data frames) and instead splitting the data frame (using split), provided your rows are non-overlapping and cover all rows (so that each row can be mapped to exactly one entry in rows).
Depending on your actual data, you may be better off with matrices, which have far faster subsetting operators since they are native.
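As a rough sketch of the split idea, assuming rows is a named list of row-name vectors that are non-overlapping and cover every row (the data here are invented):

```r
# Toy data; 'rows' partitions the row names into three groups
dat  <- data.frame(id = 1:6, v = c(10, 20, 30, 40, 50, 60),
                   row.names = letters[1:6])
rows <- list(g1 = c("a", "c"), g2 = c("b", "d", "e"), g3 = "f")

# For each row of dat, look up the name of the entry in 'rows' that holds it
grp <- rep(names(rows), lengths(rows))[match(rownames(dat), unlist(rows))]

# One pass over the data frame instead of one [ call per entry
res <- split(dat, factor(grp, levels = names(rows)))
res$g2
```

Note that within each piece the rows come back in the order they have in dat, not in the order listed in rows.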
is it possible to subset a data.frame based on a row range AND a logical condition in r?
You can do the subsetting in either of these ways.
- Based on a logical vector:
mtcars[seq(nrow(mtcars)) %in% 1:5 & mtcars$cyl==6,]
#                mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4      21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag  21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Hornet 4 Drive 21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
- Based on a row range:
mtcars[intersect(1:5, which(mtcars$cyl==6)),]
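The two forms agree on mtcars, but they can differ when the condition column contains NA: a logical index keeps an all-NA row for each NA, while which() drops it. A small sketch with made-up data:

```r
# Toy data with a missing value in the condition column
df <- data.frame(cyl = c(6, NA, 4, 6, 8), hp = c(110, 95, 93, 110, 175))

# Logical vector: TRUE & NA is NA, which yields an all-NA row
by_logical <- df[seq_len(nrow(df)) %in% 1:5 & df$cyl == 6, ]

# Row indices: which() silently discards the NA
by_index <- df[intersect(1:5, which(df$cyl == 6)), ]

nrow(by_logical)  # includes the NA row
nrow(by_index)    # only the rows where the condition is TRUE
```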
conditional data.table match for subset of data.table
An in-place update of colB in DT1 would work as follows:
DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
   colA colB timeA
1:    1    A     2
2:    1   YY     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4
This indexes the rows where colB is NA and, after a join on the conditions defined in on= ..., replaces the missing values with the matching values found in colD.
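The original inputs are not shown here, so the following is only a self-contained sketch with invented tables that reuse the column names from the answer. It assumes the data.table package is installed.

```r
library(data.table)

# Invented inputs; only the column names follow the answer above
DT1 <- data.table(colA = c(1, 1, 2), colB = c("A", NA, NA), timeA = c(2, 4, 6))
DT2 <- data.table(colC = c(1, 2), timeB1 = c(3, 1), timeB2 = c(5, 4),
                  colD = c("YY", "ZZ"))

# For rows with missing colB, look up colD where colC equals colA and
# timeA lies within [timeB1, timeB2]; rows without a match stay NA
DT1[is.na(colB),
    colB := DT2[DT1[is.na(colB)],
                on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]

print(DT1)
```

In this toy case the second row finds a match in DT2, while the third has no interval containing its timeA and therefore keeps its NA.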
R - fastest way to select the rows of a matrix that satisfy multiple conditions
Just use [ subsetting with a logical comparison:
# Reproducible data
set.seed(1)
m <- matrix( sample(12, 28, replace = TRUE) , 7 , 4 )
     [,1] [,2] [,3] [,4]
[1,]    4    8   10    3
[2,]    5    8    6    8
[3,]    7    1    9    2
[4,]   11    3   12    4
[5,]    3    3    5    5
[6,]   11    9   10    1
[7,]   12    5   12    5
# Subset according to condition
m[ m[,2] == 3 & m[,3] == 12 , ]
[1] 11 3 12 4
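When only one row matches, as here, [ drops the result to a plain vector. If downstream code expects a matrix, drop = FALSE keeps the dimensions. A small sketch with a fixed, made-up matrix:

```r
# Fixed toy matrix, filled column by column
m <- matrix(c( 4, 11,  3, 11,
               8,  3,  3,  9,
              10, 12,  5, 10,
               3,  4,  5,  1), nrow = 4)

keep <- m[, 2] == 3 & m[, 3] == 12   # only row 2 satisfies both conditions

as_vector <- m[keep, ]               # dimensions dropped
as_matrix <- m[keep, , drop = FALSE] # stays a 1 x 4 matrix

dim(as_vector)
dim(as_matrix)
```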
Summarize all group values and a conditional subset in the same call
Writing up @hadley's comment as an answer
df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
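The if(...) form above presumably works because the expression is translated to SQL for the remote table. On an ordinary in-memory data frame, if is not vectorised, so a sketch of the same summary (with made-up data, and assuming dplyr is installed) might subset inside sum() instead:

```r
library(dplyr)

# Toy local data frame, invented for this example
df <- data.frame(ID = c(1, 1, 2, 2),
                 A  = c("foo", "bar", "foo", "foo"),
                 B  = c(1, 2, 3, 4))

res <- df %>%
  group_by(ID) %>%
  summarize(sumB    = sum(B),              # all values in the group
            sumBfoo = sum(B[A == "foo"]))  # only rows where A is "foo"

res
```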
filter with dplyr based on a condition that only applies to a subset of data
easy peasy
df %>% filter((ColA == "x") | (ColA == "y" & ColB == 1))