R: Fast (Conditional) Subsetting Where Feasible

R: fast (conditional) subsetting where feasible

I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):

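f = function(x, ..., verbose=FALSE){
  # capture the unevaluated conditions passed via ...
  L   = substitute(list(...))[-1]
  # log table: one row per condition, flagged if the condition was skipped
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)){
    d = eval( substitute(x[cond, verbose=v], list(cond = L[[i]], v = verbose)) )
    if (nrow(d)){
      x = d                    # keep the subset only if it is non-empty
    } else {
      mon[i, skip := TRUE]     # otherwise record that the condition was skipped
    }
  }
  print(mon)
  return(x)
}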

Usage

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

The verbose option prints extra information provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), the output shows the index being used.
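To make the remark about indices concrete, here is a minimal sketch, assuming a small made-up table called dat_demo (it is not the dat from the question). setindex() and indices() are existing data.table functions; the wording of the verbose diagnostics differs between data.table versions, so no output is reproduced here.

```r
library(data.table)

# Made-up data, for illustration only
dat_demo = data.table(id = 1:6, x = c(117, 118, 119, 119, 120, 121))

# Build a secondary index on x; the rows are not physically reordered
setindex(dat_demo, x)
indices(dat_demo)

# An equality subset on x is the kind of query that can use the index;
# verbose = TRUE prints data.table's own diagnostics for the query
res = dat_demo[x == 119, verbose = TRUE]
```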

Regarding this part of the question: "because I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."

If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x), which makes it easier to keep track of what the query was and what happened. The verbose console output could also be captured and returned.
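A sketch of that non-interactive variant, applied over a list of tables with lapply. The name f2 and the toy tables are hypothetical, and the verbose argument is left out to keep the example short.

```r
library(data.table)

# Hypothetical variant of f(): returns the log instead of printing it
f2 = function(x, ...){
  L   = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]
  for (i in seq_along(L)){
    d = eval( substitute(x[cond], list(cond = L[[i]])) )
    if (nrow(d)) x = d else mon[i, skip := TRUE]
  }
  list(mon = mon, x = x)
}

# Made-up list of tables, for illustration only
tables = list(
  a = data.table(x = 1:5,   y = 6:10),
  b = data.table(x = 11:15, y = 16:20)
)

# Each element of res holds both the filtered table and its log
res = lapply(tables, function(dt) f2(dt, x > 3, y > 100))
res$a$mon
res$a$x
```

Keeping the log next to each result makes it possible to check afterwards which conditions were skipped for which table.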

fast subsetting in R

One of the main issues is the matching of row names -- the default in [.data.frame is partial matching of row names and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from fastmatch if you want. This is a minor modification with some speedup:

# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
   user  system elapsed
 69.207   5.545  74.787

# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
   user  system elapsed
 36.810  10.003  47.082

# fastmatch
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
   user  system elapsed
 19.145   3.012  22.226
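The difference in row-name matching can be seen on a toy data frame. This is only a sketch; df_demo and its row names are made up for illustration.

```r
# Made-up data frame with character row names
df_demo <- data.frame(v = 1:2, row.names = c("alpha", "beta"))

# `[.data.frame` resolves character row indices with partial matching,
# so the abbreviated name "al" still selects the "alpha" row
partial <- df_demo["al", ]

# match() is exact: "al" is not a row name, so the lookup yields NA
exact <- match("al", rownames(df_demo))
```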

You can get a further speedup by avoiding [ altogether (it is slow for data frames) and splitting the data frame with split instead. This works if the row sets in rows are non-overlapping and together cover all rows, so that each row maps to exactly one entry in rows.
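A minimal sketch of the split approach, assuming a made-up data frame dat_demo and a made-up list rows_demo whose entries are non-overlapping and cover every row:

```r
# Made-up data: six rows, three non-overlapping groups covering all rows
dat_demo  <- data.frame(a = 1:6, b = letters[1:6],
                        row.names = paste0("r", 1:6))
rows_demo <- list(g1 = c("r1", "r4"), g2 = c("r2", "r3", "r6"), g3 = "r5")

# Group label for each row name listed in rows_demo
grp <- rep(names(rows_demo), lengths(rows_demo))

# Position of each row of dat_demo within the flattened row list
idx <- match(rownames(dat_demo), unlist(rows_demo))

# One split() call replaces the repeated `[` calls
res <- split(dat_demo, factor(grp[idx], levels = names(rows_demo)))
```

Note that within each group the rows come back in the order they have in dat_demo, not in the order listed in rows_demo.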

Depending on your actual data, you may be better off with matrices, whose subsetting operators are far faster because they are implemented natively.
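As a sketch of the matrix route, assuming all columns are numeric (as.matrix coerces mixed column types to character, so this only pays off for homogeneous data). The names dat_num and wanted are made up.

```r
# Made-up, all-numeric data frame
dat_num <- data.frame(a = 1:4, b = c(2.5, 3.5, 4.5, 5.5),
                      row.names = paste0("r", 1:4))

m  <- as.matrix(dat_num)   # numeric matrix, row names are kept
rn <- rownames(m)

# Exact row-name lookup, then native matrix subsetting;
# drop = FALSE keeps the result a matrix even for a single row
wanted <- c("r4", "r2")
res <- m[match(wanted, rn), , drop = FALSE]
```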

is it possible to subset a data.frame based on a row range AND a logical condition in r?

You can do the subsetting in either of two ways.

  1. Based on a logical vector:
mtcars[seq(nrow(mtcars)) %in% 1:5 & mtcars$cyl==6,]

#                 mpg cyl disp  hp drat    wt  qsec vs am gear carb
#Mazda RX4       21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
#Mazda RX4 Wag   21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
#Hornet 4 Drive  21.4   6  258 110 3.08 3.215 19.44  1  0    3    1

  2. Based on a row range:
mtcars[intersect(1:5, which(mtcars$cyl==6)),]
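A quick check that both forms select the same rows, plus a two-step alternative that takes the row range first and applies the condition second. The variable names are only for illustration.

```r
# The two approaches from the answer
by_logical <- mtcars[seq(nrow(mtcars)) %in% 1:5 & mtcars$cyl == 6, ]
by_index   <- mtcars[intersect(1:5, which(mtcars$cyl == 6)), ]

# Alternative: restrict to the row range, then filter within it
first5   <- mtcars[1:5, ]
by_steps <- first5[first5$cyl == 6, ]

# All three should describe the same rows
all.equal(by_logical, by_index)
all.equal(by_logical, by_steps)
```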

conditional data.table match for subset of data.table

An in-place update of colB in DT1 works as follows:

DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
                             on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
print(DT1)
   colA colB timeA
1:    1    A     2
2:    1   YY     4
3:    2   AA     3
4:    2    B     4
5:    2 <NA>     6
6:    3    A     1
7:    3    C     4

This selects the rows where colB is NA, joins them to DT2 on the conditions given in on = ..., and replaces the missing values with the matching values found in colD.
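The answer does not show the input tables, so the following is a self-contained sketch with made-up DT1 and DT2 contents, chosen so that the update is consistent with the printed result. It assumes each NA row matches at most one row of DT2; with several matches per row the assignment would not line up.

```r
library(data.table)

# Made-up inputs (assumption: not the data from the question)
DT1 <- data.table(colA  = c(1, 1, 2, 2, 2, 3, 3),
                  colB  = c("A", NA, NA, "B", NA, "A", "C"),
                  timeA = c(2, 4, 3, 4, 6, 1, 4))
DT2 <- data.table(colC   = c(1, 2),
                  timeB1 = c(3, 2),      # interval start
                  timeB2 = c(5, 3),      # interval end
                  colD   = c("YY", "AA"))

# Non-equi join: same key, and timeA inside [timeB1, timeB2].
# Rows of DT1 without a match keep NA in colB.
DT1[is.na(colB), colB := DT2[DT1[is.na(colB)],
                             on = .(colC = colA, timeB1 <= timeA, timeB2 >= timeA), colD]]
```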

R - fastest way to select the rows of a matrix that satisfy multiple conditions

Just use [ subsetting with logical comparison...

#  Reproducible data
set.seed(1)
m <- matrix( sample(12, 28, replace = TRUE) , 7 , 4 )
     [,1] [,2] [,3] [,4]
[1,]    4    8   10    3
[2,]    5    8    6    8
[3,]    7    1    9    2
[4,]   11    3   12    4
[5,]    3    3    5    5
[6,]   11    9   10    1
[7,]   12    5   12    5

# Subset according to condition
m[ m[,2] == 3 & m[,3] == 12 , ]
[1] 11  3 12  4
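When only one row satisfies the conditions, as here, [ drops the result to a plain vector. If downstream code expects a matrix, drop = FALSE keeps the dimensions. The sketch below rebuilds the printed matrix from its values rather than from sample(), so it does not depend on the random number generator.

```r
# The matrix printed above, entered column by column
m <- matrix(c( 4,  5,  7, 11,  3, 11, 12,
               8,  8,  1,  3,  3,  9,  5,
              10,  6,  9, 12,  5, 10, 12,
               3,  8,  2,  4,  5,  1,  5), nrow = 7, ncol = 4)

# Combine the conditions into one logical vector, one element per row
keep <- m[, 2] == 3 & m[, 3] == 12

# drop = FALSE returns a 1 x 4 matrix instead of a length-4 vector
res <- m[keep, , drop = FALSE]
```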

Summarize all group values and a conditional subset in the same call

Writing up @hadley's comment as an answer:

df_sqlite %>%
  group_by(ID) %>%
  mutate(Bfoo = if(A=="foo") B else 0) %>%
  summarize(sumB = sum(B),
            sumBfoo = sum(Bfoo)) %>%
  collect
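The snippet above runs against a SQLite-backed table. In plain R, if is not vectorized, so for an in-memory data frame a vectorized form is needed. The following is a sketch under that assumption, with a made-up df_local; it subsets B inside sum() instead of creating a helper column.

```r
library(dplyr)

# Made-up in-memory data (assumption: not the SQLite table from the question)
df_local <- data.frame(ID = c(1, 1, 2, 2),
                       A  = c("foo", "bar", "foo", "foo"),
                       B  = c(10, 20, 30, 40))

res <- df_local %>%
  group_by(ID) %>%
  summarize(sumB    = sum(B),               # all values in the group
            sumBfoo = sum(B[A == "foo"]))   # only rows where A is "foo"
```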

filter with dplyr based on a condition that only applies to a subset of data

easy peasy

df  %>% filter((ColA == "x") | (ColA == "y" & ColB == 1))
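A small made-up example shows the effect: rows with ColA == "x" are kept unconditionally, rows with ColA == "y" only when ColB is 1, and everything else is dropped. The data frame df_demo is hypothetical.

```r
library(dplyr)

# Made-up data, for illustration only
df_demo <- data.frame(ColA = c("x", "y", "y", "z"),
                      ColB = c(5, 1, 2, 1))

# The extra condition on ColB only applies to the "y" rows
res <- df_demo %>% filter((ColA == "x") | (ColA == "y" & ColB == 1))
```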

