Faster Way to Subset on Rows of a Data Frame in R

What's the fastest way to subset a data.table?

Using the binary-search-based subset feature (on a keyed data.table) is the fastest. Note that the subset requires the option nomatch = 0L so that only matching rows are returned.

How to subset by one of the keys only with two keys set?

If you have two keys set on DT and you want to subset by the first key, then you can just provide the first value in J(.); there's no need to provide anything for the second key. That is:

# will return all columns where the first key column matches 22
DT[J(22), nomatch=0L]

If instead you'd like to subset by the second key, then, as of now, you'll have to provide all the unique values for the first key. That is:

# will return all columns where 2nd key column matches 2
DT[J(unique(V1), 2), nomatch=0L]

This is also shown in this SO post, although I'd prefer DT[J(, 2)] to work for this case, as that seems rather intuitive.

There's also a pending feature request, FR #1007, for implementing secondary keys, which, when implemented, would take care of this.

Here is a better example:

DT = data.table(c(1,2,3,4,5), c(2,3,2,3,2))
DT
#    V1 V2
# 1:  1  2
# 2:  2  3
# 3:  3  2
# 4:  4  3
# 5:  5  2
setkey(DT, V1, V2)
DT[J(unique(V1), 2)]
#    V1 V2
# 1:  1  2
# 2:  2  2
# 3:  3  2
# 4:  4  2
# 5:  5  2
DT[J(unique(V1), 2), nomatch=0L]
#    V1 V2
# 1:  1  2
# 2:  3  2
# 3:  5  2
DT[J(3), nomatch=0L]
#    V1 V2
# 1:  3  2

In summary:

# key(DT) = c("V1", "V2")

# data.frame                        | data.table equivalent
# =====================================================================
# subset(DF, (V1 == 3) & (V2 == 2)) | DT[J(3,2), nomatch=0L]
# subset(DF, (V1 == 3))             | DT[J(3), nomatch=0L]
# subset(DF, (V2 == 2))             | DT[J(unique(V1), 2), nomatch=0L]
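The equivalences in the table above can be checked directly. Here's a minimal sketch, assuming data.table is installed, using the same five-row example as before:

```r
library(data.table)

# Five-row example from above, as both a data.frame and a keyed data.table
DF <- data.frame(V1 = c(1, 2, 3, 4, 5), V2 = c(2, 3, 2, 3, 2))
DT <- as.data.table(DF)
setkey(DT, V1, V2)

# subset by both keys vs. the data.frame way
r1 <- subset(DF, (V1 == 3) & (V2 == 2))
r2 <- DT[J(3, 2), nomatch = 0L]

# subset by the second key only, supplying unique(V1) for the first
r3 <- subset(DF, V2 == 2)
r4 <- DT[J(unique(V1), 2), nomatch = 0L]
```

Both pairs return the same rows; only the keyed data.table versions use binary search.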

Is there a way to speed up subsetting of smaller data.frames?

You will see a performance boost by converting to matrices. This is a viable alternative if the whole content of your data.frame is numeric (or can be converted without too much trouble).

Here we go. First, I modified the data to have size 200x30:

library(data.table)
nmax = 200
cmax = 30
set.seed(1)
x <- runif(nmax, min = 0, max = 10)
DF = data.frame(x)
for (i in 2:cmax) {
  DF = cbind(DF, runif(nmax, min = 0, max = 10))
  colnames(DF)[ncol(DF)] = paste0('x', i)
}
DT = data.table(DF)
DM = as.matrix(DF) # or data.matrix(DF) if you have factors

And the comparison, ranked from quickest to slowest:

summary(microbenchmark::microbenchmark(
  DM[DM[, 'x'] > 5, ],            # quickest
  as.matrix(DF)[DF$x > 5, ],      # still quicker, even with the conversion
  DF[DF$x > 5, ],
  `[.data.frame`(DT, DT$x < 5, ),
  DT[x > 5],
  times = 100L, unit = "us"))

# expr min lq mean median uq max neval
# 1 DM[DM[, "x"] > 5, ] 13.883 19.8700 22.65164 22.4600 24.9100 41.107 100
# 2 as.matrix(DF)[DF$x > 5, ] 141.100 181.9140 196.02329 195.7040 210.2795 304.989 100
# 3 DF[DF$x > 5, ] 198.846 238.8085 260.07793 255.6265 278.4080 377.982 100
# 4 `[.data.frame`(DT, DT$x < 5, ) 212.342 268.2945 346.87836 289.5885 304.2525 5894.712 100
# 5 DT[x > 5] 322.695 396.3675 465.19192 428.6370 457.9100 4186.487 100

If your use case involves querying the data multiple times, then you can do the conversion only once and increase the speed by an order of magnitude.
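A minimal base-R sketch of that pattern (the data frame and column names here are made up for illustration): convert once, then reuse the matrix for every query.

```r
set.seed(1)
DF <- data.frame(x = runif(200, 0, 10), y = runif(200, 0, 10))

# One-time conversion; only valid because all columns are numeric
DM <- as.matrix(DF)

# Every subsequent query hits the fast matrix subsetting operator
hits1 <- DM[DM[, "x"] > 5, , drop = FALSE]
hits2 <- DM[DM[, "y"] < 2, , drop = FALSE]
```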

The fastest way to subset selected columns from a dataset with R

We can use intersect:

subdf2 <- df2[intersect(names(df1), names(df2))]
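For example, with two toy data frames (names are illustrative), only the columns of df2 that also appear in df1 are kept:

```r
df1 <- data.frame(a = 1:3, b = 4:6)
df2 <- data.frame(b = 7:9, c = 10:12)

# Keep only df2's columns whose names also occur in df1
subdf2 <- df2[intersect(names(df1), names(df2))]
names(subdf2)
```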

R: fast (conditional) subsetting where feasible

I agree with Konrad's answer that this should throw a warning or at least report what happens somehow. Here's a data.table way that will take advantage of indices (see package vignettes for details):

f = function(x, ..., verbose = FALSE) {
  L = substitute(list(...))[-1]
  mon = data.table(cond = as.character(L))[, skip := FALSE]

  for (i in seq_along(L)) {
    d = eval(substitute(x[cond, verbose = v], list(cond = L[[i]], v = verbose)))
    if (nrow(d)) {
      x = d
    } else {
      mon[i, skip := TRUE]
    }
  }
  print(mon)
  return(x)
}

Usage

> f(dat, x > 119, y > 219, y > 1e6)
        cond  skip
1:   x > 119 FALSE
2:   y > 219 FALSE
3: y > 1e+06  TRUE
   id        x        y        z
1: 55 119.2634 219.0044 315.6556

The verbose option will print extra info provided by the data.table package, so you can see when indices are being used. For example, with f(dat, x == 119, verbose=TRUE), I can see that the index is used.

From the question: "I fear the if-then jungle would be rather slow, especially since I need to apply all of this to different data.tables within a list using lapply(.)."

If it's for non-interactive use, it may be better to have the function return list(mon = mon, x = x), to more easily keep track of what the query was and what happened. The verbose console output could also be captured and returned.
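A base-R sketch of that list-returning shape (f2 is a hypothetical name; this drops the data.table-specific indexing and just shows the return structure):

```r
# Try each condition in order; skip any that would empty the result,
# and report what happened alongside the final subset.
f2 <- function(x, ...) {
  L <- substitute(list(...))[-1]
  mon <- data.frame(cond = vapply(as.list(L), function(e) deparse(e)[1],
                                  character(1)),
                    skip = FALSE)
  for (i in seq_along(L)) {
    keep <- eval(L[[i]], x, parent.frame())
    d <- x[keep, , drop = FALSE]
    if (nrow(d)) x <- d else mon$skip[i] <- TRUE
  }
  list(mon = mon, x = x)
}

dat <- data.frame(id = 1:5, x = c(50, 119.3, 80, 130, 90))
res <- f2(dat, x > 119, x > 1e6)
```

res$mon records which conditions were skipped, and res$x is the surviving subset.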

fast subsetting in R

One of the main issues is the matching of row names: the default in [.data.frame is partial matching of row names, and you probably don't want that, so you're better off with match. To speed it up even further you can use fmatch from the fastmatch package. This is a minor modification with some speedup:

# naive
> system.time(res1 <- lapply(rows,function(r) dat[r,]))
user system elapsed
69.207 5.545 74.787

# match
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[match(r,rn),]))
user system elapsed
36.810 10.003 47.082

# fastmatch
> rn <- rownames(dat)
> system.time(res1 <- lapply(rows,function(r) dat[fmatch(r,rn),]))
user system elapsed
19.145 3.012 22.226

You can get a further speedup by not using [ (it is slow for data frames) but instead splitting the data frame with split, provided your rows are non-overlapping and cover all rows (so each row maps to exactly one entry in rows).
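A sketch of the split idea under those assumptions (non-overlapping groups covering every row; the data and group names are made up):

```r
dat <- data.frame(val = 1:6,
                  grp = rep(c("a", "b", "c"), each = 2))

# One split replaces many repeated `[` calls;
# each list element holds exactly one group's rows
chunks <- split(dat, dat$grp)
```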

Depending on your actual data, you may be better off with matrices, which have by far faster subsetting operators since they are native.
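Combining the last two points, here is a small base-R sketch (toy data): move the numeric columns into a matrix once, then index rows by name natively.

```r
dat <- data.frame(x = c(1.5, 2.5, 3.5), y = c(10, 20, 30),
                  row.names = c("r1", "r2", "r3"))

# One-time conversion; matrix row-name indexing is exact, not partial
m <- as.matrix(dat)
sub <- m[c("r1", "r3"), , drop = FALSE]
```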


