Data.Table in R - Multiple Filters Using Multiple Keys - Binary Search

The reason you didn't get an error from your query is that data.table silently recycles values when the lengths are multiples of one another. In other words, because the single 1 for am can be recycled to length 2, it does this without telling you. If you did a query where the lengths of the value vectors weren't multiples of each other, then you would get a warning. For example

DT[.(c(1,0),c(5,4,3),c(8,6,4))]

will give you a warning complaining about a remainder of 1 item, the same warning you would see when typing data.table(c(1,0),c(5,4,3),c(8,6,4)). Whenever merging X[Y], both X and Y should be thought of as data.tables.

If you instead use CJ,

DT[CJ(c(1,0),c(5,4,3),c(8,6,4))]

then it will make every combination of all the values for you and data.table will give the results you expect.
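If you're curious what CJ builds, you can call it directly; it returns a keyed data.table holding every combination of the supplied values (2 x 3 x 3 = 18 rows). A sketch of its output:

CJ(c(1,0), c(5,4,3), c(8,6,4))
#     V1 V2 V3
#  1:  0  3  4
#  2:  0  3  6
#  3:  0  3  8
#  ...
# 18:  1  5  8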

From the vignette:

What’s happening here? Read this again. The value provided for the
second key column “MIA” has to find the matching values in the dest key
column on the matching rows provided by the first key column origin.
We can not skip the values of key columns before. Therefore we provide
all unique values from key column origin. “MIA” is automatically
recycled to fit the length of unique(origin) which is 3.

Just for completeness, the vector scan syntax will work without using CJ

DT[am == 1 & gear == 4 & carb == 4]

or

DT[am == 1 & (gear == 3 | gear == 4) & (carb == 4 | carb == 2)]
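or, equivalently but a little more compactly, with %in% (assuming no NAs in the filter columns):

DT[am == 1 & gear %in% c(3, 4) & carb %in% c(4, 2)]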

How do you know if you need a binary search? If the speed of subsetting is unbearable, then you need a binary search. For example, I've got a 48M-row data.table I'm playing with, and the difference between a binary search and a vector scan is staggering. Specifically, a vector scan takes 1.490 seconds of elapsed time, but a binary search takes only 0.001 seconds. That, of course, assumes I've already keyed the data.table. If I include the time it takes to set the key, then the combination of setting the key and performing the subset takes 1.628 seconds. So you have to pick your poison.
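A rough sketch of how you'd compare the two yourself (the table and sizes here are made up for illustration, not the 48M-row data mentioned above):

library(data.table)
set.seed(1)
DT <- data.table(am   = sample(0:1, 1e7, TRUE),
                 gear = sample(3:5, 1e7, TRUE),
                 carb = sample(1:8, 1e7, TRUE))

system.time(DT[am == 1 & gear == 4 & carb == 4])  # vector scan

system.time(setkey(DT, am, gear, carb))           # pay the sort cost once
system.time(DT[.(1L, 4L, 4L)])                    # keyed subset (binary search)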

Filter data.table on same condition for multiple columns

We could use get

sel.col <- "cyl"
dt[get(sel.col) == 4, ..sel.col]
# cyl gear
#1: 4 4

or eval(as.name)

dt[eval(as.name(sel.col)) == 4, ..sel.col]
# cyl gear
#1: 4 4

Earlier, we assumed there was only a single column to be evaluated. If we have more than one column, specify them in .SDcols, loop through the Subset of Data.table (.SD), compare each column with the value of interest (4), Reduce the result to a single logical vector with | (i.e., TRUE if any of the columns match in that row), and subset the rows based on that:

dt[dt[, Reduce(`|`, lapply(.SD, `==`, 4)),.SDcols = sel.col], ..sel.col]
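For example, with two columns to check (assuming, as in the output above, that dt contains both cyl and gear):

sel.col <- c("cyl", "gear")
dt[dt[, Reduce(`|`, lapply(.SD, `==`, 4)), .SDcols = sel.col], ..sel.col]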

Subsetting a data.table by range making use of binary search

Interesting question. First let's look at the example data :

> print(DT)
x y
1: 2.607703e-07 5.748127
2: 8.894131e-07 5.233994
3: 1.098961e-06 9.834267
4: 1.548324e-06 2.016585
5: 1.569279e-06 7.957730
---
9999996: 9.999996e+00 9.977782
9999997: 9.999998e+00 2.666575
9999998: 9.999999e+00 6.869967
9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(DT$x)
[1] 10000000
> length(unique(DT$x))
[1] 9988478
> length(DT$y)
[1] 10000000
> length(unique(DT$y))
[1] 9988225
> DT[,.N,by=x][,table(N)]
N
1 2 3
9976965 11504 9
> DT[,.N,by="x,y"][,table(N)]
N
1
10000000
>

So there are almost 10 million unique floating point values in the first column: a few groups of 2 and 3 rows, but mostly groups of 1 row. Once the second column is included, there are 10 million unique groups of 1 row each. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g. (id, date), (id1, id2, date, time), etc.

However, data.table and setkey do support floating point data in keys, so let's give it a go.

On my slow netbook :

> system.time(setkey(DT,x,y))
user system elapsed
7.097 0.520 7.650

> system.time(DT[x>5 & y<7])
user system elapsed
2.820 0.292 3.122

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given that the data is floating point and almost unique, this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

Compare to base for example, just sorting x not even y as well :

> system.time(base::order(x))
user system elapsed
72.445 0.292 73.072

Assuming this data is representative of your real data, and you don't want to do this just once but several times, so are willing to pay the price of setkey, the first step is pretty clear :

system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
user system elapsed
0.004 0.000 0.003
> w
[1] 4999902

But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE] but in this case the data is so unique and floating point that's going to be slow.

Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be :

DT[.( c(5,Inf), c(-Inf,7) )]

or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.
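For what it's worth, non-equi joins have since been implemented (data.table v1.9.8+), so a range subset like this can now be expressed as a join directly. A sketch (the lo/hi names are just illustrative):

DT[data.table(lo = 5, hi = 7),
   on = .(x > lo, y < hi),
   nomatch = 0L,
   .(x = x.x, y = x.y)]  # x.* prefixes recover DT's own column values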

The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.

See also :

How to self join a data.table on a condition

Remove a range in data.table

Subsetting a data.table with a binary search by range in two columns

My understanding is that a rolling join uses binary search but only on the last join key, hence it is not possible to perform a rolling join on 4 keys simultaneously. In addition, your values are non-integer, and hence it is not possible to pinpoint the 4 corners using binary search.

Having said that, here are some options to speed up the subsetting, with the non-equi join being the fastest, although I faced some memory limitation issues at your dimensions:

# m0: loop over the ranges, subsetting DT with a vector scan for each one
m0 <- function()
  lapply(seq_len(nrow(ranges)), function(i) {
    DT[x %between% c(ranges[i, ]$x_min, ranges[i, ]$x_max) &
       y %between% c(ranges[i, ]$y_min, ranges[i, ]$y_max)]
  })

# m1: the same vector scan, but grouped by row of 'ranges'
m1 <- function()
  ranges[, DT[x %between% c(x_min, x_max) & y %between% c(y_min, y_max)],
         1L:nrow(ranges)]

# m2: a single non-equi join of DT onto all ranges at once
m2 <- function() {
  setkey(DT, x, y)
  setDT(ranges, key = c("x_min", "x_max", "y_min", "y_max"))
  DT[ranges, on = .(x >= x_min, x <= x_max, y >= y_min, y <= y_max),
     allow.cartesian = TRUE, .(x.x, x.y, x.z)]
}

# m3: rolling joins on x to bound the row numbers, then a non-equi join on y
m3 <- function() {
  setkey(DT3, x)[, rn := .I]
  ranges[, ixmin := DT3[.SD, on = .(x = x_min), roll = -Inf, rn]]
  ranges[, ixmax := DT3[.SD, on = .(x = x_max), roll = Inf, rn]]

  setkey(DT3, y)
  DT3[DT3[ranges, on = .(y >= y_min, y <= y_max),
          by = .EACHI, .(rn = rn[rn %between% c(ixmin, ixmax)])],
      on = .(rn), .(x, y, z)]
}

microbenchmark::microbenchmark(times=1L, m0(), m1(), m2(), m3())

timings:

Unit: milliseconds
 expr      min       lq     mean   median       uq      max neval
 m0() 782.6070 782.6070 782.6070 782.6070 782.6070 782.6070     1
 m1() 713.9469 713.9469 713.9469 713.9469 713.9469 713.9469     1
 m2() 272.6018 272.6018 272.6018 272.6018 272.6018 272.6018     1
 m3() 765.3667 765.3667 765.3667 765.3667 765.3667 765.3667     1

data:

library(data.table)
set.seed(0L)
nr <- 2e4L
nrng <- 1e3L
dat <- data.table(x=runif(nr), y=runif(nr), z=runif(nr))
ranges <- data.frame(x_min=runif(nrng, max = 0.5), x_max=runif(nrng, min = 0.5),
                     y_min=runif(nrng, max = 0.5), y_max=runif(nrng, min = 0.5))
dat[, rn := .I]

DT3 <- copy(dat)
DT <- copy(dat)

Assigning by reference in R data.table with i expression provided in string variable

Another way to achieve this is to use a join to select the rows:

library(data.table)
dt <- as.data.table(mtcars)
filterValues <- list(cyl = c(4, 6),
                     gear = c(3),
                     carb = c(1))
dt[do.call(CJ, filterValues), on = names(filterValues), filtered := TRUE][]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb filtered
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 NA
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 TRUE
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 NA
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 NA
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 NA
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 NA
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 NA
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 NA
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 NA
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 NA
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 NA
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 TRUE
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 NA
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 NA
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 NA
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 NA
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 NA
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 NA
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 NA
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 NA
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
mpg cyl disp hp drat wt qsec vs am gear carb filtered

or

dt <- as.data.table(mtcars)
dt[do.call(CJ, filterValues), on = names(filterValues), nomatch = 0L]
    mpg cyl  disp  hp drat    wt  qsec vs am gear carb
1: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
2: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
3: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

You only need to specify the list of filterValues. do.call(CJ, filterValues) (cross join) creates a data.table with all combinations to select the rows by:

   cyl gear carb
1: 4 3 1
2: 6 3 1

Edit

The OP has asked if this could be extended to inequalities.

This can be done with data.table's non-equi joins but the setup is somewhat different. E.g.,

filterIntervals <- list(disp = c(200, 300),
                        mpg = c(10, 20))
mDT <- dcast(melt(filterIntervals), . ~ L1 + rowid(L1))
filterCondition <- c("disp>=disp_1", "disp<disp_2", "mpg>mpg_1", "mpg<mpg_2")
dt[mDT, on = filterCondition, filtered := TRUE][]
     mpg cyl  disp  hp drat    wt  qsec vs am gear carb filtered
1: 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 NA
2: 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 NA
3: 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 NA
4: 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 NA
5: 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2 NA
6: 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 TRUE
7: 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4 NA
8: 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2 NA
9: 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 NA
10: 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 NA
11: 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 NA
12: 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3 TRUE
13: 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3 TRUE
14: 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3 TRUE
15: 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4 NA
16: 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4 NA
17: 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 NA
18: 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 NA
19: 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 NA
20: 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 NA
21: 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 NA
22: 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2 NA
23: 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2 NA
24: 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4 NA
25: 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2 NA
26: 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 NA
27: 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 NA
28: 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 NA
29: 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 NA
30: 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 NA
31: 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 NA
32: 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 NA
mpg cyl disp hp drat wt qsec vs am gear carb filtered
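For reference, the intermediate mDT built by the melt()/dcast() step should be a single-row table holding the interval bounds, roughly like this (the exact shape may differ slightly; what matters is that it has the columns disp_1, disp_2, mpg_1, mpg_2 that the join conditions in filterCondition refer to):

mDT
#    . disp_1 disp_2 mpg_1 mpg_2
# 1: .    200    300    10    20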

data.table - Selecting by comparing list of columns to list of values

Assuming that the 'values' to be filtered are the ones corresponding to the 'columns' selected, we can do a comparison with Map and Reduce with &

dt[dt[ , Reduce(`&`, Map(`==`, .SD, values)) , .SDcols = columns]]

As a function

filterDT <- function(dt, columns, values) {
    dt[dt[, Reduce(`&`, Map(`==`, .SD, values)), .SDcols = columns]]
}

filterDT(dt, c("V1", "V3"), c(1, 2))
# V1 V2 V3
#1: 1 4 2

Or another option is setkey

setkeyv(dt, c("V1", "V3"))
dt[.(1, 2)]
# V1 V2 V3
#1: 1 4 2

Round to a multiple and filter in data.table

It's a problem of floating point precision.
Using DT[abs(a - 112.3) < 1.e-6 & b == 'B', ], i.e. comparing against an error margin of 0.000001, will give you the proper result.

If you want more precision you can use .Machine$double.eps^0.5 as does all.equal.

The general advice is to never test floats for exact equality, but instead to compare the difference against a value near enough to machine precision to get around the precision drift between 0 and 1; more details here.
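A quick illustration of why exact comparison fails and the tolerance-based check works:

0.1 + 0.2 == 0.3                                   # FALSE, precision drift
abs((0.1 + 0.2) - 0.3) < .Machine$double.eps^0.5   # TRUE, the tolerance all.equal() uses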

One way to fix your problem could be to refactor your function to:

mround <- function(number, multiple,
                   digits = nchar(strsplit(as.character(multiple), ".", fixed = TRUE)[[1]][2])) {
    round(multiple * round(number / multiple), digits)
}

I used a "convoluted" method to derive the default number of digits from the multiple passed in; adapt it to your needs (you may use 2 here, for example, or force the precision when calling).

I also removed the unnecessary return(), which just makes the interpreter look up one more function at the end of the call.

This way your output should be precise enough, but you'll still have corner cases:

> mround(112.37,0.2)
[1] 112.40000000000001

To use floats in joins, you can use (courtesy of David Arenburg):

setNumericRounding(1)
DT[.(112.3, 'B'), nomatch = 0L, on = .(a, b)]

What is the purpose of setting a key in data.table?

In addition to this answer, please refer to the vignettes Secondary indices and auto indexing and Keys and fast binary search based subset as well.

This issue highlights the other vignettes that we plan to get to.


I've updated this answer again (Feb 2016) in light of the new on= feature that allows ad-hoc joins as well. See history for earlier (outdated) answers.

What exactly does setkey(DT, a, b) do?

It does two things:

  1. reorders the rows of the data.table DT by the column(s) provided (a, b) by reference, always in increasing order.
  2. marks those columns as key columns by setting an attribute called sorted to DT.

The reordering is both fast (due to data.table's internal radix sorting) and memory efficient (only one extra column of type double is allocated).

When is setkey() required?

For grouping operations, setkey() was never an absolute requirement. That is, we can perform a cold-by or adhoc-by.

## "cold" by
require(data.table)
DT <- data.table(x=rep(1:5, each=2), y=1:10)
DT[, mean(y), by=x] # no key is set, order of groups preserved in result

However, prior to v1.9.6, joins of the form x[i] required the key to be set on x. With the new on= argument from v1.9.6+, this is not true anymore, and setting keys is therefore not an absolute requirement here either.

## joins using < v1.9.6 
setkey(X, a) # absolutely required
setkey(Y, a) # not absolutely required as long as 'a' is the first column
X[Y]

## joins using v1.9.6+
X[Y, on="a"]
# or if the column names are x_a and y_a respectively
X[Y, on=c("x_a" = "y_a")]

Note that the on= argument can be explicitly specified even for keyed joins.

The only operation that absolutely requires the key to be set is the foverlaps() function. But we are working on some more features which, when done, would remove this requirement.
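For reference, a minimal foverlaps() sketch (table names and values are made up) showing where the key is still required:

library(data.table)
x <- data.table(start = c(5L, 31L), end = c(11L, 50L))
y <- data.table(start = c(1L, 15L, 30L), end = c(10L, 20L, 40L))
setkey(y, start, end)                          # foverlaps() needs the key on y
foverlaps(x, y, type = "any", nomatch = 0L)    # overlap join of x's intervals onto y's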

  • So what's the reason for implementing on= argument?

    There are quite a few reasons.

    1. It allows us to clearly see that the operation involves two data.tables. Just doing X[Y] does not make this as obvious, although it could be clear from naming the variables appropriately.

    2. It also allows one to understand, immediately and just by looking at that line of code, which columns the join/subset is being performed on (without having to trace back to the corresponding setkey() line).

    3. In operations where columns are added or updated by reference, on= operations are much more performant, as they don't need the entire data.table to be reordered just to add/update column(s). For example,

       ## compare 
      setkey(X, a, b) # why physically reorder X to just add/update a column?
      X[Y, col := i.val]

      ## to
      X[Y, col := i.val, on=c("a", "b")]

      In the second case, we did not have to reorder. It's not computing the order that's time consuming, but physically reordering the data.table in RAM; by avoiding that, we retain the original order, and it is still performant.

    4. Even otherwise, unless you're performing joins repetitively, there should be no noticeable performance difference between keyed and ad-hoc joins.

This leads to the question, what advantage does keying a data.table have anymore?

  • Is there an advantage to keying a data.table?

    Keying a data.table physically reorders it in RAM based on those column(s). Computing the order is not usually the time consuming part; the reordering itself is. However, once we have the data sorted in RAM, the rows belonging to the same group are all contiguous, which makes operations very cache efficient. It's the sortedness that speeds up operations on keyed data.tables.

    It is therefore essential to figure out whether the time spent reordering the entire data.table is paid back by the cache-efficient join/aggregation that follows. Usually, unless there are repetitive grouping / join operations being performed on the same keyed data.table, there should not be a noticeable difference.

In most cases, therefore, there shouldn't be a need to set keys anymore. We recommend using on= wherever possible, unless setting the key provides a dramatic improvement in performance that you'd like to exploit.

Question: What do you think the performance would be like, in comparison to a keyed join, if you use setorder() to reorder the data.table and then use on=? If you've followed thus far, you should be able to figure it out :-).


