Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan


Yes, you can pass all values for the first key column and subset with the specific value for the second key column.

DT[J(unique(x), 25), nomatch=0]
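
For context (the question's own data isn't reproduced here), here is a minimal sketch of the kind of table this applies to, assuming a two-column key on (x, y):

library(data.table)
DT <- data.table(x = rep(letters[1:5], each = 4), y = rep(c(24, 25, 26, 27), times = 5))
setkey(DT, x, y)                     # two-column key: x first, y second
DT[J(unique(x), 25), nomatch=0]      # every x paired with y == 25, found by binary search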

If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:

DT[CJ(unique(x), 25:24), nomatch=0]

Note that CJ by default sorts the columns and sets the key to all the columns, which means the result would be sorted as well. If that's not desirable, use sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]

There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function set2key.

FR#1007 Build in secondary keys

There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so it should be faster than base merge. See ?merge.data.table.
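
Continuing the same hypothetical sketch as above, the merge route to subset on the second key column might look like this:

merge(DT, data.table(y = 25), by = "y")   # merge.data.table builds the key it needs on y internally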

Subsetting a data.table with a binary search by range in two columns

My understanding is that a rolling join uses binary search but only on the last joining key, hence it is not possible to perform a rolling join on 4 keys simultaneously. In addition, your values are non-integer, so it is not possible to pinpoint the 4 corners using binary search.

Having said that, here are some options to speed up the subsetting, with the non-equi join being the fastest, though I faced some memory limitation issues at your dimensions:

m0 <- function()
    lapply(seq_len(nrow(ranges)), function(i) {
        DT[x %between% c(ranges[i, ]$x_min, ranges[i, ]$x_max) &
           y %between% c(ranges[i, ]$y_min, ranges[i, ]$y_max)]
    })

m1 <- function()
    ranges[, DT[x %between% c(x_min, x_max) & y %between% c(y_min, y_max)], 1L:nrow(ranges)]

m2 <- function() {
    setkey(DT, x, y)
    setDT(ranges, key=c("x_min", "x_max", "y_min", "y_max"))
    DT[ranges, on=.(x>=x_min, x<=x_max, y>=y_min, y<=y_max), allow.cartesian=TRUE, .(x.x, x.y, x.z)]
}

m3 <- function() {
    setkey(DT3, x)[, rn := .I]
    ranges[, ixmin := DT3[.SD, on=.(x=x_min), roll=-Inf, rn]]
    ranges[, ixmax := DT3[.SD, on=.(x=x_max), roll=Inf, rn]]

    setkey(DT3, y)
    DT3[DT3[ranges, on=.(y>=y_min, y<=y_max),
            by=.EACHI, .(rn=rn[rn %between% c(ixmin, ixmax)])], on=.(rn),
        .(x, y, z)]
}

microbenchmark::microbenchmark(times=1L, m0(), m1(), m2(), m3())

timings:

Unit: milliseconds
expr min lq mean median uq max neval
m0() 782.6070 782.6070 782.6070 782.6070 782.6070 782.6070 1
m1() 713.9469 713.9469 713.9469 713.9469 713.9469 713.9469 1
m2() 272.6018 272.6018 272.6018 272.6018 272.6018 272.6018 1
m3() 765.3667 765.3667 765.3667 765.3667 765.3667 765.3667 1

data:

library(data.table)
set.seed(0L)
nr <- 2e4L
nrng <- 1e3L
dat <- data.table(x=runif(nr), y=runif(nr), z=runif(nr))
ranges <- data.table(x_min=runif(nrng, max = 0.5), x_max=runif(nrng, min = 0.5),
                     y_min=runif(nrng, max = 0.5), y_max=runif(nrng, min = 0.5))
dat[, rn := .I]

DT3 <- copy(dat)
DT <- copy(dat)

Subsetting a data.table by range making use of binary search

Interesting question. First let's look at the example data:

> print(DT)
x y
1: 2.607703e-07 5.748127
2: 8.894131e-07 5.233994
3: 1.098961e-06 9.834267
4: 1.548324e-06 2.016585
5: 1.569279e-06 7.957730
---
9999996: 9.999996e+00 9.977782
9999997: 9.999998e+00 2.666575
9999998: 9.999999e+00 6.869967
9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(DT$x)
[1] 10000000
> length(unique(DT$x))
[1] 9988478
> length(DT$y)
[1] 10000000
> length(unique(DT$y))
[1] 9988225
> DT[,.N,by=x][,table(N)]
N
1 2 3
9976965 11504 9
> DT[,.N,by="x,y"][,table(N)]
N
1
10000000
>

So there are almost 10 million unique floating point values in the first column: a few groups of 2 and 3 rows, but mostly 1-row groups. Once the second column is included, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g. (id, date), (id1, id2, date, time), etc.

However, data.table and setkey do support floating point data in keys, so let's give it a go.
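
(To follow along, data of roughly this shape, 10 million rows of nearly-unique doubles, can be simulated as below; this is an assumption, since the question's actual generation code isn't shown here.)

library(data.table)
set.seed(1)
DT <- data.table(x = runif(1e7, 0, 10), y = runif(1e7, 0, 10))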

On my slow netbook:

> system.time(setkey(DT,x,y))
user system elapsed
7.097 0.520 7.650

> system.time(DT[x>5 & y<7])
user system elapsed
2.820 0.292 3.122

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given the data is floating point and almost unique, this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

Compare to base R, for example, sorting just x (not even y as well):

> system.time(base::order(DT$x))
user system elapsed
72.445 0.292 73.072

Assuming this data is representative of your real data, and that you don't want to do this just once but several times (so are willing to pay the price of setkey), the first step is pretty clear:

> system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
user system elapsed
0.004 0.000 0.003
> w
[1] 4999902

But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE], but in this case the data is so unique and floating point that this is going to be slow.
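
For completeness, that ugly two-step route would look something like this (a sketch: the roll=TRUE binary search finds the last row with x <= 5, then the tail is copied and vector scanned on y):

w <- DT[.(5), which=TRUE, roll=TRUE]   # row number of the last x <= 5, via binary search
ans <- DT[(w+1):nrow(DT)][y < 7]       # copy the x > 5 tail, then vector scan on y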

Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be:

DT[.( c(5,Inf), c(-Inf,7) )]

or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.
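
(Since this answer was written, non-equi joins were added in data.table v1.9.8, so a range subset like this can now be expressed as a join; a sketch, assuming that version or later, where xlim and ylim are just made-up names for the join columns:)

DT[.(xlim = 5, ylim = 7), on = .(x > xlim, y < ylim), nomatch = 0L,
   .(x = x.x, y = x.y)]   # x.x and x.y refer to DT's own columns in the join result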

The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.

See also:

How to self join a data.table on a condition

Remove a range in data.table

Select rows in data.table using only the second key

Try this:

> a[ .(unique(id), 1),, nomatch = 0 ]
id var
1: a 1
2: c 1

ADDED. We could avoid having to scan id by making it a factor and using the levels:

> a<-data.table(id=factor(rep(letters[1:4],2)), var=c(1,2,1:6), key="id,var")
> a[ .(levels(id), 1),, nomatch = 0 ]
id var
1: a 1
2: c 1

data.table: How to do binary search for two (numeric) values at one key: example included

You don't need to convert it into a character vector (although an integer would make more sense):

DT <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))
setkey(DT, a)
DT[J(c(3, 9))]

Moreover, if you have the latest version from CRAN, the second time you use a in i it will automatically use binary search.
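
For example, a sketch of that auto-indexing behaviour on an unkeyed copy (the exact behaviour depends on your data.table version and the datatable.auto.index option):

DT2 <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))   # no key set
DT2[a == 3]   # first use builds an index on a as a side effect
DT2[a == 3]   # later uses reuse that index, i.e. binary search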

Can we do binary search in data.table with OR select queries

Yes:

In order to do a binary search, we need the appropriate indices.

indx <- rbind(DT[y==25, list(y=25), by=x], DT[.("a"), list(x="a"), by=y], use.names=TRUE)
indx <- setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE])))
indx

DT[.(indx)]

Benchmarking:

This gives us more than a 10x improvement over vectorized search.

identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
# [1] TRUE


library(microbenchmark)
microbenchmark(UsingIndx = DT[.(indx)], UsingVecSearch = DT[x=="a" | y == 25], times=100 )

Unit: milliseconds
expr min lq median uq max
1 UsingIndx 34.27562 41.70119 48.13215 49.29752 231.1669
2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842


For convenience, we can wrap the "creating the index" portion of the code into a nice function,
so that we can then call it in a single line. For example:

DT[.(OrIndx("a", 25, DT))]

Where OrIndx() is defined as follows:

OrIndx <- function(xval, yval, DT) {
  # TODO: Allow for arbitrary columns and column names
  if (!is.data.table(DT))
    stop("DT is not a data.table")

  # create all appropriate combinations
  indx <- rbind(DT[y==yval, list(y=yval), by=x], DT[.(xval), list(x=xval), by=y], use.names=TRUE)

  # take out any combinations in indx that are not actually present in DT and return
  return(setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))))
}

Explanation:

The idea here is that performing an "or" search requires some form of combination.

In a standard vector search, this combination is of the results of each individual vector search.

data.table offers some great speed improvements by allowing searches such as

 DT[.(c("cdf", "tmb"), c(25, 3))] 

Therefore, a natural solution to the question would be to use:

 DT[.(c(<all values of x>, "a"), c(25, <all values of y>))] 

The only problem is that the recycling would not line up properly: the first vector has length length(unique(x)) + 1 and the second has length 1 + length(unique(y)), so the values would not pair up as intended.

It would be ideal to have an option like:

 DT[.( list( c(unique(x), y=25), c(x="a", y=unique(y)) ) )]

But as far as I can tell that has not been implemented (yet!).

So instead, we can take appropriate combinations.

The function OrIndx above does exactly that. (It's quick & dirty and there are more efficient ways of creating the index.)


Update with additional benchmarks

As per @Arun's suggestion, we include:

rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])
rbindlist(list( DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)] ))

Tested on 1e6 and 1e7 rows:

  ## Using 1 Million rows
> microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_Rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=70L )
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 4.865089 5.755615 5.813938 5.957352 6.880743
2 Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670
3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338
4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946


  ## Using 10 Million rows
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 33.71108 47.5402 48.7574 50.75285 122.0950
2 Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907
3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639
4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 765.3997

# Making sure all the same results:
> identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
[1] TRUE
> identical(setkey(DT[.(indx)]), setkey(rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])))
[1] TRUE

Note that for SMALL tables (less than 15K rows), vector search is faster (for really small tables, about twice as fast).

  ## Using  100 Rows
> microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=150L )
Unit: microseconds
expr min lq median uq max
1 Using_Indx 884.819 901.854 917.3715 933.642 9740.046
2 Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637
3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146
4 Using_VecSearch 386.867 401.328 407.5730 420.647 2908.090

This pattern holds until about 10,000 rows, at which point we start to see the gains:

  ## 10,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 891.374 921.784 931.6585 956.737 3780.971
4 Using_VecSearch 796.316 815.965 824.1480 845.151 2531.314

## 15,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 913.963 939.198 954.518 986.609 2900.174
4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470

## 30,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 964.402 995.883 1018.535 1045.908 5999.390
4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976 8868.470

## 100,000 Rows
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 1.142318 1.181023 1.198611 1.268417 3.611945
4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510

## 10,000,000 Rows (only ran 30 reps for this one)
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 33.95004 42.24995 48.90363 50.15424 177.0991
2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465

