Subsetting data.table by 2nd column only of a 2 column key, using binary search not vector scan


Yes, you can pass all values for the first key column and subset with the specific value for the second key column.

DT[J(unique(x), 25), nomatch=0]
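
For context (the question's own data isn't reproduced here), here is a minimal sketch of the kind of table this applies to, assuming a two-column key on (x, y):

library(data.table)
DT <- data.table(x = rep(letters[1:5], each = 4), y = rep(c(24, 25, 26, 27), times = 5))
setkey(DT, x, y)                     # two-column key: x first, y second
DT[J(unique(x), 25), nomatch=0]      # every x paired with y == 25, found by binary search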

If you need to subset by more than one value in the second key (e.g. the equivalent of DT[y %in% 25:24]), a more general solution is to use CJ:

DT[CJ(unique(x), 25:24), nomatch=0]

Note that CJ by default sorts the columns and sets the key to all the columns, which means the result would be sorted as well. If that's not desirable, use sorted=FALSE:

DT[CJ(unique(x), 25:24, sorted=FALSE), nomatch=0]

There's also a feature request to add secondary keys to data.table in the future. I believe the plan is to add a new function set2key.

FR#1007 Build in secondary keys

There is also merge, which has a method for data.table. It builds the secondary key inside it for you, so it should be faster than base merge. See ?merge.data.table.
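
Continuing the same hypothetical sketch as above, the merge route to subset on the second key column might look like this:

merge(DT, data.table(y = 25), by = "y")   # merge.data.table builds the key it needs on y internally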

Subsetting a data.table with a binary search by range in two columns

My understanding is that a rolling join uses binary search but only on the last joining key, hence it is not possible to perform a rolling join on 4 keys simultaneously. In addition, your values are non-integer, so it is not possible to pinpoint the 4 corners using binary search.

Having said that, here are some options to speed up the subsetting, with the non-equi join being the fastest, though I faced some memory limitation issues at your dimensions:

m0 <- function()
    lapply(seq_len(nrow(ranges)), function(i) {
        DT[x %between% c(ranges[i, ]$x_min, ranges[i, ]$x_max) &
           y %between% c(ranges[i, ]$y_min, ranges[i, ]$y_max)]
    })

m1 <- function()
    ranges[, DT[x %between% c(x_min, x_max) & y %between% c(y_min, y_max)], 1L:nrow(ranges)]

m2 <- function() {
    setkey(DT, x, y)
    setDT(ranges, key=c("x_min", "x_max", "y_min", "y_max"))
    DT[ranges, on=.(x>=x_min, x<=x_max, y>=y_min, y<=y_max), allow.cartesian=TRUE, .(x.x, x.y, x.z)]
}

m3 <- function() {
    setkey(DT3, x)[, rn := .I]
    ranges[, ixmin := DT3[.SD, on=.(x=x_min), roll=-Inf, rn]]
    ranges[, ixmax := DT3[.SD, on=.(x=x_max), roll=Inf, rn]]

    setkey(DT3, y)
    DT3[DT3[ranges, on=.(y>=y_min, y<=y_max),
            by=.EACHI, .(rn=rn[rn %between% c(ixmin, ixmax)])], on=.(rn),
        .(x, y, z)]
}

microbenchmark::microbenchmark(times=1L, m0(), m1(), m2(), m3())

timings:

Unit: milliseconds
expr min lq mean median uq max neval
m0() 782.6070 782.6070 782.6070 782.6070 782.6070 782.6070 1
m1() 713.9469 713.9469 713.9469 713.9469 713.9469 713.9469 1
m2() 272.6018 272.6018 272.6018 272.6018 272.6018 272.6018 1
m3() 765.3667 765.3667 765.3667 765.3667 765.3667 765.3667 1

data:

library(data.table)
set.seed(0L)
nr <- 2e4L
nrng <- 1e3L
dat <- data.table(x=runif(nr), y=runif(nr), z=runif(nr))
ranges <- data.table(x_min=runif(nrng, max = 0.5), x_max=runif(nrng, min = 0.5),
                     y_min=runif(nrng, max = 0.5), y_max=runif(nrng, min = 0.5))
dat[, rn := .I]

DT3 <- copy(dat)
DT <- copy(dat)

Subsetting a data.table by range making use of binary search

Interesting question. First let's look at the example data:

> print(DT)
x y
1: 2.607703e-07 5.748127
2: 8.894131e-07 5.233994
3: 1.098961e-06 9.834267
4: 1.548324e-06 2.016585
5: 1.569279e-06 7.957730
---
9999996: 9.999996e+00 9.977782
9999997: 9.999998e+00 2.666575
9999998: 9.999999e+00 6.869967
9999999: 9.999999e+00 1.953145
10000000: 1.000000e+01 4.001616
> length(DT$x)
[1] 10000000
> length(unique(DT$x))
[1] 9988478
> length(DT$y)
[1] 10000000
> length(unique(DT$y))
[1] 9988225
> DT[,.N,by=x][,table(N)]
N
1 2 3
9976965 11504 9
> DT[,.N,by="x,y"][,table(N)]
N
1
10000000
>

So there are almost 10 million unique floating point values in the first column: a few groups of 2 and 3 rows, but mostly 1-row groups. Once the second column is included, there are 10 million unique groups of size 1 row. This is quite a tough problem, since data.table is designed more with grouped data in mind; e.g. (id, date), (id1, id2, date, time), etc.

However, data.table and setkey do support floating point data in keys, so let's give it a go.
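
(To follow along, data of roughly this shape, 10 million rows of nearly-unique doubles, can be simulated as below; this is an assumption, since the question's actual generation code isn't shown here.)

library(data.table)
set.seed(1)
DT <- data.table(x = runif(1e7, 0, 10), y = runif(1e7, 0, 10))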

On my slow netbook:

> system.time(setkey(DT,x,y))
user system elapsed
7.097 0.520 7.650

> system.time(DT[x>5 & y<7])
user system elapsed
2.820 0.292 3.122

So the vector scanning approach is faster than setting the key (and we haven't even used the key yet). Given the data is floating point and almost unique, this isn't too surprising, but I think that's a pretty fast time for setkey to sort 10 million thoroughly random and almost unique doubles.

Compare to base R, for example, sorting just x (not even y as well):

> system.time(base::order(DT$x))
user system elapsed
72.445 0.292 73.072

Assuming this data is representative of your real data, and that you don't want to do this just once but several times (so are willing to pay the price of setkey), the first step is pretty clear:

> system.time(w <- DT[.(5),which=TRUE,roll=TRUE])
user system elapsed
0.004 0.000 0.003
> w
[1] 4999902

But here we're stuck. A next step like DT[(w+1):nrow(DT)] is ugly and copies. I can't think of a decent way to use the key from here to do the y<7 part as well. In other example data we do something like DT[.(unique(x), 7), which=TRUE, roll=TRUE], but in this case the data is so unique and floating point that this is going to be slow.
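
For completeness, that ugly two-step route would look something like this (a sketch: the roll=TRUE binary search finds the last row with x <= 5, then the tail is copied and vector scanned on y):

w <- DT[.(5), which=TRUE, roll=TRUE]   # row number of the last x <= 5, via binary search
ans <- DT[(w+1):nrow(DT)][y < 7]       # copy the x > 5 tail, then vector scan on y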

Ideally, this task needs range joins (FR#203) implementing. The syntax in this example might be:

DT[.( c(5,Inf), c(-Inf,7) )]

or to make it easier, DT[x>5 & y<7] could be optimized to do that under the hood. Allowing a two-column range in i that joins to the corresponding x columns could be quite useful and has come up several times.
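
(Since this answer was written, non-equi joins were added in data.table v1.9.8, so a range subset like this can now be expressed as a join; a sketch, assuming that version or later, where xlim and ylim are just made-up names for the join columns:)

DT[.(xlim = 5, ylim = 7), on = .(x > xlim, y < ylim), nomatch = 0L,
   .(x = x.x, y = x.y)]   # x.x and x.y refer to DT's own columns in the join result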

The speedups in v1.9.2 needed to be done first before we could move on to things like that. If you try setkey on this data in v1.8.10 you'll find that v1.9.2 is significantly faster.

See also:

How to self join a data.table on a condition

Remove a range in data.table

Select rows in data.table using only the second key

Try this:

> a[ .(unique(id), 1),, nomatch = 0 ]
id var
1: a 1
2: c 1

ADDED. We could avoid having to scan id by making it a factor and using the levels:

> a<-data.table(id=factor(rep(letters[1:4],2)), var=c(1,2,1:6), key="id,var")
> a[ .(levels(id), 1),, nomatch = 0 ]
id var
1: a 1
2: c 1

data.table: How to do binary search for two (numeric) values at one key: example included

You don't need to convert it into a character vector (although an integer would make more sense):

DT <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))
setkey(DT, a)
DT[J(c(3, 9))]

Moreover, if you have the latest version from CRAN, the second time you use a in i it will automatically use binary search.
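
For example, a sketch of that auto-indexing behaviour on an unkeyed copy (the exact behaviour depends on your data.table version and the datatable.auto.index option):

DT2 <- data.table(a = c(1, 3, 5, 9, 15), b = c("a", "c", "d", "e", "f"))   # no key set
DT2[a == 3]   # first use builds an index on a as a side effect
DT2[a == 3]   # later uses reuse that index, i.e. binary search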

Can we do binary search in data.table with OR select queries

Yes:

In order to do a binary search, we need the appropriate indices.

indx <- rbind(DT[y==25, list(y=25), by=x], DT[.("a"), list(x="a"), by=y], use.names=TRUE)
indx <- setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE])))
indx

DT[.(indx)]

Benchmarking:

This gives us more than a 10x improvement over vectorized search.

identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
# [1] TRUE


library(microbenchmark)
microbenchmark(UsingIndx = DT[.(indx)], UsingVecSearch = DT[x=="a" | y == 25], times=100 )

Unit: milliseconds
expr min lq median uq max
1 UsingIndx 34.27562 41.70119 48.13215 49.29752 231.1669
2 UsingVecSearch 506.62670 545.85673 636.67701 680.93894 802.0842


For convenience, we can wrap the "creating the index" portion of the code into a nice function,
so that we can then call it in a single line. For example:

DT[.(OrIndx("a", 25, DT))]

Where OrIndx() is defined as follows:

OrIndx <- function(xval, yval, DT) {
  # TODO: Allow for arbitrary columns and column names
  if (!is.data.table(DT))
    stop("DT is not a data.table")

  # create all appropriate combinations
  indx <- rbind(DT[y==yval, list(y=yval), by=x], DT[.(xval), list(x=xval), by=y], use.names=TRUE)

  # take out any combinations in indx that are not actually present in DT and return
  return(setdiff(indx, setdiff(indx, unique(DT[, key(DT), with=FALSE]))))
}

Explanation:

The idea here is that performing an "or" search requires some form of combination.

In a standard vector search, this combination is of the results of each individual vector search.

data.table offers some great speed improvements by allowing searches such as

 DT[.(c("cdf", "tmb"), c(25, 3))] 

Therefore, a natural solution to the question would be to use:

 DT[.(c(<all values of x>, "a"), c(25, <all values of y>))] 

The only problem is that the recycling would not line up properly: the first vector has length length(unique(x)) + 1 and the second has length 1 + length(unique(y)), so the values would not pair up as intended.

It would be ideal to have an option like:

 DT[.( list( c(unique(x), y=25), c(x="a", y=unique(y)) ) )]

But as far as I can tell that has not been implemented (yet!).

So instead, we can take appropriate combinations.

The function OrIndx above does exactly that. (It's quick & dirty and there are more efficient ways of creating the index.)


Update with additional benchmarks

As per @Arun's suggestion, we include:

rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])
rbindlist(list( DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)] ))

Tested on 1e6 and 1e7 rows:

  ## Using 1 Million rows
> microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_Rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=70L )
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 4.865089 5.755615 5.813938 5.957352 6.880743
2 Using_Rbind 42.657953 49.239558 49.682407 50.505977 139.770670
3 Using_RbindList 36.319170 44.169151 44.484350 45.279158 155.361338
4 Using_VecSearch 49.003307 64.030384 64.443666 65.123886 150.099946


  ## Using 10 Million rows
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 33.71108 47.5402 48.7574 50.75285 122.0950
2 Using_rbind 492.38244 535.6062 565.8623 590.92841 727.3907
3 Using_RbindList 436.29325 478.3626 507.4665 525.25980 657.6639
4 Using_VecSearch 511.86248 607.8046 643.9822 688.36733 765.3997

# Making sure all the same results:
> identical(setkey(DT[.(indx)]), setkey(DT[x=="a" | y == 25]))
[1] TRUE
> identical(setkey(DT[.(indx)]), setkey(rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])))
[1] TRUE

Note that for SMALL tables (less than 15K rows), vector search is faster (for really small tables, about twice as fast).

  ## Using  100 Rows
> microbenchmark(Using_Indx = DT[.(indx)], Using_RbindList = rbindlist(list(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)])), Using_rbind = rbind(DT[J("a")], DT[J(setdiff(unique(x), "a"), 25)]), Using_VecSearch = DT[x=="a" | y == 25], times=150L )
Unit: microseconds
expr min lq median uq max
1 Using_Indx 884.819 901.854 917.3715 933.642 9740.046
2 Using_rbind 2385.842 2424.893 2462.5210 2502.704 4266.637
3 Using_RbindList 1962.504 2005.594 2027.4085 2069.516 4238.146
4 Using_VecSearch 386.867 401.328 407.5730 420.647 2908.090

This pattern holds until about 10,000 rows, at which point we start to see the gains:

  ## 10,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 891.374 921.784 931.6585 956.737 3780.971
4 Using_VecSearch 796.316 815.965 824.1480 845.151 2531.314

## 15,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 913.963 939.198 954.518 986.609 2900.174
4 Using_VecSearch 1018.830 1041.449 1053.098 1072.188 8418.470

## 30,000 Rows
Unit: microseconds
expr min lq median uq max
1 Using_Indx 964.402 995.883 1018.535 1045.908 5999.390
4 Using_VecSearch 1649.231 1709.090 1801.760 1927.976 8868.470

## 100,000 Rows
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 1.142318 1.181023 1.198611 1.268417 3.611945
4 Using_VecSearch 4.663948 4.763179 5.052995 6.058354 12.133510

## 10,000,000 Rows (only ran 30 reps for this one)
Unit: milliseconds
expr min lq median uq max
1 Using_Indx 33.95004 42.24995 48.90363 50.15424 177.0991
2 Using_VecSearch 512.34760 557.02867 622.37670 662.14323 861.3465

