Why Is expand.grid Faster Than data.table's CJ?

Why is expand.grid faster than data.table's CJ?

Thanks for reporting this. This has now been fixed in data.table 1.8.9. Here's the timing test with the latest commit (913):

system.time(expand.grid(1:1000, 1:10000))
#  user  system elapsed
# 1.420   0.552   1.987

system.time(CJ(1:1000, 1:10000))
#  user  system elapsed
# 0.080   0.092   0.171

From NEWS:

CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.

Also check out NEWS for other notable features and bug fixes that have made it in; e.g., CJ() gains a new sorted argument too.
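For instance, a minimal sketch of the sorted argument (behavior as described in NEWS; exact output layout may vary by version):

library(data.table)

CJ(c(2, 1), c("b", "a"))                  # default sorted = TRUE: result comes back sorted (and keyed)
CJ(c(2, 1), c("b", "a"), sorted = FALSE)  # keeps the inputs in the order supplied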

How to speed up `expand.grid()` in R?

You may try the data.table::CJ function.
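The question's year and names vectors (and its custom expand.grid.jc helper) are not reproduced here; a hypothetical stand-in, only so the comparison below is runnable, could be:

year  <- 1900:2020                # hypothetical stand-in for the question's vector
names <- paste0("name", 1:10000)  # hypothetical stand-in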

bench::mark(base   = expand.grid(year, names),
            jc     = expand.grid.jc(year, names),
            tidyr1 = tidyr::expand_grid(year, names),
            tidyr2 = tidyr::crossing(year, names),
            dt     = data.table::CJ(year, names),
            check = FALSE, iterations = 10)

# expression      min   median `itr/sec` mem_alloc `gc/sec` n_itr  n_gc total_time result memory  time   gc
# <bch:expr> <bch:tm> <bch:tm>     <dbl> <bch:byt>    <dbl> <int> <dbl>   <bch:tm> <list> <list>  <list> <lis>
#1 base       635.48ms 715.02ms     1.25      699MB    2.00     10    16      8.02s <NULL> <Rprof… <benc… <tib…
#2 jc            5.66s    5.76s     0.172     820MB    0.275    10    16     58.13s <NULL> <Rprof… <benc… <tib…
#3 tidyr1     195.03ms 268.97ms     4.01      308MB    2.00     10     5       2.5s <NULL> <Rprof… <benc… <tib…
#4 tidyr2     590.91ms 748.35ms     1.31      312MB    0.656    10     5      7.62s <NULL> <Rprof… <benc… <tib…
#5 dt          318.1ms 384.21ms     2.47      206MB    0.986    10     4      4.06s <NULL> <Rprof… <benc… <tib…

PS: tidyr::crossing is also included for comparison, as it does the same thing.

R: Why is expand.grid() producing many more rows than I expect?

expand.grid makes no attempt to return only unique values of the input vectors. It always outputs a data frame whose row count equals the product of the lengths of its input vectors:

nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000

nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1

If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:

d <- lengths(args)

which returns a vector with one entry for each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).

There then follows the line

orep <- prod(d)

which gives us the product of d, which is 100 × 100, or 10,000.

The variable orep is used later in the function to recycle each input vector to length orep, so that every column of the result has one row per combination.

If you only want unique combinations of the two input vectors, then you must make them unique at the input to expand.grid.
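For example, with the question's df:

expand.grid(unique(df$x), unique(df$y))  # deduplicate the inputs first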

replicate `expand.grid()` behavior with data.frames using tidyr/data.table

Try with do.call:
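(The outputs below imply an input roughly like the following; this reconstruction is an assumption, not the question's exact data.)

df <- data.frame(V1 = c(0.3, 0.7), V2 = c(0.6, 0.4))  # assumed from the outputs shown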

> do.call(tidyr::expand_grid, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.6
2 0.3 0.4
3 0.7 0.6
4 0.7 0.4

> do.call(tidyr::crossing, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.4
2 0.3 0.6
3 0.7 0.4
4 0.7 0.6

> do.call(data.table::CJ, df)
V1 V2
1: 0.3 0.4
2: 0.3 0.6
3: 0.7 0.4
4: 0.7 0.6

Expand two large data files and apply using data.table?

Here is a data.table solution; it should be pretty fast:

library(data.table)
# CJ is data.table's fast analogue of expand.grid
indx <- CJ(indx1 = seq(nrow(df2)), indx2 = seq(nrow(df1)))
indx[, `:=`(result  = foo.new(df1[indx1, ], df2[indx2, ]),
            Group.1 = rep(seq(nrow(df1)), each = nrow(df2)))][
     , .(sums = sum(result)), by = Group.1]

Group.1 sums
1: 1 355
2: 2 365
3: 3 375
4: 4 385
5: 5 395
6: 6 405
7: 7 415
8: 8 425
9: 9 435
10: 10 445

Using data.table's fast expand-grid CJ does not work when a sublist is integer(0)

The CJ function comes from data.table, so it is worth adding that tag to the question.

There is an open feature request to make CJ a generic method so that it can handle different types separately.

Below is a function that addresses your question.

library(data.table)

f = function(x) {
  stopifnot(is.list(x))
  ll = sapply(x, length)
  # replace zero-length elements with 0L so they don't collapse the cross join to zero rows
  if (any(ll == 0L)) x[ll == 0L] = 0L
  do.call(CJ, args = x)
}

x = list(c(1,2,3,4,3,2,1,2), c(1,2,3,4), c(5,6,4), integer(0))
f(x)
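With the zero-length fourth element replaced by 0L, f(x) returns the full cross join of the remaining elements plus a constant 0 column, rather than a zero-row table.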

Should data.table's CJ continue accommodating arguments with duplicate elements?

For completeness, to add an answer, from NEWS:

CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.

Reminder: CJ = Cross Join; i.e., joins to all combinations of its inputs.
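Relatedly, current data.table versions expose a unique argument on CJ() (worth verifying against your installed version) that drops duplicate elements before combining:

library(data.table)

CJ(c(1, 1, 2), c("a", "b"))                 # duplicates kept: 3 x 2 = 6 rows
CJ(c(1, 1, 2), c("a", "b"), unique = TRUE)  # duplicates dropped first: 2 x 2 = 4 rows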

Why data.table's CJ doesn't respect column-major order

It's convenient to have the result of CJ sorted like that because the result can then be keyed by all of its columns (and it is), which enables operations like this:

dt = data.table(a = c(1,2,1), b = 1:3, c = c('a', 'a', 'b'))
setkey(dt, a, c)
dt
# a b c
#1: 1 1 a
#2: 1 3 b
#3: 2 2 a

dt[CJ(unique(a), unique(c))]
# a b c
#1: 1 1 a
#2: 1 3 b
#3: 2 2 a
#4: 2 NA b

# just checking the key:
key(dt[, CJ(unique(a), unique(c))])
#[1] "V1" "V2"

