Why is expand.grid faster than data.table's CJ?
Thanks for reporting this. This has been fixed now in data.table 1.8.9. Here's the timing test with the latest commit (913):
system.time(expand.grid(1:1000,1:10000))
# user system elapsed
# 1.420 0.552 1.987
system.time(CJ(1:1000,1:10000))
# user system elapsed
# 0.080 0.092 0.171
From NEWS:
CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.
Also check out NEWS for other notable features and bug fixes that have made it in; e.g., CJ() gains a new sorted argument too.
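As a quick sketch of that new argument: sorted (default TRUE) controls whether CJ() sorts and keys its result, while sorted = FALSE keeps the inputs in the order supplied.

```r
library(data.table)

# Default sorted = TRUE: inputs are sorted, last column varies fastest
CJ(c(2, 1), c("b", "a"))
#    V1 V2
# 1:  1  a
# 2:  1  b
# 3:  2  a
# 4:  2  b

# sorted = FALSE keeps the inputs in the order supplied
CJ(c(2, 1), c("b", "a"), sorted = FALSE)
```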
How to speed up `expand.grid()` in R?
You may try the data.table::CJ function.
bench::mark(base   = expand.grid(year, names),
            jc     = expand.grid.jc(year, names),
            tidyr1 = tidyr::expand_grid(year, names),
            tidyr2 = tidyr::crossing(year, names),
            dt     = data.table::CJ(year, names),
            check = FALSE, iterations = 10)
# expression min median `itr/sec` mem_alloc `gc/sec` n_itr n_gc total_time result memory time gc
# <bch:expr> <bch:tm> <bch:tm> <dbl> <bch:byt> <dbl> <int> <dbl> <bch:tm> <list> <list> <list> <lis>
#1 base 635.48ms 715.02ms 1.25 699MB 2.00 10 16 8.02s <NULL> <Rprof… <benc… <tib…
#2 jc 5.66s 5.76s 0.172 820MB 0.275 10 16 58.13s <NULL> <Rprof… <benc… <tib…
#3 tidyr1 195.03ms 268.97ms 4.01 308MB 2.00 10 5 2.5s <NULL> <Rprof… <benc… <tib…
#4 tidyr2 590.91ms 748.35ms 1.31 312MB 0.656 10 5 7.62s <NULL> <Rprof… <benc… <tib…
#5 dt 318.1ms 384.21ms 2.47 206MB 0.986 10 4 4.06s <NULL> <Rprof… <benc… <tib…
PS - Also included tidyr::crossing for comparison as it does the same thing.
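One caveat worth a small sketch: crossing() de-duplicates and sorts its inputs, while expand_grid() keeps duplicates and input order, so the two can return different row counts on inputs with repeats.

```r
library(tidyr)

x <- c(2, 2, 1)
nrow(expand_grid(a = x, b = x))  # 9: duplicates kept
nrow(crossing(a = x, b = x))     # 4: inputs de-duplicated and sorted first
```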
R: Why is expand.grid() producing many more rows than I expect?
expand.grid makes no attempt to return only unique values of the input vectors. It will always output a data frame with a number of rows equal to the product of the lengths of its input vectors:
nrow(expand.grid(1:10, 1:10, 1:10))
#> [1] 1000
nrow(expand.grid(1, 1, 1, 1, 1, 1, 1, 1, 1))
#> [1] 1
If you look at the source code for expand.grid, it takes the variadic dots and turns them into a list called args. It then includes the line:
d <- lengths(args)
which returns a vector with one entry for each vector that we feed into expand.grid. In the case of expand.grid(df$x, df$y), d would be equivalent to c(100, 100).
There then follows the line
orep <- prod(d)
which gives us the product of d, which is 100 × 100, or 10,000. The variable orep is used later in the function to repeat each vector so that its length is equal to the value of orep.
If you only want unique combinations of the two input vectors, then you must make them unique at the input to expand.grid.
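A sketch of that fix, with hypothetical vectors standing in for df$x and df$y (100 elements each, 10 unique values):

```r
df <- data.frame(x = rep(1:10, each = 10), y = rep(1:10, times = 10))

nrow(expand.grid(df$x, df$y))                  # 10000 rows: 100 * 100
nrow(expand.grid(unique(df$x), unique(df$y)))  # 100 rows: 10 * 10
```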
replicate `expand.grid()` behavior with data.frames using tidyr/data.table
Try with do.call:
> do.call(tidyr::expand_grid, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.6
2 0.3 0.4
3 0.7 0.6
4 0.7 0.4
> do.call(tidyr::crossing, df)
# A tibble: 4 x 2
V1 V2
<dbl> <dbl>
1 0.3 0.4
2 0.3 0.6
3 0.7 0.4
4 0.7 0.6
> do.call(data.table::CJ, df)
V1 V2
1: 0.3 0.4
2: 0.3 0.6
3: 0.7 0.4
4: 0.7 0.6
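Base R works the same way, since do.call() passes each column of the data frame as a separate argument (the df below is reconstructed from the output above). Note that expand.grid() varies the first column fastest, so the row order differs from the tidyr and data.table results:

```r
df <- data.frame(V1 = c(0.3, 0.7), V2 = c(0.6, 0.4))

do.call(expand.grid, df)
#    V1  V2
# 1 0.3 0.6
# 2 0.7 0.6
# 3 0.3 0.4
# 4 0.7 0.4
```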
Expand two large data files and apply using data.table?
Here is a data.table solution; it should be pretty fast:
library(data.table)
# CJ is data.table's function for expand.grid
indx <- CJ(indx1 = seq(nrow(df2)), indx2 = seq(nrow(df1)))
indx[, `:=`(result = foo.new(df1[indx1, ], df2[indx2, ]),
            Group.1 = rep(seq(nrow(df1)), each = nrow(df2)))][
  , .(sums = sum(result)), by = Group.1]
Group.1 sums
1: 1 355
2: 2 365
3: 3 375
4: 4 385
5: 5 395
6: 6 405
7: 7 415
8: 8 425
9: 9 435
10: 10 445
Using data.table's fast expand grid CJ does not work when a sublist is integer(0)
The CJ function comes from data.table, so it is worth adding that tag to the question.
There is an open FR to create a CJ generic method, so it could handle different types separately.
Below is a function which addresses your question.
library(data.table)
f = function(x) {
  stopifnot(is.list(x))
  ll = sapply(x, length)
  if (any(ll == 0L)) x[ll == 0L] = 0L  # replace zero-length elements with 0L
  do.call(CJ, args = x)
}
x = list(c(1, 2, 3, 4, 3, 2, 1, 2), c(1, 2, 3, 4), c(5, 6, 4), integer(0))
f(x)
Should data.table's CJ continue accommodating arguments with duplicate elements?
For completeness, to add an answer, from NEWS:
CJ() is 90% faster on 1e6 rows (for example), #4849. The inputs are now sorted first before combining rather than after combining and uses rep.int instead of rep (thanks to Sean Garborg for the ideas, code and benchmark) and only sorted if is.unsorted(), #2321.
Reminder: CJ = Cross Join; i.e., it joins to all combinations of its inputs.
Why data.table CJ doesn't respect column major order
It's convenient to have the result of CJ sorted like that, as it can then be keyed by all of its columns (which it is), which then enables operations like this:
dt = data.table(a = c(1,2,1), b = 1:3, c = c('a', 'a', 'b'))
setkey(dt, a, c)
# a b c
#1: 1 1 a
#2: 1 3 b
#3: 2 2 a
dt[CJ(unique(a), unique(c))]
# a b c
#1: 1 1 a
#2: 1 3 b
#3: 2 2 a
#4: 2 NA b
# just checking the key:
key(dt[, CJ(unique(a), unique(c))])
#[1] "V1" "V2"
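To see the two orderings side by side: expand.grid() is column-major (its first argument varies fastest), while CJ() returns a sorted, keyed result in which the last column varies fastest.

```r
library(data.table)

# expand.grid(): first argument varies fastest (column-major)
expand.grid(a = 1:2, b = 3:4)
#   a b
# 1 1 3
# 2 2 3
# 3 1 4
# 4 2 4

# CJ(): sorted result, last column varies fastest
CJ(a = 1:2, b = 3:4)
#    a b
# 1: 1 3
# 2: 1 4
# 3: 2 3
# 4: 2 4
```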