R: Data.Table Cross-Join Not Working

How to do cross join in R?

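Is it just all=TRUE? Not quite: the cross join comes from by being empty, not from all=TRUE. Here x and y share no column names, so by defaults to a zero-length vector; passing by=NULL requests the same thing explicitly.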

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).
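As a minimal sketch of the rule quoted above, the following uses base R only and passes by = NULL, so the Cartesian product does not depend on the two data frames happening to share no column names. The object name r is only illustrative.

```r
x <- data.frame(id1 = c("a", "b", "c"), vals1 = 1:3)
y <- data.frame(id2 = c("d", "e", "f"), vals2 = 4:6)

# by = NULL has length 0, so merge() pairs every row of x with every row of y
r <- merge(x, y, by = NULL)

# the dimensions follow the documented formula
dim(r)
c(nrow(x) * nrow(y), ncol(x) + ncol(y))
```

Both of the last two expressions should give the same pair of numbers, which is a quick way to confirm that a cross join really happened.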

R: data.table cross-join not working

data.table has no built-in cross join for two data.tables out of the box; CJ takes vectors, not tables.

However, the optiRum package (available on CRAN) provides a CJ.dt function, a CJ-like function designed for data.tables, that returns the Cartesian product (cross join).

You can also define the function yourself:

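library(data.table)

CJ.dt = function(X, Y) {
  stopifnot(is.data.table(X), is.data.table(Y))
  k = NULL                 # avoids an R CMD check note about k being undefined
  X = X[, c(k=1, .SD)]     # copy of X with a constant helper column k in front
  setkey(X, k)
  Y = Y[, c(k=1, .SD)]     # same for Y
  setkey(Y, NULL)          # Y stays unkeyed, so the join uses its first column, k
  # every row matches every row on k; drop the helper column afterwards
  X[Y, allow.cartesian=TRUE][, k := NULL][]
}
CJ.dt(dtCustomers, dtDates1)
CJ.dt(dtCustomers, dtDates2)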

There is also a feature request for a more convenient way to perform a cross join, filed as data.table#1717, so you could check there whether a nicer API for cross joins has become available.
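An alternative that avoids the helper key column is to cross join the row numbers with CJ and then index both tables. This is only a sketch, not the optiRum implementation; the tables X and Y and their columns are made up for illustration, and it assumes the two tables have no column names in common.

```r
library(data.table)

# hypothetical example tables
X <- data.table(id1 = c("a", "b", "c"), vals1 = 1:3)
Y <- data.table(id2 = c("d", "e"), vals2 = 4:5)

# every combination of a row number of X with a row number of Y
idx <- CJ(i = seq_len(nrow(X)), j = seq_len(nrow(Y)))

# pick the rows by index and bind the columns side by side
res <- cbind(X[idx$i], Y[idx$j])
res
```

The result has nrow(X) * nrow(Y) rows, with the rows of X varying slowest.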

Cartesian Product using data.table package

If you first build full names from the first and last names in the cust data frame, you can then use CJ (cross join). You cannot pass all three vectors separately, because that produces 99 rows and the first names get paired with the wrong last names.

> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99

This returns the desired data.table object:

> CJ(dates$date,paste(cust$first.name, cust$last.name) )
V1 V2
1: 2012-08-28 George Smith
2: 2012-08-28 Henry Smith
3: 2012-08-28 John Doe
4: 2012-08-29 George Smith
5: 2012-08-29 Henry Smith
6: 2012-08-29 John Doe
7: 2012-08-30 George Smith
8: 2012-08-30 Henry Smith
9: 2012-08-30 John Doe
10: 2012-08-31 John Doe
11: 2012-08-31 George Smith
12: 2012-08-31 Henry Smith
13: 2012-09-01 John Doe
14: 2012-09-01 George Smith
15: 2012-09-01 Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02 Henry Smith
18: 2012-09-02 John Doe
19: 2012-09-03 Henry Smith
20: 2012-09-03 John Doe
21: 2012-09-03 George Smith
22: 2012-09-04 Henry Smith
23: 2012-09-04 John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05 Henry Smith
27: 2012-09-05 John Doe
28: 2012-09-06 George Smith
29: 2012-09-06 Henry Smith
30: 2012-09-06 John Doe
31: 2012-09-07 George Smith
32: 2012-09-07 Henry Smith
33: 2012-09-07 John Doe
V1 V2
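The dates and cust objects come from the question and are not shown here, so the following sketch rebuilds plausible stand-ins (11 dates and the three customers visible in the output above) to make the row counts reproducible. The column names date and name passed to CJ are my own choice.

```r
library(data.table)

# hypothetical stand-ins for the dates and cust objects
dates <- data.table(date = seq(as.Date("2012-08-28"), as.Date("2012-09-07"), by = "day"))
cust  <- data.table(first.name = c("George", "Henry", "John"),
                    last.name  = c("Smith", "Smith", "Doe"))

# three separate vectors: 11 * 3 * 3 combinations, first and last names get mixed
n_mixed <- nrow(CJ(dates$date, cust$first.name, cust$last.name))

# two vectors: each date is paired with each real customer
full <- CJ(date = dates$date, name = paste(cust$first.name, cust$last.name))
n_full <- nrow(full)

c(n_mixed, n_full)
```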

Cross join in data.table doesn't seem to retain column names

The main body of the help file ?CJ (the Details and Value sections) does not mention that names are retained. However, a comment in the Examples section of the help file does say so (and it looks like this is where you got your example).

Digging around in the CJ function, which appears to be implemented entirely in R, I found this block near the end:

if (getOption("datatable.CJ.names", FALSE))
    vnames = name_dots(...)$vnames

Running getOption("datatable.CJ.names", FALSE) returns FALSE with data.table version 1.12.0. When we set this to TRUE with

options("datatable.CJ.names"=TRUE)

then the code

x = c(1,1,2)
y = c(4,6,4)

CJ(x, y)

returns

   x y
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6

However, you are also able to directly provide names (which is not mentioned in the help file).

CJ(uu=x, vv=y)

which returns

   uu vv
1: 1 4
2: 1 4
3: 1 4
4: 1 4
5: 1 6
6: 1 6
7: 2 4
8: 2 4
9: 2 6

Note that this overrides the above option.
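Since the naming of unnamed inputs depends on the option above (and possibly on the installed data.table version), passing names explicitly is the more predictable route. A small sketch to confirm this, using the same vectors:

```r
library(data.table)

x <- c(1, 1, 2)
y <- c(4, 6, 4)

# explicit argument names become the column names
res <- CJ(uu = x, vv = y)

names(res)
nrow(res)
```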

R data.table cross-join by three variables

You can also do this:

data[, .(date=dates_wanted), .(group,id)]

Output:

     group     id       date
  1:     A  frank 2020-01-01
  2:     A  frank 2020-01-02
  3:     A  frank 2020-01-03
  4:     A  frank 2020-01-04
  5:     A  frank 2020-01-05
 ---
120:     B edward 2020-01-27
121:     B edward 2020-01-28
122:     B edward 2020-01-29
123:     B edward 2020-01-30
124:     B edward 2020-01-31
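The one-liner works because j is evaluated once per (group, id) group and returns the whole date vector each time. Here is a self-contained sketch; the input table is hypothetical (only frank and edward appear in the output above, the other two ids are invented), with one row per (group, id) pair.

```r
library(data.table)

# hypothetical input: one row per (group, id) pair
data <- data.table(group = c("A", "A", "B", "B"),
                   id    = c("frank", "grace", "dana", "edward"))
dates_wanted <- seq(as.Date("2020-01-01"), as.Date("2020-01-31"), by = "day")

# for each (group, id) group, j returns the full vector of dates
out <- data[, .(date = dates_wanted), by = .(group, id)]

nrow(out)   # number of pairs times number of dates
```

Note that grouping collapses duplicates: if data contained the same (group, id) pair several times, that pair would still receive only one copy of the dates.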

R data.table join two tables and keep all rows

This is a cross join: assign a constant helper key to both tables and merge on it.

DT1$Key = 1
DT2$Key = 1
DT3 = merge(DT1, DT2, by='Key')
DT3   # afterwards, DT3$Key = NULL removes the helper key
   Key ID_1 val_1 ID_2 val_2
1:   1    1     1    3     3
2:   1    1     1    4     4
3:   1    2     2    3     3
4:   1    2     2    4     4
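The same idea can be wrapped in a small function that leaves the inputs untouched. This is a sketch under my own assumptions: cross_merge is a hypothetical helper name, not part of data.table, and the example tables are invented. With tables larger than the two-row ones above, merge may refuse a join that produces many more rows than its inputs, so the sketch passes allow.cartesian = TRUE.

```r
library(data.table)

# hypothetical helper, not part of data.table
cross_merge <- function(a, b) {
  a <- copy(a)[, Key := 1L]   # copy so the caller's table is not modified
  b <- copy(b)[, Key := 1L]
  out <- merge(a, b, by = "Key", allow.cartesian = TRUE)
  out[, Key := NULL][]        # drop the helper column again
}

DT1 <- data.table(ID_1 = 1:3, val_1 = 1:3)
DT2 <- data.table(ID_2 = 4:6, val_2 = 4:6)
DT3 <- cross_merge(DT1, DT2)
DT3
```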

Cross join in data.table within a function

You could use ..., which passes the arguments a function receives on to another call:

library(data.table)
f <- function(...) {
  CJ(...)
}

f( c(1:2) , c(3:4) )
# V1 V2
#1: 1 3
#2: 1 4
#3: 2 3
#4: 2 4

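Edit: How about this? It crosses vals with itself n times (the output below corresponds to a two-element vals containing "no" and "yes", and n = 4).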

do.call(CJ, replicate(n, vals, simplify=FALSE))

# V1 V2 V3 V4
# 1: no no no no
# 2: no no no yes
# 3: no no yes no
# 4: no no yes yes
# 5: no yes no no
# 6: no yes no yes
# 7: no yes yes no
# 8: no yes yes yes
# 9: yes no no no
# 10: yes no no yes
# 11: yes no yes no
# 12: yes no yes yes
# 13: yes yes no no
# 14: yes yes no yes
# 15: yes yes yes no
# 16: yes yes yes yes
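A runnable version of the edit, with vals and n filled in to match the output shown. The exact values are inferred from that output, and the name grid is mine.

```r
library(data.table)

vals <- c("no", "yes")
n <- 4

# replicate() builds a list holding n copies of vals;
# do.call() passes the list elements to CJ as separate arguments
grid <- do.call(CJ, replicate(n, vals, simplify = FALSE))

dim(grid)
```

The number of rows is length(vals)^n, so this grows quickly as n increases.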

