How to Do Cross Join in R

How to do cross join in R?

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).

Cross Join in dplyr in R

You just need a dummy column to join on:

cust_time$k <- 1
cust_time %>% 
  inner_join(cust_time, by='k') %>%
  select(-k)

Or if you don't want to modify your original dataframe:

cust_time %>%
  mutate(k = 1) %>%
  replicate(2, ., simplify=FALSE) %>%
  Reduce(function(a, b) inner_join(a, b, by='k'), .) %>%
  select(-k)

How to Cross Join in R

We can use crossing

library(tidyr)
crossing(table1, table2)

Cartesian Product using data.table package

If you first construct full names from the first and last in the cust-dataframe, you can then use CJ (cross-join). You cannot use all three vectors since there would be 99 items and teh first names would get inappropriately mixed with last names.

> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99

This returns the desired data.table object:

> CJ(dates$date,paste(cust$first.name, cust$last.name) )
            V1           V2
 1: 2012-08-28 George Smith
 2: 2012-08-28  Henry Smith
 3: 2012-08-28     John Doe
 4: 2012-08-29 George Smith
 5: 2012-08-29  Henry Smith
 6: 2012-08-29     John Doe
 7: 2012-08-30 George Smith
 8: 2012-08-30  Henry Smith
 9: 2012-08-30     John Doe
10: 2012-08-31     John Doe
11: 2012-08-31 George Smith
12: 2012-08-31  Henry Smith
13: 2012-09-01     John Doe
14: 2012-09-01 George Smith
15: 2012-09-01  Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02  Henry Smith
18: 2012-09-02     John Doe
19: 2012-09-03  Henry Smith
20: 2012-09-03     John Doe
21: 2012-09-03 George Smith
22: 2012-09-04  Henry Smith
23: 2012-09-04     John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05  Henry Smith
27: 2012-09-05     John Doe
28: 2012-09-06 George Smith
29: 2012-09-06  Henry Smith
30: 2012-09-06     John Doe
31: 2012-09-07 George Smith
32: 2012-09-07  Henry Smith
33: 2012-09-07     John Doe
            V1           V2

Obtain the same cross join with merge() and sqldf::sqldf()

I think including ORDER BY in sqldf is important, since it drives home the fact that in SQL, ordering is never guaranteed unless explicitly directed.

If you were doing simple ORDER BY with just "increasing" on both variables, then the translation to order in R would be direct. However, since one variable is decreasing and one is increasing, order by itself doesn't deal with that. However, as suggested by https://stackoverflow.com/a/3316719, we can do the same with xtfrm.

out1 <- merge(x = Sales, y = Clients, by = NULL)
out1 <- out1[order(-xtfrm(out1$State), out1$CustomerID.y),]

out2 <- sqldf::sqldf(
  "SELECT *
   FROM Sales
   CROSS JOIN Clients 
   ORDER BY State DESC, Clients.CustomerID")

### proof they are identical
all(unlist(Map(`==`, out1, out2)))
# [1] TRUE

The xtfrm helper function here allows us to negate the "values" of a column for the purposes of sorting. From ?xtfrm:

A generic auxiliary function that produces a numeric vector which will sort in the same order as 'x'.

If the field were already numeric, we could merely do order(-State, CustomerID.y), but the fact that it is character requires a further step. Argo xtfrm.

Edit: in comments, it's determined that the OP wants to mimic the sort-order of merge in the SQL statement. Unfortunately, because this is a cartesian product of the two frames, no sorting is applied: merge merely cbinds all rows of the first frame against the first row of the second frame, then repeats with each row of the second.

This can be demonstrated by using some code from merge:

nx <- nrow(x) # Sales
ny <- nrow(y) # Clients
expand.grid(seq_len(nx), seq_len(ny))
#    Var1 Var2
# 1     1    1
# 2     2    1
# 3     3    1
# 4     4    1
# 5     5    1
# 6     1    2
# ...
# 33    3    7
# 34    4    7
# 35    5    7

where each number is a row from the respective frames (x for Var1, y for Var2). If the original data is:

## Sales                        ## Clients        
  Product CustomerID Price        CustomerID State
1 Toaster     1_2019    37      1     2_2019    AZ
2   Radio     1_2019    33      2     3_2019    MA
3   Radio     2_2019    33      3     4_2019    AZ
4      TV     3_2019   408      4     1_2020    IL
5 Toaster     1_2020    37      5     2_2020    MA
6      TV     2_2020   408
7      TV     3_2020   408

then this results in

out1
#    Product CustomerID.x Price CustomerID.y State
# 1  Toaster       1_2019    37       2_2019    AZ
# 2    Radio       1_2019    33       2_2019    AZ
# 3    Radio       2_2019    33       2_2019    AZ
# 4       TV       3_2019   408       2_2019    AZ
# 5  Toaster       1_2020    37       2_2019    AZ
# 6       TV       2_2020   408       2_2019    AZ
# 7       TV       3_2020   408       2_2019    AZ
# 8  Toaster       1_2019    37       3_2019    MA
# ...
# 33 Toaster       1_2020    37       2_2020    MA
# 34      TV       2_2020   408       2_2020    MA
# 35      TV       3_2020   408       2_2020    MA

which will very much destroy any sorting present in x (Sales), even if y (Clients) comes pre-sorted (which it does).

Because of this, if you want congruity between R and SQL cross-join solutions, I suggest the most transparent/clear way would be to merge in R and then apply post-merge ordering in a fashion that is similar to SQL. In fact, from a pedagogic perspective, ask the question: *"What ordering makes sense to humans?" If you assert during the lesson plan that ordering may not be assured until explicitly strong-armed into the process (via dplyr::arrange, x[order(...),], or SQL's ORDER BY clause). Find the intuitive ordering of the data and then demonstrate that in both R and SQL.

Side notes:

Your sqldf query results in same-named columns, this results in some errors post-sqldf if you start playing with columns. This can be mitigated with select ... as ... field-naming.
Lexicographic sorting of your data is unfortunately counter-intuitive at the moment: having year at the end of a customer id suggests (yes, I'm inferring) a timeline of customer onboarding, yet they will sort first by the leading number. Similar to how "2020-05-04" sorts correctly even as a string, while "05/04/2020" does not, it might support more intuitive sorting to have the most-significant portion be the leading part of id strings. Or make them integers. Or UUIDs (v4, of course), those are always fun.

cartesian product with dplyr R

Use crossing from the tidyr package:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))

crossing(x, y)

Result:

R data.table cross-join by three variables

You can also do this:

data[, .(date=dates_wanted), .(group,id)]

Output:

     group     id       date
  1:     A  frank 2020-01-01
  2:     A  frank 2020-01-02
  3:     A  frank 2020-01-03
  4:     A  frank 2020-01-04
  5:     A  frank 2020-01-05
 ---                        
120:     B edward 2020-01-27
121:     B edward 2020-01-28
122:     B edward 2020-01-29
123:     B edward 2020-01-30
124:     B edward 2020-01-31

How to Do Cross Join in R