How to Do Cross Join in R

How to do cross join in R?

Is it just all=TRUE?

x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)

From documentation of merge:

If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).

Cross Join in dplyr in R

You just need a dummy column to join on:

cust_time$k <- 1
cust_time %>%
inner_join(cust_time, by='k') %>%
select(-k)

Or if you don't want to modify your original dataframe:

cust_time %>%
mutate(k = 1) %>%
replicate(2, ., simplify=FALSE) %>%
Reduce(function(a, b) inner_join(a, b, by='k'), .) %>%
select(-k)

How to Cross Join in R

We can use crossing

library(tidyr)
crossing(table1, table2)

Cartesian Product using data.table package

If you first construct full names from the first and last in the cust-dataframe, you can then use CJ (cross-join). You cannot use all three vectors since there would be 99 items and teh first names would get inappropriately mixed with last names.

> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99

This returns the desired data.table object:

> CJ(dates$date,paste(cust$first.name, cust$last.name) )
V1 V2
1: 2012-08-28 George Smith
2: 2012-08-28 Henry Smith
3: 2012-08-28 John Doe
4: 2012-08-29 George Smith
5: 2012-08-29 Henry Smith
6: 2012-08-29 John Doe
7: 2012-08-30 George Smith
8: 2012-08-30 Henry Smith
9: 2012-08-30 John Doe
10: 2012-08-31 John Doe
11: 2012-08-31 George Smith
12: 2012-08-31 Henry Smith
13: 2012-09-01 John Doe
14: 2012-09-01 George Smith
15: 2012-09-01 Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02 Henry Smith
18: 2012-09-02 John Doe
19: 2012-09-03 Henry Smith
20: 2012-09-03 John Doe
21: 2012-09-03 George Smith
22: 2012-09-04 Henry Smith
23: 2012-09-04 John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05 Henry Smith
27: 2012-09-05 John Doe
28: 2012-09-06 George Smith
29: 2012-09-06 Henry Smith
30: 2012-09-06 John Doe
31: 2012-09-07 George Smith
32: 2012-09-07 Henry Smith
33: 2012-09-07 John Doe
V1 V2

Obtain the same cross join with merge() and sqldf::sqldf()

I think including ORDER BY in sqldf is important, since it drives home the fact that in SQL, ordering is never guaranteed unless explicitly directed.

If you were doing simple ORDER BY with just "increasing" on both variables, then the translation to order in R would be direct. However, since one variable is decreasing and one is increasing, order by itself doesn't deal with that. However, as suggested by https://stackoverflow.com/a/3316719, we can do the same with xtfrm.

out1 <- merge(x = Sales, y = Clients, by = NULL)
out1 <- out1[order(-xtfrm(out1$State), out1$CustomerID.y),]

out2 <- sqldf::sqldf(
"SELECT *
FROM Sales
CROSS JOIN Clients
ORDER BY State DESC, Clients.CustomerID")

### proof they are identical
all(unlist(Map(`==`, out1, out2)))
# [1] TRUE

The xtfrm helper function here allows us to negate the "values" of a column for the purposes of sorting. From ?xtfrm:

A generic auxiliary function that produces a numeric vector which will sort in the same order as 'x'.

If the field were already numeric, we could merely do order(-State, CustomerID.y), but the fact that it is character requires a further step. Argo xtfrm.


Edit: in comments, it's determined that the OP wants to mimic the sort-order of merge in the SQL statement. Unfortunately, because this is a cartesian product of the two frames, no sorting is applied: merge merely cbinds all rows of the first frame against the first row of the second frame, then repeats with each row of the second.

This can be demonstrated by using some code from merge:

nx <- nrow(x) # Sales
ny <- nrow(y) # Clients
expand.grid(seq_len(nx), seq_len(ny))
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 1 2
# ...
# 33 3 7
# 34 4 7
# 35 5 7

where each number is a row from the respective frames (x for Var1, y for Var2). If the original data is:

## Sales                        ## Clients        
Product CustomerID Price CustomerID State
1 Toaster 1_2019 37 1 2_2019 AZ
2 Radio 1_2019 33 2 3_2019 MA
3 Radio 2_2019 33 3 4_2019 AZ
4 TV 3_2019 408 4 1_2020 IL
5 Toaster 1_2020 37 5 2_2020 MA
6 TV 2_2020 408
7 TV 3_2020 408

then this results in

out1
# Product CustomerID.x Price CustomerID.y State
# 1 Toaster 1_2019 37 2_2019 AZ
# 2 Radio 1_2019 33 2_2019 AZ
# 3 Radio 2_2019 33 2_2019 AZ
# 4 TV 3_2019 408 2_2019 AZ
# 5 Toaster 1_2020 37 2_2019 AZ
# 6 TV 2_2020 408 2_2019 AZ
# 7 TV 3_2020 408 2_2019 AZ
# 8 Toaster 1_2019 37 3_2019 MA
# ...
# 33 Toaster 1_2020 37 2_2020 MA
# 34 TV 2_2020 408 2_2020 MA
# 35 TV 3_2020 408 2_2020 MA

which will very much destroy any sorting present in x (Sales), even if y (Clients) comes pre-sorted (which it does).

Because of this, if you want congruity between R and SQL cross-join solutions, I suggest the most transparent/clear way would be to merge in R and then apply post-merge ordering in a fashion that is similar to SQL. In fact, from a pedagogic perspective, ask the question: *"What ordering makes sense to humans?" If you assert during the lesson plan that ordering may not be assured until explicitly strong-armed into the process (via dplyr::arrange, x[order(...),], or SQL's ORDER BY clause). Find the intuitive ordering of the data and then demonstrate that in both R and SQL.

Side notes:

  1. Your sqldf query results in same-named columns, this results in some errors post-sqldf if you start playing with columns. This can be mitigated with select ... as ... field-naming.
  2. Lexicographic sorting of your data is unfortunately counter-intuitive at the moment: having year at the end of a customer id suggests (yes, I'm inferring) a timeline of customer onboarding, yet they will sort first by the leading number. Similar to how "2020-05-04" sorts correctly even as a string, while "05/04/2020" does not, it might support more intuitive sorting to have the most-significant portion be the leading part of id strings. Or make them integers. Or UUIDs (v4, of course), those are always fun.

cartesian product with dplyr R

Use crossing from the tidyr package:

x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))

crossing(x, y)

Result:

   x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3

R data.table cross-join by three variables

You can also do this:

data[, .(date=dates_wanted), .(group,id)]

Output:

     group     id       date
1: A frank 2020-01-01
2: A frank 2020-01-02
3: A frank 2020-01-03
4: A frank 2020-01-04
5: A frank 2020-01-05
---
120: B edward 2020-01-27
121: B edward 2020-01-28
122: B edward 2020-01-29
123: B edward 2020-01-30
124: B edward 2020-01-31


Related Topics



Leave a reply



Submit