How to do cross join in R?
Is it just all=TRUE
?
x<-data.frame(id1=c("a","b","c"),vals1=1:3)
y<-data.frame(id2=c("d","e","f"),vals2=4:6)
merge(x,y,all=TRUE)
From documentation of merge
:
If by or both by.x and by.y are of length 0 (a length zero vector or NULL), the result, r, is the Cartesian product of x and y, i.e., dim(r) = c(nrow(x)*nrow(y), ncol(x) + ncol(y)).
Cross Join in dplyr in R
You just need a dummy column to join on:
cust_time$k <- 1
cust_time %>%
inner_join(cust_time, by='k') %>%
select(-k)
Or if you don't want to modify your original dataframe:
cust_time %>%
mutate(k = 1) %>%
replicate(2, ., simplify=FALSE) %>%
Reduce(function(a, b) inner_join(a, b, by='k'), .) %>%
select(-k)
How to Cross Join in R
We can use crossing
library(tidyr)
crossing(table1, table2)
Cartesian Product using data.table package
If you first construct full names from the first and last in the cust
-dataframe, you can then use CJ
(cross-join). You cannot use all three vectors since there would be 99 items and teh first names would get inappropriately mixed with last names.
> nrow(CJ(dates$date, cust$first.name, cust$last.name ) )
[1] 99
This returns the desired data.table object:
> CJ(dates$date,paste(cust$first.name, cust$last.name) )
V1 V2
1: 2012-08-28 George Smith
2: 2012-08-28 Henry Smith
3: 2012-08-28 John Doe
4: 2012-08-29 George Smith
5: 2012-08-29 Henry Smith
6: 2012-08-29 John Doe
7: 2012-08-30 George Smith
8: 2012-08-30 Henry Smith
9: 2012-08-30 John Doe
10: 2012-08-31 John Doe
11: 2012-08-31 George Smith
12: 2012-08-31 Henry Smith
13: 2012-09-01 John Doe
14: 2012-09-01 George Smith
15: 2012-09-01 Henry Smith
16: 2012-09-02 George Smith
17: 2012-09-02 Henry Smith
18: 2012-09-02 John Doe
19: 2012-09-03 Henry Smith
20: 2012-09-03 John Doe
21: 2012-09-03 George Smith
22: 2012-09-04 Henry Smith
23: 2012-09-04 John Doe
24: 2012-09-04 George Smith
25: 2012-09-05 George Smith
26: 2012-09-05 Henry Smith
27: 2012-09-05 John Doe
28: 2012-09-06 George Smith
29: 2012-09-06 Henry Smith
30: 2012-09-06 John Doe
31: 2012-09-07 George Smith
32: 2012-09-07 Henry Smith
33: 2012-09-07 John Doe
V1 V2
Obtain the same cross join with merge() and sqldf::sqldf()
I think including ORDER BY
in sqldf
is important, since it drives home the fact that in SQL, ordering is never guaranteed unless explicitly directed.
If you were doing simple ORDER BY
with just "increasing" on both variables, then the translation to order
in R would be direct. However, since one variable is decreasing and one is increasing, order
by itself doesn't deal with that. However, as suggested by https://stackoverflow.com/a/3316719, we can do the same with xtfrm
.
out1 <- merge(x = Sales, y = Clients, by = NULL)
out1 <- out1[order(-xtfrm(out1$State), out1$CustomerID.y),]
out2 <- sqldf::sqldf(
"SELECT *
FROM Sales
CROSS JOIN Clients
ORDER BY State DESC, Clients.CustomerID")
### proof they are identical
all(unlist(Map(`==`, out1, out2)))
# [1] TRUE
The xtfrm
helper function here allows us to negate the "values" of a column for the purposes of sorting. From ?xtfrm
:
A generic auxiliary function that produces a numeric vector which will sort in the same order as 'x'.
If the field were already numeric, we could merely do order(-State, CustomerID.y)
, but the fact that it is character
requires a further step. Argo xtfrm
.
Edit: in comments, it's determined that the OP wants to mimic the sort-order of merge
in the SQL statement. Unfortunately, because this is a cartesian product of the two frames, no sorting is applied: merge
merely cbind
s all rows of the first frame against the first row of the second frame, then repeats with each row of the second.
This can be demonstrated by using some code from merge
:
nx <- nrow(x) # Sales
ny <- nrow(y) # Clients
expand.grid(seq_len(nx), seq_len(ny))
# Var1 Var2
# 1 1 1
# 2 2 1
# 3 3 1
# 4 4 1
# 5 5 1
# 6 1 2
# ...
# 33 3 7
# 34 4 7
# 35 5 7
where each number is a row from the respective frames (x
for Var1
, y
for Var2
). If the original data is:
## Sales ## Clients
Product CustomerID Price CustomerID State
1 Toaster 1_2019 37 1 2_2019 AZ
2 Radio 1_2019 33 2 3_2019 MA
3 Radio 2_2019 33 3 4_2019 AZ
4 TV 3_2019 408 4 1_2020 IL
5 Toaster 1_2020 37 5 2_2020 MA
6 TV 2_2020 408
7 TV 3_2020 408
then this results in
out1
# Product CustomerID.x Price CustomerID.y State
# 1 Toaster 1_2019 37 2_2019 AZ
# 2 Radio 1_2019 33 2_2019 AZ
# 3 Radio 2_2019 33 2_2019 AZ
# 4 TV 3_2019 408 2_2019 AZ
# 5 Toaster 1_2020 37 2_2019 AZ
# 6 TV 2_2020 408 2_2019 AZ
# 7 TV 3_2020 408 2_2019 AZ
# 8 Toaster 1_2019 37 3_2019 MA
# ...
# 33 Toaster 1_2020 37 2_2020 MA
# 34 TV 2_2020 408 2_2020 MA
# 35 TV 3_2020 408 2_2020 MA
which will very much destroy any sorting present in x
(Sales
), even if y
(Clients
) comes pre-sorted (which it does).
Because of this, if you want congruity between R and SQL cross-join solutions, I suggest the most transparent/clear way would be to merge
in R and then apply post-merge
ordering in a fashion that is similar to SQL. In fact, from a pedagogic perspective, ask the question: *"What ordering makes sense to humans?" If you assert during the lesson plan that ordering may not be assured until explicitly strong-armed into the process (via dplyr::arrange
, x[order(...),]
, or SQL's ORDER BY
clause). Find the intuitive ordering of the data and then demonstrate that in both R and SQL.
Side notes:
- Your
sqldf
query results in same-named columns, this results in some errors post-sqldf
if you start playing with columns. This can be mitigated withselect ... as ...
field-naming. - Lexicographic sorting of your data is unfortunately counter-intuitive at the moment: having year at the end of a customer id suggests (yes, I'm inferring) a timeline of customer onboarding, yet they will sort first by the leading number. Similar to how
"2020-05-04"
sorts correctly even as a string, while"05/04/2020"
does not, it might support more intuitive sorting to have the most-significant portion be the leading part of id strings. Or make them integers. Or UUIDs (v4, of course), those are always fun.
cartesian product with dplyr R
Use crossing from the tidyr
package:
x <- data.frame(x=c("a","b","c"))
y <- data.frame(y=c(1,2,3))
crossing(x, y)
Result:
x y
1 a 1
2 a 2
3 a 3
4 b 1
5 b 2
6 b 3
7 c 1
8 c 2
9 c 3
R data.table cross-join by three variables
You can also do this:
data[, .(date=dates_wanted), .(group,id)]
Output:
group id date
1: A frank 2020-01-01
2: A frank 2020-01-02
3: A frank 2020-01-03
4: A frank 2020-01-04
5: A frank 2020-01-05
---
120: B edward 2020-01-27
121: B edward 2020-01-28
122: B edward 2020-01-29
123: B edward 2020-01-30
124: B edward 2020-01-31
Related Topics
Capitalize the First Letter of Both Words in a Two Word String
Calculate the Mean For Each Column of a Matrix in R
Shifting Non-Na Cells to the Left
Geom_Rect and Alpha - Does This Work With Hard Coded Values
Difference Between the == and %In% Operators in R
Create Group Names For Consecutive Values
Trimming a Huge (3.5 Gb) CSV File to Read into R
Ggplot2 Geom_Bar - How to Keep Order of Data.Frame
Convert Type of Multiple Columns of a Dataframe At Once
How to Match Fuzzy Match Strings from Two Datasets
Select Equivalent Rows [A-B & B-A]
How to Number/Label Data-Table by Group-Number from Group_By
Replace Missing Values With Column Mean
Dplyr: Nonstandard Column Names (White Space, Punctuation, Starts With Numbers)
How to Unload a Package Without Restarting R