Which data.table syntax for left join (one column) to prefer

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
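As a concrete sketch of the "fill NAs" case just mentioned: the tables DT and FROM and their contents below are hypothetical, chosen only to illustrate the idiom. WHERE becomes is.na(v), so only the rows with a missing v are looked up in FROM.

```r
library(data.table)

# Hypothetical tables: DT has gaps in v, FROM holds the lookup values.
DT   <- data.table(id = c("a", "b", "c", "d"), v = c(1, NA, 3, NA))
FROM <- data.table(id = c("b", "c", "d"), v = c(20, 30, 40))

# WHERE is is.na(v): only rows of DT with a missing v take part in the join.
# .SD is that subset of DT; x.v is column v of FROM (the "x" of the inner join).
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]

print(DT)
```

Row "c" keeps its existing value even though FROM has an entry for it, because the WHERE clause excludes it from the lookup.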


Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)

# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.


I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
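For instance, mult= can be used to decide which match wins when FROM has duplicates. The following is only a sketch, reusing a cut-down version of the A and B2 tables from the illustration above; choosing "first" is an assumption about what is wanted, not a general recommendation.

```r
library(data.table)

A  <- data.table(id = letters[1:10])
B2 <- data.table(
  id      = c("c", "d", "e", "e"),
  comment = c("big", "slow", "nice", "nooice")
)

# mult = "first" keeps only the first matching row of B2 for each row of A,
# so the inner join returns exactly nrow(A) values and := raises no warning.
A[, comment2 := B2[A, on = .(id), mult = "first", x.comment]]

print(A)
```

With mult = "last" the value for id == "e" would be "nooice" instead of "nice".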

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row

Another related example: Left join using data.table

Left join using data.table

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]

Intuition for data.table join syntax

Right Join is the default and the new object (X) is right joined?

The reason for that is consistency with the base R way of subsetting vectors/matrices. I think there is an entry in the FAQ for that.
Notice that when you use := during a join you get a left join. There is an issue which discusses the consistency of merges with [ in base R; as far as I recall it is #1615.
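A small sketch of that analogy, using made-up tables X and Y (the names and values are assumptions for illustration only):

```r
library(data.table)

X <- data.table(id = c("a", "b", "c"), x_val = 1:3)
Y <- data.table(id = c("b", "c", "d"), y_val = 4:6)

# Base R: subsetting by name returns one element per requested name,
# with NA for the name "d" that is absent from vec.
vec <- c(a = 1L, b = 2L, c = 3L)
vec[c("b", "c", "d")]

# X[Y] behaves the same way: one result row per row of Y,
# with NA in X's columns where Y's id has no match.
res <- X[Y, on = .(id)]
print(res)

# With := the rows of X are kept and updated in place instead.
X[Y, on = .(id), y_val := i.y_val]
print(X)
```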

Using data.table to left join with equality and inequality conditions, and multiple matches per left table row

If a SQL-style left join (as detailed in the edit) is desired, this can be achieved using code quite similar to icecreamtoucan's suggestion in the comments:

B[A,on=.(name = name, age > age)]

Note: if the result set exceeds the sum of the row counts of the elements of the join, data.table will assume you've made a mistake (unlike SQL engines) and throw an error. The solution (assuming you have not made an error) is to add allow.cartesian = TRUE.

Additionally, and unlike SQL, this join does not return all columns from the constituent tables. Instead (and somewhat frustratingly for those coming from a SQL background) column values from the left table used in the inequality condition of the join will be returned in columns with the names of the right table column compared to it in the inequality join condition!

The solution here (which I found some time ago in another SO answer but can't find now) is to create duplicates of the join columns you want to keep, use those for the join conditions then specify the columns to keep in the join.

e.g.

A <- data.table( group = rep("WIZARD LEAGUE",3)
                ,name = rep("Fred",times=3)
                ,status_start = as.Date("2017-01-01") + c(0,370,545)
                ,status_end = as.Date("2017-01-01") + c(369,544,365*3-1)
                ,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
                         ,name = "Sally"
                         ,status_start = as.Date("2017-01-01")
                         ,status_end = as.Date("2019-12-31")
                         ,status = "CONTRACTED"))
> A
           group  name status_start status_end     status
1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED
3: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED
4: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED


B <- data.table( group = rep("WIZARD LEAGUE",times=5)
                ,loc_start = as.Date("2017-01-01") + 180*0:4
                ,loc_end = as.Date("2017-01-01") + 180*1:5-1
                ,loc = c("US","GER","FRA","ITA","MOR"))

> B
           group  loc_start    loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29  US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR

>#Try to join all rows whose date ranges intersect:

>B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end)]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in 12 rows; more than 9 =
nrow(x)+nrow(i). Check for duplicate key values in i each of which
join to the same group in x over and over again. If that's ok, try
by=.EACHI to run j for each group to avoid the large allocation. If
you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and data.table issue tracker for advice.

>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names

> B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end), allow.cartesian = TRUE]
            group  loc_start    loc_end loc  name     status
 1: WIZARD LEAGUE 2018-01-05 2017-01-01  US  Fred UNEMPLOYED
 2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER  Fred UNEMPLOYED
 3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA  Fred UNEMPLOYED
 4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA  Fred   EMPLOYED
 5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA  Fred   EMPLOYED
 6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA  Fred    RETIRED
 7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR  Fred    RETIRED
 8: WIZARD LEAGUE 2019-12-31 2017-01-01  US Sally CONTRACTED
 9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED

>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep

> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
    ,..keep_cols
    ,on=.( group = group
          ,end >= start_dup
          ,start <= end_dup)
    ,allow.cartesian = TRUE]
            group  name status_start status_end     status  loc_start    loc_end loc
 1: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29  US
 2: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
 3: WIZARD LEAGUE  Fred   2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
 4: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2017-12-27 2018-06-24 FRA
 5: WIZARD LEAGUE  Fred   2018-01-06 2018-06-29   EMPLOYED 2018-06-25 2018-12-21 ITA
 6: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-06-25 2018-12-21 ITA
 7: WIZARD LEAGUE  Fred   2018-06-30 2019-12-31    RETIRED 2018-12-22 2019-06-19 MOR
 8: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29  US
 9: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally   2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR

I'm certainly not the first person to point out these departures from SQL convention, or that it is rather cumbersome to reproduce that functionality (as seen above), and I do believe improvements are actively being considered.
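A possible alternative to duplicating the join columns, offered only as a sketch: inside j, the x. and i. prefixes refer to the original columns of the outer table and of the table passed as i, so the columns to keep can be listed explicitly. The data below is a condensed restatement of A and B from above; the variable name res is mine.

```r
library(data.table)

A <- data.table(
  group        = "WIZARD LEAGUE",
  name         = c("Fred", "Fred", "Fred", "Sally"),
  status_start = as.Date("2017-01-01") + c(0, 370, 545, 0),
  status_end   = as.Date("2017-01-01") + c(369, 544, 1094, 1094),
  status       = c("UNEMPLOYED", "EMPLOYED", "RETIRED", "CONTRACTED")
)
B <- data.table(
  group     = "WIZARD LEAGUE",
  loc_start = as.Date("2017-01-01") + 180 * 0:4,
  loc_end   = as.Date("2017-01-01") + 180 * 1:5 - 1,
  loc       = c("US", "GER", "FRA", "ITA", "MOR")
)

# x.loc_start / x.loc_end are B's own values; i.status_start / i.status_end
# are A's own values, so neither pair is overwritten by the join.
res <- B[A,
  on = .(group = group, loc_end >= status_start, loc_start <= status_end),
  .(group, name,
    status_start = i.status_start, status_end = i.status_end, status,
    loc_start = x.loc_start, loc_end = x.loc_end, loc),
  allow.cartesian = TRUE
]

print(res)
```

This leaves A and B unmodified, at the cost of spelling out every output column in j.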

To anyone contemplating alternative strategies (e.g. the sqldf package) I will say that while there are meritorious alternatives to data.table, I have struggled to find any solution that compares with the speed of data.table when very large datasets are involved, both with respect to joins and to other operations. Needless to say there are many other benefits that make this package indispensable to me and many others. So for those working with large datasets I would advise against abandoning data.table joins if the above looks cumbersome; instead, either get in the habit of going through these motions or write a helper function that replicates the sequence of actions until an improvement to the syntax comes along.

Finally, I did not mention disjunctive joins here, but as far as I can tell this is another shortcoming of the data.table approach (and another area where sqldf is helpful). I have been getting around these with ad-hoc "hacks" of a sort, but I would appreciate any helpful advice on the best way to treat these in data.table.

left outer join with data.table with different names for key variables

From ?data.table::merge

This merge method for data.table behaves very similarly to that of data.frames with one major exception: By default, the columns used to merge the data.tables are the shared key columns rather than the shared columns with the same names. Set the by, or by.x, by.y arguments explicitly to override this default.

So we can use the by arguments to override the keys.

library(data.table)

DT1 = data.table(x1=c("b","c", "a", "b", "a", "b"), x2a=1:6,m1=seq(10,60,by=10))
DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6," "," "),m2=5:10)

## you will get an error when joining a character column to an integer column,
## so coerce x2b first (the blank strings become NA, with a coercion warning):
DT2$x2b <- as.integer(DT2$x2b)
## Alternative:
## DT2 = data.table(x1=c("b","d", "c", "b","a","a"),x2b=c(1,4,7,6,NA,NA),m2=5:10)

merge(DT1, DT2, by.x=c('x1','x2a'), by.y=c('x1','x2b'), all.x=TRUE)

   x1 x2a m1 m2
1:  a   3 30 NA
2:  a   5 50 NA
3:  b   1 10  5
4:  b   4 40 NA
5:  b   6 60  8
6:  c   2 20 NA
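The same left join can also be sketched with the [ syntax, where a named character vector in on= maps differently named columns: the name is the column in the outer table (DT2) and the value is the column in the table passed as i (DT1). This uses the "Alternative" definition of DT2 with integer NAs; the variable name res is mine.

```r
library(data.table)

DT1 <- data.table(x1 = c("b", "c", "a", "b", "a", "b"),
                  x2a = 1:6, m1 = seq(10, 60, by = 10))
DT2 <- data.table(x1 = c("b", "d", "c", "b", "a", "a"),
                  x2b = c(1L, 4L, 7L, 6L, NA, NA), m2 = 5:10)

# One result row per row of DT1; x1 matches x1, and DT2's x2b matches DT1's x2a.
res <- DT2[DT1, on = c("x1", x2b = "x2a")]

print(res)
```

Unlike merge(), the result keeps the row order of DT1 and names the second join column x2b rather than x2a.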

Data.table - left outer join on multiple tables

I just committed a new feature in data.table, v1.9.5, with which we can join without setting keys (that is, specify the columns to join by directly, without having to use setkey() first):

With that, this is simply:

require(data.table) # v1.9.5+
fruits[tastes, on="FruitID"][colors, on="FruitID"] # no setkey required
#    FruitID      Fruit TasteID  Taste ColorID  Color
# 1:       1      Apple       1 Sweeet       1    Red
# 2:       1      Apple       2   Sour       1    Red
# 3:       1      Apple       1 Sweeet       2 Yellow
# 4:       1      Apple       2   Sour       2 Yellow
# 5:       1      Apple       1 Sweeet       3  Green
# 6:       1      Apple       2   Sour       3  Green
# 7:       2         NA      NA     NA       4 Yellow
# 8:       3 Strawberry       3  Sweet       5    Red

Left join only selected columns in R with the merge() function

You can do this by subsetting the data you pass into your merge:

merge(x = DF1, y = DF2[ , c("Client", "LO")], by = "Client", all.x=TRUE)

Or you can simply delete the column after your current merge :)
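A minimal runnable sketch of the subsetting approach. The contents of DF1 and DF2, including the extra column CON, are invented here purely for illustration.

```r
# Hypothetical data: DF2 carries an extra column (CON) that we do not want.
DF1 <- data.frame(Client = c("A", "B", "C"), Q = 1:3)
DF2 <- data.frame(Client = c("A", "B"), LO = c("x", "y"), CON = c(10, 20))

# Subset DF2 to the key plus the wanted column before merging.
res <- merge(x = DF1, y = DF2[, c("Client", "LO")], by = "Client", all.x = TRUE)

print(res)
```

Client "C" has no match in DF2, so its LO is NA, and CON never enters the result.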


