Left Join Using Data.Table

Left join using data.table

You can try this:

# used data
# set the key in 'B' to the column which you use to join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

B[A]
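With the on= argument you can do the same left join without setting a key first. A minimal sketch reusing the tables above (note that since both tables have a column b, A's b shows up as i.b in the result):

```r
library(data.table)
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14)

# Left join of B onto A: all four rows of A are kept,
# with NA in B's columns where there is no match
B[A, on = .(a)]
```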

Which data.table syntax for left join (one column) to prefer

I prefer the "update join" idiom for efficiency and maintainability:**

DT[WHERE, v := FROM[.SD, on=, x.v]]

It's an extension of what is shown in vignette("datatable-reference-semantics") under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.

This is efficient since it only uses the rows selected by WHERE and modifies or adds the column in-place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
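The NA-filling use just mentioned can be sketched like this (DT and FROM are made-up tables for illustration):

```r
library(data.table)
DT   <- data.table(id = c("a", "b", "c"), v = c(1, NA, NA))
FROM <- data.table(id = c("b", "c"),     v = c(20, 30))

# The WHERE part restricts the update to rows where v is missing,
# so existing values of v are left untouched
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]
DT$v
# [1]  1 20 30
```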


Compared with the other update join approach DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:

library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
  id = c("c", "d", "e", "e"),
  ord = 1:4,
  comment = c("big", "slow", "nice", "nooice")
)


# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows

# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)

    id     amount comment comment2
 1:  a 0.20000990    <NA>     <NA>
 2:  b 1.42146573    <NA>     <NA>
 3:  c 0.73047544     big      big
 4:  d 0.04128676    slow     slow
 5:  e 0.82195377  nooice     nice
 6:  f 0.39013550    <NA>   nooice
 7:  g 0.27019768    <NA>     <NA>
 8:  h 0.36017876    <NA>     <NA>
 9:  i 1.81865721    <NA>     <NA>
10:  j 4.86711754    <NA>     <NA>

In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.


I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.

** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.

Here is a more complicated example from Matt Dowle explaining roll=: Find time to nearest occurrence of particular value for each row
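For a self-contained taste of roll=, here is a toy rolling join (the quotes/trades tables and their values are assumed for illustration); each trade is matched to the most recent quote at or before its time:

```r
library(data.table)
quotes <- data.table(time = c(1, 5, 10), px = c(100, 101, 102), key = "time")
trades <- data.table(time = c(4, 11))

# roll = TRUE rolls the last observation forward (LOCF):
# time 4 gets px 100 (quote at time 1); time 11 gets px 102 (quote at time 10)
quotes[trades, on = .(time), roll = TRUE]
```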

Another related example: Left join using data.table

Left join using Data Table

A similar option in data.table would be:

library(data.table)
out <- as.data.table(airports)[
  as.data.table(flights)[, .SD,
    .SDcols = c("year", "month", "day", "hour", "origin", "dest", "carrier")],
  on = .(faa = dest)
]
all.equal(dim(left_join_tv), dim(out))
# [1] TRUE

Using data.table to left join with equality and inequality conditions, and multiple matches per left table row

If a SQL-style left join (as detailed in the edit) is desired, this can be achieved using code quite similar to icecreamtoucan's suggestion in the comments:

B[A, on = .(name = name, age > age)]

Note: if the result set exceeds the sum of the row counts of the elements of the join, data.table will assume you've made a mistake (unlike SQL engines) and throw an error. The solution (assuming you have not made an error) is to add allow.cartesian = TRUE.

Additionally, and unlike SQL, this join does not return all columns from the constituent tables. Instead (and somewhat frustratingly for those coming from a SQL background), the values of the left table's columns used in the inequality conditions are returned in columns named after the right table's columns they were compared against.

The solution here (which I found some time ago in another SO answer but can't find now) is to create duplicates of the join columns you want to keep, use those in the join conditions, then specify the columns to keep in the join.

e.g.

A <- data.table( group = rep("WIZARD LEAGUE",3)
,name = rep("Fred",time=3)
,status_start = as.Date("2017-01-01") + c(0,370,545)
,status_end = as.Date("2017-01-01") + c(369,544,365*3-1)
,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
,name = "Sally"
,status_start = as.Date("2017-01-01")
,status_end = as.Date("2019-12-31")
,status = "CONTRACTED"))
> A
group name status_start status_end status
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED
3: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED
4: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED


B <- data.table( group = rep("WIZARD LEAGUE",time=5)
,loc_start = as.Date("2017-01-01") + 180*0:4
,loc_end = as.Date("2017-01-01") + 180*1:5-1
, loc = c("US","GER","FRA","ITA","MOR"))

> B
group loc_start loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR

>#Try to join all rows whose date ranges intersect:

>B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end)]

Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in 12 rows; more than 9 =
nrow(x)+nrow(i). Check for duplicate key values in i each of which
join to the same group in x over and over again. If that's ok, try
by=.EACHI to run j for each group to avoid the large allocation. If
you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and data.table issue tracker for advice.

>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names

> B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end), allow.cartesian = TRUE]
group loc_start loc_end loc name status
1: WIZARD LEAGUE 2018-01-05 2017-01-01 US Fred UNEMPLOYED
2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER Fred UNEMPLOYED
3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA Fred UNEMPLOYED
4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA Fred EMPLOYED
5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA Fred EMPLOYED
6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA Fred RETIRED
7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR Fred RETIRED
8: WIZARD LEAGUE 2019-12-31 2017-01-01 US Sally CONTRACTED
9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED

>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep

> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
,..keep_cols
,on=.( group = group
,end >= start_dup
,start <= end_dup)
,allow.cartesian = TRUE]
group name status_start status_end status loc_start loc_end loc
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2017-12-27 2018-06-24 FRA
5: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2018-06-25 2018-12-21 ITA
6: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-06-25 2018-12-21 ITA
7: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-12-22 2019-06-19 MOR
8: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29 US
9: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR

I'm certainly not the first person to point out these departures from SQL convention, or that it is rather cumbersome to reproduce that functionality (as seen above), and I do believe improvements are actively being considered.

To anyone contemplating alternative strategies (e.g. the sqldf package): while there are meritorious alternatives to data.table, I have struggled to find any solution that compares with data.table's speed on very large datasets, for joins as well as other operations. Needless to say, there are many other benefits that make this package indispensable to me and many others. So for those working with large datasets, I would advise against abandoning data.table joins just because the above looks cumbersome; instead, either get in the habit of going through these motions or write a helper function that replicates the sequence of actions until an improvement to the syntax comes along.

Finally, I did not mention disjunctive joins here, but as far as I can tell this is another shortcoming of the data.table approach (and another area where sqldf is helpful). I have been getting around these with ad-hoc "hacks" of a sort, but I would appreciate any helpful advice on the best way to treat these in data.table.

Conditional joins in data.table - left join with column selection

You can also include the selection of variables within the first query.

DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
# x y z t
#1: 1 15 -0.6264538 NA
#2: 2 25 0.1836433 1206
#3: 3 35 -0.8356286 NA
#4: 4 45 1.5952808 NA
#5: 5 55 0.3295078 1686


Left join 2 datatable

Here is a LINQ solution. Note the group join into g followed by DefaultIfEmpty(): without it, the query would be an inner join, and unmatched rows of tableA would be dropped.

var results = from ta in tableA
              join tb in tableB on ta.id equals tb.id into g
              from tb in g.DefaultIfEmpty()
              select new
              {
                  ta.id,
                  ta.Name,
                  Age = tb?.Age,
                  Height = tb?.Height
              };

results will contain every row of tableA, with null Age and Height wherever tableB has no matching id.

Why does X[Y] join of data.tables not allow a full outer join, or a left join?

To quote from the data.table FAQ 1.11 What is the difference between X[Y] and merge(X, Y)?

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.

Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.

merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1], Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will subset only those columns; the others are ignored. Memory is allocated only for the columns j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y,sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
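To make the quoted point concrete, a tiny sketch (toy tables, values assumed for illustration):

```r
library(data.table)
X <- data.table(t = 1:4, foo = c(2, 4, 6, 8), key = "t")
Y <- data.table(t = 2:3, bar = c(10, 100), key = "t")

# Join and aggregate in one step; only foo and bar are inspected and used
X[Y, sum(foo * bar)]
# [1] 640   (4*10 + 6*100)
```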


If you want a left outer join of X[Y]

le <- Y[X]
mallx <- merge(X, Y, all.x = TRUE)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE

If you want a full outer join

# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA

# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]

Efficiently joining two data.tables in R (or SQL tables) on date ranges?

data.table

For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.

observations
# dt_taken patient_id observation value
# 1 2020-04-13 00:00:00 patient01 Heart rate 69
admissions
# patient_id admission_id startdate enddate
# 1 patient01 admission01 2020-04-01 00:04:20 2020-05-01 00:23:59

### convert to data.table
setDT(observations)
setDT(admissions)

### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
dates <- c("startdate", "enddate")   # the date columns to convert
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]

And the join.

admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
# patient_id admission_id startdate enddate observation value
# <char> <char> <POSc> <POSc> <char> <int>
# 1: patient01 admission01 2020-04-13 2020-04-13 Heart rate 69

Two things that I believe are noteworthy:

  1. In SQL (and similarly in other join-friendly languages), it is often shown as

    select ...
    from TABLE1 left join TABLE2 ...

    suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly geared towards a left join since that's all that data.table::[ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)

    From this, data.table::['s mechanism is effectively

    TABLE2[TABLE1, on = .(...)]
    RHS[LHS, on = .(...)]

    (Meaning that the right-hand-side table is actually the first table from left-to-right ...)


  2. The names in the output of non-equi joins are preserved from the RHS; see that dt_taken is not found. However, the values of those startdate and enddate columns are from dt_taken.

    Because of this, when I'm not certain, I find the simplest way to keep the renaming and values straight is to copy a join column into a new column, join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and assuming a result was something it was not.

sqldf

This might be a little more direct if SQL seems more intuitive.

sqldf::sqldf(
  "select ob.*, ad.admission_id
   from observations ob
   left join admissions ad on ob.patient_id = ad.patient_id
     and ob.dt_taken between ad.startdate and ad.enddate")
# dt_taken patient_id observation value admission_id
# 1 2020-04-13 patient01 Heart rate 69 admission01

Data (already data.table with POSIXt columns; it works just as well with sqldf, and regular data.frames are fine, too):

admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))

(I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)


