Left join using data.table
You can try this:
library(data.table)

# example data; set the key of 'B' to the column used for the join
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')
B[A]
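To make the left-join behavior concrete, here is a sketch of what the join above returns (recent data.table versions; since both tables have a column b, A's version comes back prefixed i.):

```r
library(data.table)
A <- data.table(a = 1:4, b = 12:15)
B <- data.table(a = 2:3, b = 13:14, key = 'a')

# every row of A is kept; rows of A without a match in B get NA for B's columns
B[A]
#        a     b   i.b
# 1:     1    NA    12
# 2:     2    13    13
# 3:     3    14    14
# 4:     4    NA    15
```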
Which data.table syntax for left join (one column) to prefer
I prefer the "update join" idiom for efficiency and maintainability**:
DT[WHERE, v := FROM[.SD, on=, x.v]]
It's an extension of what is shown in vignette("datatable-reference-semantics")
under "Update some rows of columns by reference - sub-assign by reference". Once there is a vignette available on joins, that should also be a good reference.
This is efficient, since it only uses the rows selected by WHERE and modifies or adds the column in place, instead of making a new table like the more concise left join FROM[DT, on=].

It makes my code more readable, since I can easily see that the point of the join is to add column v; and I don't have to think through "left"/"right" jargon from SQL or whether the number of rows is preserved after the join.

It is useful for code maintenance, since if I later want to find out how DT got a column named v, I can search my code for v :=, while FROM[DT, on=] obscures which new columns are being added. Also, it allows the WHERE condition, while the left join does not. This may be useful, for example, if using FROM to "fill" NAs in an existing column v.
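A minimal sketch of that NA-filling case (the table and column names here are made up for illustration): only the rows selected by the WHERE condition are touched, and the rest of DT is left alone:

```r
library(data.table)
DT   <- data.table(id = c("a", "b", "c"), v = c(1, NA, NA))
FROM <- data.table(id = c("b", "c"),      v = c(2, 3))

# fill NAs in DT$v from FROM, leaving existing values untouched;
# .SD here is the subset of DT rows where v is missing
DT[is.na(v), v := FROM[.SD, on = .(id), x.v]]
```

After the update, DT$v is c(1, 2, 3): row "a" keeps its original value, and only the missing rows were looked up in FROM.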
Compared with the other update join approach, DT[FROM, on=, v := i.v], I can think of two advantages. First is the option of using the WHERE clause, and second is transparency through warnings when there are problems with the join, like duplicate matches in FROM conditional on the on= rules. Here's an illustration extending the OP's example:
library(data.table)
A <- data.table(id = letters[1:10], amount = rnorm(10)^2)
B2 <- data.table(
id = c("c", "d", "e", "e"),
ord = 1:4,
comment = c("big", "slow", "nice", "nooice")
)
# left-joiny update
A[B2, on=.(id), comment := i.comment, verbose=TRUE]
# Calculated ad hoc index in 0.000s elapsed (0.000s cpu)
# Starting bmerge ...done in 0.000s elapsed (0.000s cpu)
# Detected that j uses these columns: comment,i.comment
# Assigning to 4 row subset of 10 rows
# my preferred update
A[, comment2 := B2[A, on=.(id), x.comment]]
# Warning message:
# In `[.data.table`(A, , `:=`(comment2, B2[A, on = .(id), x.comment])) :
# Supplied 11 items to be assigned to 10 items of column 'comment2' (1 unused)
id amount comment comment2
1: a 0.20000990 <NA> <NA>
2: b 1.42146573 <NA> <NA>
3: c 0.73047544 big big
4: d 0.04128676 slow slow
5: e 0.82195377 nooice nice
6: f 0.39013550 <NA> nooice
7: g 0.27019768 <NA> <NA>
8: h 0.36017876 <NA> <NA>
9: i 1.81865721 <NA> <NA>
10: j 4.86711754 <NA> <NA>
In the left-join-flavored update, you silently get the final value of comment even though there are two matches for id == "e"; while in the other update, you get a helpful warning message (upgraded to an error in a future release). Even turning on verbose=TRUE with the left-joiny approach is not informative -- it says there are four rows being updated but doesn't say that one row is being updated twice.
I find that this approach works best when my data is arranged into a set of tidy/relational tables. A good reference on that is Hadley Wickham's paper.
** In this idiom, the on= part should be filled in with the join column names and rules, like on=.(id) or on=.(from_date >= dt_date). Further join rules can be passed with roll=, mult= and nomatch=. See ?data.table for details. Thanks to @RYoda for noting this point in the comments.
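As a quick illustration of one of those options (with made-up data): mult = "first" keeps only the first matching row per lookup, which sidesteps the duplicate-match ambiguity shown above:

```r
library(data.table)
A2 <- data.table(id = c("c", "e"))
B3 <- data.table(id = c("e", "e"), comment = c("nice", "nooice"))

# take only the first matching row of B3 for each row of A2
B3[A2, on = .(id), mult = "first"]
#        id comment
# 1:      c    <NA>
# 2:      e    nice
```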
Here is a more complicated example from Matt Dowle explaining roll=
: Find time to nearest occurrence of particular value for each row
Another related example: Left join using data.table
Left join using Data Table
A similar option in data.table would be:
library(data.table)
out <- as.data.table(airports)[as.data.table(flights)[, .SD,
.SDcols = c("year", "month", "day", "hour", "origin", "dest", "carrier")],
on = .(faa = dest)]
# left_join_tv is the dplyr left_join result from the question
all.equal(dim(left_join_tv), dim(out))
#[1] TRUE
Using data.table to left join with equality and inequality conditions, and multiple matches per left table row
If a SQL style left join (as detailed in the edit) is desired, this can be achieved using code quite similar to icecreamtoucan's suggestion in the comments:
B[A,on=.(name = name, age > age)]
Note: if the result set exceeds the sum of the row counts of the two tables being joined, data.table will assume you've made a mistake (unlike SQL engines) and throw an error. The solution (assuming you have not made an error) is to add allow.cartesian = TRUE.

Additionally, and unlike SQL, this join does not return all columns from the constituent tables. Instead (and somewhat frustratingly for those coming from a SQL background), columns from the left table that are used in the inequality conditions come back under the names of the right-table columns they were compared against.

The solution here (which I found some time ago in another SO answer but can't find now) is to create duplicates of the join columns you want to keep, use those in the join conditions, then specify the columns to keep in the join.
e.g.
A <- data.table( group = rep("WIZARD LEAGUE",3)
,name = rep("Fred",time=3)
,status_start = as.Date("2017-01-01") + c(0,370,545)
,status_end = as.Date("2017-01-01") + c(369,544,365*3-1)
,status = c("UNEMPLOYED","EMPLOYED","RETIRED"))
A <- rbind(A, data.table( group = "WIZARD LEAGUE"
,name = "Sally"
,status_start = as.Date("2017-01-01")
,status_end = as.Date("2019-12-31")
,status = "CONTRACTED"))
> A
group name status_start status_end status
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED
2: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED
3: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED
4: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED
B <- data.table( group = rep("WIZARD LEAGUE",time=5)
,loc_start = as.Date("2017-01-01") + 180*0:4
,loc_end = as.Date("2017-01-01") + 180*1:5-1
, loc = c("US","GER","FRA","ITA","MOR"))
> B
group loc_start loc_end loc
1: WIZARD LEAGUE 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE 2018-06-25 2018-12-21 ITA
5: WIZARD LEAGUE 2018-12-22 2019-06-19 MOR
>#Try to join all rows whose date ranges intersect:
>B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end)]
Error in vecseq(f__, len__, if (allow.cartesian || notjoin ||
!anyDuplicated(f__, : Join results in 12 rows; more than 9 =
nrow(x)+nrow(i). Check for duplicate key values in i each of which
join to the same group in x over and over again. If that's ok, try
by=.EACHI to run j for each group to avoid the large allocation. If
you are sure you wish to proceed, rerun with allow.cartesian=TRUE.
Otherwise, please search for this error message in the FAQ, Wiki,
Stack Overflow and data.table issue tracker for advice.
>#Try the join with allow.cartesian = TRUE
>#this succeeds but messes up column names
> B[A,on=.(group = group, loc_end >= status_start, loc_start <= status_end), allow.cartesian = TRUE]
group loc_start loc_end loc name status
1: WIZARD LEAGUE 2018-01-05 2017-01-01 US Fred UNEMPLOYED
2: WIZARD LEAGUE 2018-01-05 2017-01-01 GER Fred UNEMPLOYED
3: WIZARD LEAGUE 2018-01-05 2017-01-01 FRA Fred UNEMPLOYED
4: WIZARD LEAGUE 2018-06-29 2018-01-06 FRA Fred EMPLOYED
5: WIZARD LEAGUE 2018-06-29 2018-01-06 ITA Fred EMPLOYED
6: WIZARD LEAGUE 2019-12-31 2018-06-30 ITA Fred RETIRED
7: WIZARD LEAGUE 2019-12-31 2018-06-30 MOR Fred RETIRED
8: WIZARD LEAGUE 2019-12-31 2017-01-01 US Sally CONTRACTED
9: WIZARD LEAGUE 2019-12-31 2017-01-01 GER Sally CONTRACTED
10: WIZARD LEAGUE 2019-12-31 2017-01-01 FRA Sally CONTRACTED
11: WIZARD LEAGUE 2019-12-31 2017-01-01 ITA Sally CONTRACTED
12: WIZARD LEAGUE 2019-12-31 2017-01-01 MOR Sally CONTRACTED
>#Create aliased duplicates of the columns in the inequality condition
>#and specify the columns to keep
> keep_cols <- c(names(A),setdiff(names(B),names(A)))
> A[,start_dup := status_start]
> A[,end_dup := status_end]
> B[,start := loc_start]
> B[,end := loc_end]
>
>#Now the join works as expected (by SQL convention)
>
> B[ A
,..keep_cols
,on=.( group = group
,end >= start_dup
,start <= end_dup)
,allow.cartesian = TRUE]
group name status_start status_end status loc_start loc_end loc
1: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-01-01 2017-06-29 US
2: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-06-30 2017-12-26 GER
3: WIZARD LEAGUE Fred 2017-01-01 2018-01-05 UNEMPLOYED 2017-12-27 2018-06-24 FRA
4: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2017-12-27 2018-06-24 FRA
5: WIZARD LEAGUE Fred 2018-01-06 2018-06-29 EMPLOYED 2018-06-25 2018-12-21 ITA
6: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-06-25 2018-12-21 ITA
7: WIZARD LEAGUE Fred 2018-06-30 2019-12-31 RETIRED 2018-12-22 2019-06-19 MOR
8: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-01-01 2017-06-29 US
9: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-06-30 2017-12-26 GER
10: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2017-12-27 2018-06-24 FRA
11: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-06-25 2018-12-21 ITA
12: WIZARD LEAGUE Sally 2017-01-01 2019-12-31 CONTRACTED 2018-12-22 2019-06-19 MOR
I'm certainly not the first person to point out these departures from SQL convention, or that it is rather cumbersome to reproduce that functionality (as seen above), and I do believe improvements are actively being considered.
To anyone contemplating alternative strategies (e.g. the sqldf package) I will say that while there are meritorious alternatives to data.table, I have struggled to find any solution that compares with the speed of data.table when very large datasets are involved, both with respect to joins as well as other operations. Needless to say there are many other benefits that make this package indispensable to me and many others. So for those working with large datasets I would advise against abandoning data.table joins if the above looks cumbersome, and instead either get in the habit of going through these motions or else write a helper function that replicates the sequence of actions until an improvement to the syntax comes along.

Finally, I did not mention disjunctive joins here, but as far as I can tell this is another shortcoming of the data.table approach (and another area where sqldf is helpful). I have been getting around these with ad-hoc "hacks" of a sort, but I would appreciate any helpful advice on the best way to treat these in data.table.
Conditional joins in data.table - left join with column selection
You can also include the selection of variables within the first query.
DT2[DT1, .(x, y, z, t), on = .(a = x, y_start <= y, y_end >= y)]
# x y z t
#1: 1 15 -0.6264538 NA
#2: 2 25 0.1836433 1206
#3: 3 35 -0.8356286 NA
#4: 4 45 1.5952808 NA
#5: 5 55 0.3295078 1686
Regards!
Left join 2 datatable
Here is a solution using LINQ. Note that a plain join is an inner join; for a true left join you need into plus DefaultIfEmpty(), so that unmatched tableA rows are kept:
var results = from ta in tableA
              join tb in tableB on ta.id equals tb.id into g
              from tb in g.DefaultIfEmpty()   // tb is null when a tableA row has no match
              select new
              {
                  ta.id,
                  ta.Name,
                  Age = tb?.Age,
                  Height = tb?.Height
              };
results will be the new data you are expecting.
Why does X[Y] join of data.tables not allow a full outer join, or a left join?
To quote from the data.table FAQ 1.11, "What is the difference between X[Y] and merge(X, Y)?":

X[Y] is a join, looking up X's rows using Y (or Y's key if it has one) as an index.

Y[X] is a join, looking up Y's rows using X (or X's key if it has one) as an index.

merge(X,Y) does both ways at the same time. The number of rows of X[Y] and Y[X] usually differ, whereas the number of rows returned by merge(X,Y) and merge(Y,X) is the same.

BUT that misses the main point. Most tasks require something to be done on the data after a join or merge. Why merge all the columns of data, only to use a small subset of them afterwards? You may suggest merge(X[,ColsNeeded1], Y[,ColsNeeded2]), but that requires the programmer to work out which columns are needed. X[Y,j] in data.table does all that in one step for you. When you write X[Y, sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses. It will only subset those columns; the others are ignored. Memory is only created for the columns j uses, and Y columns enjoy standard R recycling rules within the context of each group. Let's say foo is in X, and bar is in Y (along with 20 other columns in Y). Isn't X[Y, sum(foo*bar)] quicker to program and quicker to run than a merge of everything wastefully followed by a subset?
If you want a left outer join of X[Y]:
le <- Y[X]
mallx <- merge(X, Y, all.x = T)
# the column order is different so change to be the same as `merge`
setcolorder(le, names(mallx))
identical(le, mallx)
# [1] TRUE
If you want a full outer join:
# the unique values for the keys over both data sets
unique_keys <- unique(c(X[,t], Y[,t]))
Y[X[J(unique_keys)]]
## t b a
## 1: 1 NA 1
## 2: 2 NA 4
## 3: 3 9 9
## 4: 4 16 16
## 5: 5 25 NA
## 6: 6 36 NA
# The following will give the same with the column order X,Y
X[Y[J(unique_keys)]]
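For comparison, here is a merge()-based sketch of the same full outer join, reconstructing X and Y to match the output shown above:

```r
library(data.table)
X <- data.table(t = 1:4, a = (1:4)^2, key = "t")
Y <- data.table(t = 3:6, b = (3:6)^2, key = "t")

# all = TRUE keeps key values present in either table (full outer join)
merge(X, Y, by = "t", all = TRUE)
#    t  a  b
# 1: 1  1 NA
# 2: 2  4 NA
# 3: 3  9  9
# 4: 4 16 16
# 5: 5 NA 25
# 6: 6 NA 36
```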
Efficiently joining two data.tables in R (or SQL tables) on date ranges?
data.table
For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.
observations
# dt_taken patient_id observation value
# 1 2020-04-13 00:00:00 patient01 Heart rate 69
admissions
# patient_id admission_id startdate enddate
# 1 patient01 admission01 2020-04-01 00:04:20 2020-05-01 00:23:59
### convert to data.table
setDT(observations)
setDT(admissions)
### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
dates <- c("startdate", "enddate")
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]
And the join.
admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
# patient_id admission_id startdate enddate observation value
# <char> <char> <POSc> <POSc> <char> <int>
# 1: patient01 admission01 2020-04-13 2020-04-13 Heart rate 69
Two things that I believe are noteworthy:

In SQL (and similarly in other join-friendly languages), a join is often shown as

select ... from TABLE1 left join TABLE2 ...

suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly geared towards a left join since that's all that data.table::`[` supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.) From this, data.table::`[`'s mechanism is effectively

TABLE2[TABLE1, on = .(...)]
RHS[LHS, on = .(...)]

(meaning that the right-hand-side table is actually the first table from left-to-right).

The names in the output of inequi-joins are preserved from the RHS; see that dt_taken is not found. However, the values of those startdate and enddate columns are from dt_taken.

Because of this, I've often found the simplest way to wrap my brain around the renaming and values and such is: when I'm not certain, I copy a join column into a new column and join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and thinking it was not what I had thought.
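A minimal sketch of that copy-then-drop trick on the same data (the dt_taken2 helper name is made up for illustration): the copy gets consumed and renamed by the range join, while the original dt_taken survives in the result with its true values:

```r
library(data.table)
admissions <- data.table(
  patient_id   = "patient01",
  admission_id = "admission01",
  startdate    = as.POSIXct("2020-04-01 00:04:20"),
  enddate      = as.POSIXct("2020-05-01 00:23:59")
)
observations <- data.table(
  patient_id  = "patient01",
  dt_taken    = as.POSIXct("2020-04-13 00:00:00"),
  observation = "Heart rate",
  value       = 69L
)

# join on a copy of dt_taken so the original column survives the range join
observations[, dt_taken2 := dt_taken]
out <- admissions[observations,
                  on = .(patient_id, startdate <= dt_taken2, enddate >= dt_taken2)]
out[, c("startdate", "enddate") := NULL]  # these hold dt_taken2's values anyway
observations[, dt_taken2 := NULL]         # tidy up the helper column
```

out now keeps dt_taken with its real timestamps alongside admission_id, with no renamed range columns to puzzle over.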
sqldf
This might be a little more direct if SQL seems more intuitive.
sqldf::sqldf(
"select ob.*, ad.admission_id
from observations ob
left join admissions ad on ob.patient_id=ad.patient_id
and ob.dt_taken between ad.startdate and ad.enddate")
# dt_taken patient_id observation value admission_id
# 1 2020-04-13 patient01 Heart rate 69 admission01
Data (already data.table with POSIXt; works just as well with sqldf, though regular data.frames will work just fine, too):
admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
(I use setDT
to repair the fact that we can't pass the .internal.selfref
attribute here.)