What does x. stand for in data.table joins with on=?
When doing a non-equi join like X[Y, on = .(A < A)], data.table returns the A column from Y (the i data.table). To get the desired result, you could do:
X[Y, on = .(A < A), .(A = x.A, B)]
which gives:
A B
1: 1 1
2: 2 1
3: 3 1
In the next release, data.table will return both A columns. See here for the discussion.
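A minimal reproducible sketch; the tables here are assumed, chosen so the join reproduces the output shown above:

```r
library(data.table)

## Assumed data: X has the values we want back, Y drives the non-equi match
X <- data.table(A = 1:3, B = 1L)
Y <- data.table(A = 4L)

## The condition x.A < i.A matches every row of X with A < 4;
## asking for x.A explicitly returns X's A values instead of Y's
X[Y, on = .(A < A), .(A = x.A, B)]
#    A B
# 1: 1 1
# 2: 2 1
# 3: 3 1
```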
Intuition for data.table join syntax
Right Join is the default and the new object (X) is right joined?
The reason for that is consistency with the base R way of subsetting vectors/matrices. I think there is an entry in the FAQ about that.
Notice that when you use := during a join you get a left join. There is an issue which discusses consistency of merges with [ relative to base R, AFAIR #1615.
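A small sketch of the two behaviours; the tables here are assumed for illustration:

```r
library(data.table)

## Assumed tables with one overlapping id
X <- data.table(id = c(1, 2), x = c("a", "b"))
Y <- data.table(id = c(2, 3), y = c("B", "C"))

## X[Y] keeps every row of Y (a "right join" from X's point of view)
X[Y, on = "id"]   # rows for id 2 and 3; x is NA for id 3

## Using := during the join updates X in place, keeping all rows of X
## (effectively a left join onto X)
X[Y, y := i.y, on = "id"]
X                 # rows for id 1 and 2; y is NA for id 1
```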
R Data.Table Join on Conditionals
It's a bit ugly but works:
library(data.table)
library(sqldf)

dt <- data.table(num  = c(1, 2, 3, 4, 5, 6),
                 char = c('A', 'A', 'A', 'B', 'B', 'B'),
                 bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
dt_two <- data.table(num  = c(6, 1, 5, 2, 4, 3),
                     char = c('A', 'A', 'A', 'B', 'B', 'B'),
                     bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))
dt_out_sql <- sqldf('
  select dtone.num,
         dtone.char,
         dtone.bool,
         SUM(dttwo.num) as SUM,
         MIN(dttwo.num) as MIN
  from dt as dtone
  INNER join dt_two as dttwo on
    (dtone.char = dttwo.char) and
    (dtone.num >= dttwo.num OR dtone.bool)
  GROUP BY dtone.num, dtone.char, dtone.bool
')
setDT(dt_out_sql)
setkey(dt, char)
setkey(dt_two, char)
dt_out_r <- dt[dt_two,
               list(dtone.num = num,
                    dttwo.num = i.num,
                    char,
                    bool),
               nomatch = 0, allow.cartesian = TRUE
][dtone.num >= dttwo.num | bool,
  list(SUM = sum(dttwo.num),
       MIN = min(dttwo.num)),
  by = list(num = dtone.num,
            char,
            bool)
]
setkey(dt_out_r, num, char, bool)
all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)
data.table join is hard to understand
This is a non-equi join:
- it joins on the same x in both tables: b and c in this case
- it keeps only the rows of DT where DT$y <= X$foo
Perhaps easier to understand like this :
DT[X,.(x.x, x.y, x.v, i.x, i.v, i.foo,`y < foo`= x.y < i.foo ), on = .(x = x, y <= foo)]
x.x x.y x.v i.x i.v i.foo y < foo
1: c 1 7 c 8 4 TRUE
2: c 3 8 c 8 4 TRUE
3: b 1 1 b 7 2 TRUE
Where:
- x. are the columns of the LHS table (DT)
- i. are the columns of the RHS table (X); to remember i., think about DT[i, j, by]
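The input tables are not shown above; the output is consistent with the canonical non-equi join example from ?data.table, so the data is presumably:

```r
library(data.table)

## Presumed input, matching the canonical example in ?data.table
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
X  <- data.table(x = c("c", "b"), v = 8:7, foo = c(4, 2))

DT[X, .(x.x, x.y, x.v, i.x, i.v, i.foo, `y < foo` = x.y < i.foo),
   on = .(x = x, y <= foo)]
#    x.x x.y x.v i.x i.v i.foo y < foo
# 1:   c   1   7   c   8     4    TRUE
# 2:   c   3   8   c   8     4    TRUE
# 3:   b   1   1   b   7     2    TRUE
```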
Efficiently joining two data.tables in R (or SQL tables) on date ranges?
data.table
For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.
observations
# dt_taken patient_id observation value
# 1 2020-04-13 00:00:00 patient01 Heart rate 69
admissions
# patient_id admission_id startdate enddate
# 1 patient01 admission01 2020-04-01 00:04:20 2020-05-01 00:23:59
### convert to data.table
setDT(observations)
setDT(admissions)
### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
dates <- c("startdate", "enddate")   # the date columns to convert
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]
And the join.
admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
# patient_id admission_id startdate enddate observation value
# <char> <char> <POSc> <POSc> <char> <int>
# 1: patient01 admission01 2020-04-13 2020-04-13 Heart rate 69
Two things that I believe are noteworthy:

In SQL (and similarly in other join-friendly languages), it is often shown as

select ...
from TABLE1 left join TABLE2 ...

suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly geared towards a left join since that's all that data.table's [ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.) From this, data.table's [ mechanism is effectively

TABLE2[TABLE1, on = .(...)]
RHS[LHS, on = .(...)]

(meaning that the right-hand-side table is actually the first table from left to right ...).

The names in the output of non-equi joins are preserved from the RHS; see that dt_taken is not found. However, the values of those startdate and enddate columns are from dt_taken.

Because of this, I've often found that the simplest way for me to wrap my brain around the renaming and values and such is, when I'm not certain, to copy a join column into a new column, join using that column, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times missing something and thinking it was not what I had thought.
sqldf
This might be a little more direct if SQL seems more intuitive.
sqldf::sqldf(
  "select ob.*, ad.admission_id
   from observations ob
   left join admissions ad on ob.patient_id = ad.patient_id
     and ob.dt_taken between ad.startdate and ad.enddate")
# dt_taken patient_id observation value admission_id
# 1 2020-04-13 patient01 Heart rate 69 admission01
Data (already data.table with POSIXt; works just as well with sqldf, though regular data.frames will work just fine, too):
admissions <- setDT(structure(list(patient_id = "patient01", admission_id = "admission01", startdate = structure(1585713860, class = c("POSIXct", "POSIXt" ), tzone = ""), enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""), patient_id = "patient01", observation = "Heart rate", value = 69L), class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
(I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)
data.table conditional join overwrites columns used for condition- R
I think this is what you need...
df1[df2,
    `:=`(Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED),
    on = .(ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED)][]
output
# ID row_unique_identifier date_1 Date_2A Date_2B Date_2B_EXTENDED
# 1: 1 1 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 2: 1 2 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 3: 1 3 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 4: 1 4 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 5: 1 5 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 6: 1 6 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 7: 1 7 2016-11-14 2016-11-14 2016-11-14 2016-11-20
# 8: 1 8 2017-11-02 <NA> <NA> <NA>
# 9: 1 9 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 10: 1 10 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 11: 1 11 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 12: 1 12 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 13: 1 13 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 14: 1 14 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 15: 1 15 2017-11-17 2017-11-17 2017-11-17 2017-11-23
# 16: 1 16 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 17: 1 17 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 18: 1 18 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 19: 1 19 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 20: 1 20 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 21: 1 21 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 22: 1 22 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 23: 1 23 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 24: 1 24 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 25: 1 25 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 26: 1 26 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 27: 1 27 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 28: 1 28 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 29: 1 29 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 30: 1 30 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 31: 1 31 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 32: 1 32 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 33: 1 33 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 34: 1 34 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 35: 1 35 2018-12-06 2018-12-06 2018-12-07 2018-12-13
# 36: 1 36 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 37: 1 37 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 38: 1 38 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 39: 1 39 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 40: 1 40 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# 41: 1 41 2018-12-07 2018-12-06 2018-12-07 2018-12-13
# ID row_unique_identifier date_1 Date_2A Date_2B Date_2B_EXTENDED
Conditional join in data.table?
I suggest using non-equi joins combined with mult = "last" (in order to capture only the most recent EndDate):
dtgrouped2[, c("Amount1", "Amount2") :=  # assign the result below to new columns in dtgrouped2
  dt2[dtgrouped2,                        # join
      .(Amount1, Amount2),               # get the columns you need
      on = .(Unique1 = Unique,           # join conditions
             StartDate < MonthNo,
             EndDate >= MonthNo),
      mult = "last"]]                    # always get the latest EndDate
dtgrouped2
# MonthNo Unique Total Amount1 Amount2
# 1: 1 AAA 10 7 0
# 2: 1 BBB 0 NA NA
# 3: 2 CCC 3 NA NA
# 4: 2 DDD 0 NA NA
# 5: 3 AAA 0 3 2
# 6: 3 BBB 35 NA NA
# 7: 4 CCC 15 NA NA
# 8: 4 AAA 0 3 2
# 9: 5 BBB 60 NA NA
# 10: 5 CCC 0 NA NA
# 11: 6 DDD 100 NA NA
# 12: 6 AAA 0 NA NA
The reason that you need to join dt2[dtgrouped2] (and not the other way around) is that you want to join dt2 for each possible value in dtgrouped2, hence allowing multiple values in dt2 to be joined to dtgrouped2.
Permutations with data.table join
You can eliminate the cases where product_1 and product_2 are equal after joining:
df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]
customer_id product_1 i.product_1
1: 1 a b
2: 1 a c
3: 1 b a
4: 1 b c
5: 1 c a
6: 1 c b
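The input df is not shown above; a presumed input that reproduces this output is:

```r
library(data.table)

## Presumed input: one customer with three products
df <- data.table(customer_id = 1, product_1 = c("a", "b", "c"))

## Self-join on customer_id produces all ordered pairs; drop the self-pairs
df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]
#    customer_id product_1 i.product_1
# 1:           1         a           b
# 2:           1         a           c
# 3:           1         b           a
# 4:           1         b           c
# 5:           1         c           a
# 6:           1         c           b
```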
Roll join gives NA's in data.table
When performing an X[Y] join in data.table, what you are basically doing is, for each value in Y, trying to find a value in X. Hence, the resulting join will have the length of the Y table. In your case, you are trying to find a value in Limits for each value in Usage, getting a vector of length 7. Hence, you probably should join the other way around and then store the result back into Limits:
Limits[Usage,
       oldLimit,
       on = .(costcenter = cc, featureId = feature, vendorId = vendor, date = startDate),
       roll = TRUE]
## [1] 6 6 6 6 5 5 5
As a side note, for very (and sometimes not so) simple cases you could just use findInterval.
setorder(Limits, date)[findInterval(Usage$startDate, date), oldLimit]
## [1] 6 6 6 6 5 5 5
It is a very efficient function that has some caveats, though:
- You need to sort the intervals vector first.
- You can't set rolling windows as easily as you would in data.table (e.g. roll = 2 instead of just roll = TRUE).
- And probably the biggest disadvantage is that it will be tricky to perform a rolling join on several variables at once (without involving loops), as you would easily do with data.table.
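A minimal sketch of findInterval as a rolling lookup; the data here is assumed, since the real Limits/Usage tables are not shown above:

```r
## Assumed data: two limit changes and two usage timestamps
limits <- data.frame(date     = as.Date(c("2020-01-01", "2020-06-01")),
                     oldLimit = c(6, 5))
usage_dates <- as.Date(c("2020-02-01", "2020-07-01"))

## findInterval() returns, for each usage date, the index of the most
## recent limit date at or before it (limits must be sorted by date)
limits$oldLimit[findInterval(usage_dates, limits$date)]
# [1] 6 5
```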