What does < stand for in data.table joins with on=

When doing a non-equi join like X[Y, on = .(A < A)], data.table returns the A column from Y (the i data.table).

To get the desired result, you could do:

X[Y, on = .(A < A), .(A = x.A, B)]

which gives:

   A B
1: 1 1
2: 2 1
3: 3 1

In the next release, data.table will return both A columns; see the discussion in the data.table issue tracker.
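A minimal sketch of the situation, with made-up tables chosen to reproduce the output shown above:

```r
library(data.table)

# Hypothetical data: X holds the rows we filter, Y supplies the threshold
X <- data.table(A = 1:4, B = 1)
Y <- data.table(A = 4)

# By default the returned 'A' column takes its values from Y (the i table)
X[Y, on = .(A < A)]

# Prefixing with x. recovers X's own A values
X[Y, on = .(A < A), .(A = x.A, B)]
#    A B
# 1: 1 1
# 2: 2 1
# 3: 3 1
```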

Intuition for data.table join syntax

Right Join is the default and the new object (X) is right joined?

The reason for that is consistency with the base R way of subsetting vectors and matrices; there is an entry about it in the data.table FAQ.
Notice that when you use := during a join you get a left join (all rows of x are kept). There is an issue discussing consistency of [ merges with base R, afair #1615.
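A small sketch of the two behaviours, with made-up tables:

```r
library(data.table)

X <- data.table(id = 1:3, x = c("a", "b", "c"))
Y <- data.table(id = 2:4, y = c("B", "C", "D"))

# X[Y] returns one row per row of Y: X is right-joined onto Y
X[Y, on = "id"]

# With := the join updates X in place, keeping all rows of X,
# which behaves like a left join of Y onto X
X[Y, on = "id", y := i.y]
X
```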

R Data.Table Join on Conditionals

It's a bit ugly but works:

library(data.table)
library(sqldf)

dt <- data.table(num  = c(1, 2, 3, 4, 5, 6),
                 char = c('A', 'A', 'A', 'B', 'B', 'B'),
                 bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_two <- data.table(num  = c(6, 1, 5, 2, 4, 3),
                     char = c('A', 'A', 'A', 'B', 'B', 'B'),
                     bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_out_sql <- sqldf('
  select dtone.num,
         dtone.char,
         dtone.bool,
         SUM(dttwo.num) as SUM,
         MIN(dttwo.num) as MIN
  from dt as dtone
  INNER join dt_two as dttwo on
    (dtone.char = dttwo.char) and
    (dtone.num >= dttwo.num OR dtone.bool)
  GROUP BY dtone.num, dtone.char, dtone.bool
')

setDT(dt_out_sql)

setkey(dt, char)
setkey(dt_two, char)

dt_out_r <- dt[dt_two,
               list(dtone.num = num,
                    dttwo.num = i.num,
                    char,
                    bool),
               nomatch = 0, allow.cartesian = TRUE
               ][dtone.num >= dttwo.num | bool,
                 list(SUM = sum(dttwo.num),
                      MIN = min(dttwo.num)),
                 by = list(num = dtone.num,
                           char,
                           bool)]

setkey(dt_out_r, num, char, bool)

all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)

data.table join is hard to understand

This is a non-equi join:

  • it matches on the same x in both tables: b and c in this case
  • it keeps only the rows of DT where DT$y <= X$foo

Perhaps easier to understand like this:

DT[X, .(x.x, x.y, x.v, i.x, i.v, i.foo, `y < foo` = x.y < i.foo), on = .(x = x, y <= foo)]

   x.x x.y x.v i.x i.v i.foo y < foo
1:   c   1   7   c   8     4    TRUE
2:   c   3   8   c   8     4    TRUE
3:   b   1   1   b   7     2    TRUE

Where:

  • x. denotes the columns of the LHS table (DT)
  • i. denotes the columns of the RHS table (X); to remember i., think of DT[i, j, by]
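The two tables themselves are not shown above; a plausible reconstruction (it matches the output shown, and mirrors the classic data.table introduction example) is:

```r
library(data.table)

# Reconstructed input tables (assumed, not from the original post)
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
X  <- data.table(x = c("c", "b"), v = 8:7, foo = c(4, 2))

# For each row of X, keep the rows of DT with the same x and y <= foo
DT[X, .(x.x, x.y, x.v, i.x, i.v, i.foo), on = .(x = x, y <= foo)]
```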

Efficiently joining two data.tables in R (or SQL tables) on date ranges?

data.table

For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.

observations
#              dt_taken patient_id observation value
# 1 2020-04-13 00:00:00  patient01  Heart rate    69
admissions
#   patient_id admission_id           startdate             enddate
# 1  patient01  admission01 2020-04-01 00:04:20 2020-05-01 00:23:59

### convert to data.table
setDT(observations)
setDT(admissions)

### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
dates <- c("startdate", "enddate")
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]

And the join.

admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
#    patient_id admission_id  startdate    enddate observation value
#        <char>       <char>     <POSc>     <POSc>      <char> <int>
# 1:  patient01  admission01 2020-04-13 2020-04-13  Heart rate    69

Two things that I believe are noteworthy:

  1. In SQL (and similarly in other join-friendly languages), a join is often written as

    select ...
    from TABLE1 left join TABLE2 ...

    suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly geared towards a left join since that's all that data.table::[ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)

    From this, data.table::['s mechanism is effectively

    TABLE2[TABLE1, on = .(...)]
    RHS[LHS, on = .(...)]

    (Meaning that the right-hand-side table is actually the first table from left to right...)

  2. The names in the output of non-equi joins are preserved from the RHS table; note that dt_taken is not among them. However, the values of the startdate and enddate columns are taken from dt_taken.

    Because of this, the simplest way I've found to wrap my brain around the renaming and values is: when I'm not certain, I copy a join column into a new column, join using that copy, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times thinking the output was something it was not.
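That copy-a-join-column trick, sketched on the tables above (the dt_taken_copy name is just for illustration; the tables are re-created here so the snippet stands alone):

```r
library(data.table)

admissions   <- data.table(patient_id = "patient01", admission_id = "admission01",
                           startdate  = as.POSIXct("2020-04-01 00:04:20", tz = "UTC"),
                           enddate    = as.POSIXct("2020-05-01 00:23:59", tz = "UTC"))
observations <- data.table(dt_taken   = as.POSIXct("2020-04-13 00:00:00", tz = "UTC"),
                           patient_id = "patient01", observation = "Heart rate", value = 69L)

# Join on a disposable copy of dt_taken; the join "consumes" the copy,
# so the original dt_taken survives in the output with its true values
observations[, dt_taken_copy := dt_taken]
out <- admissions[observations,
                  on = .(patient_id, startdate <= dt_taken_copy, enddate >= dt_taken_copy)]
out$dt_taken                             # intact
observations[, dt_taken_copy := NULL]    # tidy up the helper column
```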

sqldf

This might be a little more direct if SQL seems more intuitive.

sqldf::sqldf(
  "select ob.*, ad.admission_id
   from observations ob
   left join admissions ad on ob.patient_id = ad.patient_id
     and ob.dt_taken between ad.startdate and ad.enddate")
#     dt_taken patient_id observation value admission_id
# 1 2020-04-13  patient01  Heart rate    69  admission01

Data (already data.table with POSIXt columns; regular data.frames work just as well with sqldf, too):

admissions <- setDT(structure(list(
  patient_id = "patient01", admission_id = "admission01",
  startdate = structure(1585713860, class = c("POSIXct", "POSIXt"), tzone = ""),
  enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")),
  class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(
  dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""),
  patient_id = "patient01", observation = "Heart rate", value = 69L),
  class = c("data.table", "data.frame"), row.names = c(NA, -1L)))

(I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)

data.table conditional join overwrites columns used for condition- R

I think this is what you need...

df1[df2,
    `:=`(Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED),
    on = .(ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED)][]

output

#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
#  1:  1                     1 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  2:  1                     2 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  3:  1                     3 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  4:  1                     4 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  5:  1                     5 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  6:  1                     6 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  7:  1                     7 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  8:  1                     8 2017-11-02       <NA>       <NA>             <NA>
#  9:  1                     9 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 10:  1                    10 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 11:  1                    11 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 12:  1                    12 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 13:  1                    13 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 14:  1                    14 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 15:  1                    15 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 16:  1                    16 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 17:  1                    17 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 18:  1                    18 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 19:  1                    19 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 20:  1                    20 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 21:  1                    21 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 22:  1                    22 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 23:  1                    23 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 24:  1                    24 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 25:  1                    25 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 26:  1                    26 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 27:  1                    27 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 28:  1                    28 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 29:  1                    29 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 30:  1                    30 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 31:  1                    31 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 32:  1                    32 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 33:  1                    33 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 34:  1                    34 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 35:  1                    35 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 36:  1                    36 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 37:  1                    37 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 38:  1                    38 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 39:  1                    39 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 40:  1                    40 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 41:  1                    41 2018-12-07 2018-12-06 2018-12-07       2018-12-13
#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED

Conditional join in data.table?

I suggest using non-equi joins combined with mult = "last" (in order to capture only the most recent EndDate):

dtgrouped2[, c("Amount1", "Amount2") :=        # assign the result to new columns in dtgrouped2
             dt2[dtgrouped2,                   # join
                 .(Amount1, Amount2),          # get the columns you need
                 on = .(Unique1 = Unique,      # join conditions
                        StartDate < MonthNo,
                        EndDate >= MonthNo),
                 mult = "last"]]               # always get the latest EndDate
dtgrouped2

#     MonthNo Unique Total Amount1 Amount2
#  1:       1    AAA    10       7       0
#  2:       1    BBB     0      NA      NA
#  3:       2    CCC     3      NA      NA
#  4:       2    DDD     0      NA      NA
#  5:       3    AAA     0       3       2
#  6:       3    BBB    35      NA      NA
#  7:       4    CCC    15      NA      NA
#  8:       4    AAA     0       3       2
#  9:       5    BBB    60      NA      NA
# 10:       5    CCC     0      NA      NA
# 11:       6    DDD   100      NA      NA
# 12:       6    AAA     0      NA      NA

The reason you need to join as dt2[dtgrouped2] (and not the other way around) is that you want to look up dt2 for each row in dtgrouped2, hence allowing multiple rows of dt2 to be matched against dtgrouped2.
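The direction of a data.table join can be sketched with two tiny made-up tables:

```r
library(data.table)

# In a[b], each row of b is looked up in a; when a has multiple matches,
# the matching row of b is repeated (cf. mult= and allow.cartesian=)
a <- data.table(k = c("x", "x"), v = 1:2)   # two matches for "x"
b <- data.table(k = "x")

a[b, on = "k"]   # 2 rows: both matches from a, for b's single row
b[a, on = "k"]   # 2 rows: one lookup per row of a
```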

Permutations with data.table join

You can eliminate the cases where product_1 and its joined counterpart (i.product_1) are equal after joining:

df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]

   customer_id product_1 i.product_1
1:           1         a           b
2:           1         a           c
3:           1         b           a
4:           1         b           c
5:           1         c           a
6:           1         c           b
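The input table is not shown; a reconstruction consistent with the output above:

```r
library(data.table)

# Assumed input: one customer with three products
df <- data.table(customer_id = 1, product_1 = c("a", "b", "c"))

# Self-join on customer_id gives all ordered pairs; drop the diagonal
df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]
```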

Roll join gives NA's in data.table

When performing an X[Y] join in data.table, you are basically looking up a value in X for each row of Y, so the resulting join has as many rows as Y. In your case, you are trying to find a value in Limits for each value in Usage, to get a vector of length 7. Hence, you probably should join the other way around and then store it back into Limits:

Limits[Usage,
       oldLimit,
       on = .(costcenter = cc, featureId = feature, vendorId = vendor, date = startDate),
       roll = TRUE]
## [1] 6 6 6 6 5 5 5

As a side note, for very (and some times not so) simple cases you could just use findInterval.

setorder(Limits, date)[findInterval(Usage$startDate, date), oldLimit]
## [1] 6 6 6 6 5 5 5

It is a very efficient function, though it has some caveats:

  • You need to sort the intervals vector first.
  • You can't set rolling windows as easily as in data.table (e.g. roll = 2 instead of just roll = TRUE).
  • Probably the biggest disadvantage: it is tricky to perform a rolling join on several variables at once (without involving loops), which data.table does easily.
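A base-R sketch of what findInterval does, with made-up vectors:

```r
# For each value of x, findInterval returns the index of the last
# breakpoint that is <= the value -- the essence of a rolling lookup
breaks <- c(1, 5, 10)     # must be sorted first (caveat above)
x      <- c(2, 5, 7, 12)
findInterval(x, breaks)
# [1] 1 2 2 3
```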

