What does < stand for in data.table joins with on=

When doing a non-equi join like X[Y, on = .(A < A)], data.table returns the A column from Y (the i data.table).

To get the desired result, you could do:

X[Y, on = .(A < A), .(A = x.A, B)]

which gives:

   A B
1: 1 1
2: 2 1
3: 3 1

In the next release, data.table will return both A columns; see the discussion in the data.table issue tracker.
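A minimal sketch of the situation, with made-up tables chosen to reproduce the output shown above:

```r
library(data.table)

# Hypothetical data: X holds the rows we filter, Y supplies the threshold
X <- data.table(A = 1:4, B = 1)
Y <- data.table(A = 4)

# By default the returned 'A' column takes its values from Y (the i table)
X[Y, on = .(A < A)]

# Prefixing with x. recovers X's own A values
X[Y, on = .(A < A), .(A = x.A, B)]
#    A B
# 1: 1 1
# 2: 2 1
# 3: 3 1
```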

Intuition for data.table join syntax

Right Join is the default and the new object (X) is right joined?

The reason for that is consistency with the base R way of subsetting vectors and matrices; there is an entry about it in the data.table FAQ.
Notice that when you use := during a join you get a left join (all rows of x are kept). There is an issue discussing consistency of [ merges with base R, afair #1615.
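A small sketch of the two behaviours, with made-up tables:

```r
library(data.table)

X <- data.table(id = 1:3, x = c("a", "b", "c"))
Y <- data.table(id = 2:4, y = c("B", "C", "D"))

# X[Y] returns one row per row of Y: X is right-joined onto Y
X[Y, on = "id"]

# With := the join updates X in place, keeping all rows of X,
# which behaves like a left join of Y onto X
X[Y, on = "id", y := i.y]
X
```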

R Data.Table Join on Conditionals

It's a bit ugly but works:

library(data.table)
library(sqldf)

dt <- data.table(num  = c(1, 2, 3, 4, 5, 6),
                 char = c('A', 'A', 'A', 'B', 'B', 'B'),
                 bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_two <- data.table(num  = c(6, 1, 5, 2, 4, 3),
                     char = c('A', 'A', 'A', 'B', 'B', 'B'),
                     bool = c(TRUE, FALSE, TRUE, FALSE, TRUE, FALSE))

dt_out_sql <- sqldf('
  select dtone.num,
         dtone.char,
         dtone.bool,
         SUM(dttwo.num) as SUM,
         MIN(dttwo.num) as MIN
  from dt as dtone
  INNER join dt_two as dttwo on
    (dtone.char = dttwo.char) and
    (dtone.num >= dttwo.num OR dtone.bool)
  GROUP BY dtone.num, dtone.char, dtone.bool
')

setDT(dt_out_sql)

setkey(dt, char)
setkey(dt_two, char)

dt_out_r <- dt[dt_two,
               list(dtone.num = num,
                    dttwo.num = i.num,
                    char,
                    bool),
               nomatch = 0, allow.cartesian = TRUE
               ][dtone.num >= dttwo.num | bool,
                 list(SUM = sum(dttwo.num),
                      MIN = min(dttwo.num)),
                 by = list(num = dtone.num,
                           char,
                           bool)]

setkey(dt_out_r, num, char, bool)

all.equal(dt_out_sql, dt_out_r, check.attributes = FALSE)

data.table join is hard to understand

This is a non-equi join:

  • it matches on the same x in both tables: b and c in this case
  • it keeps only the rows of DT where DT$y <= X$foo

Perhaps easier to understand like this:

DT[X, .(x.x, x.y, x.v, i.x, i.v, i.foo, `y < foo` = x.y < i.foo), on = .(x = x, y <= foo)]

   x.x x.y x.v i.x i.v i.foo y < foo
1:   c   1   7   c   8     4    TRUE
2:   c   3   8   c   8     4    TRUE
3:   b   1   1   b   7     2    TRUE

Where:

  • x. denotes the columns of the LHS table (DT)
  • i. denotes the columns of the RHS table (X); to remember i., think of DT[i, j, by]
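The two tables themselves are not shown above; a plausible reconstruction (it matches the output shown, and mirrors the classic data.table introduction example) is:

```r
library(data.table)

# Reconstructed input tables (assumed, not from the original post)
DT <- data.table(x = rep(c("b", "a", "c"), each = 3), y = c(1, 3, 6), v = 1:9)
X  <- data.table(x = c("c", "b"), v = 8:7, foo = c(4, 2))

# For each row of X, keep the rows of DT with the same x and y <= foo
DT[X, .(x.x, x.y, x.v, i.x, i.v, i.foo), on = .(x = x, y <= foo)]
```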

Efficiently joining two data.tables in R (or SQL tables) on date ranges?

data.table

For data.table, this is mostly a dupe of How to perform join over date ranges using data.table?, though that doesn't provide the RHS[LHS, on=.(..)] method.

observations
#              dt_taken patient_id observation value
# 1 2020-04-13 00:00:00  patient01  Heart rate    69
admissions
#   patient_id admission_id           startdate             enddate
# 1  patient01  admission01 2020-04-01 00:04:20 2020-05-01 00:23:59

### convert to data.table
setDT(observations)
setDT(admissions)

### we need proper 'POSIXt' objects
observations[, dt_taken := as.POSIXct(dt_taken)]
dates <- c("startdate", "enddate")
admissions[, (dates) := lapply(.SD, as.POSIXct), .SDcols = dates]

And the join.

admissions[observations, on = .(patient_id, startdate <= dt_taken, enddate >= dt_taken)]
#    patient_id admission_id  startdate    enddate observation value
#        <char>       <char>     <POSc>     <POSc>      <char> <int>
# 1:  patient01  admission01 2020-04-13 2020-04-13  Heart rate    69

Two things that I believe are noteworthy:

  1. In SQL (and similarly in other join-friendly languages), a join is often written as

    select ...
    from TABLE1 left join TABLE2 ...

    suggesting that TABLE1 is the LHS (left-hand side) and TABLE2 is the RHS table. (This is a gross generalization, mostly geared towards a left join since that's all that data.table::[ supports; for inner/outer/full joins, you'll need merge(.) or other external mechanisms. See How to join (merge) data frames (inner, outer, left, right) and https://stackoverflow.com/a/6188334/3358272 for more discussion on JOINs, etc.)

    From this, data.table::['s mechanism is effectively

    TABLE2[TABLE1, on = .(...)]
    RHS[LHS, on = .(...)]

    (Meaning that the right-hand-side table is actually the first table from left to right...)

  2. The names in the output of non-equi joins are preserved from the RHS table; note that dt_taken is not among them. However, the values of the startdate and enddate columns are taken from dt_taken.

    Because of this, the simplest way I've found to wrap my brain around the renaming and values is: when I'm not certain, I copy a join column into a new column, join using that copy, then delete it post-merge. It's sloppy and lazy, but I've caught myself too many times thinking the output was something it was not.
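That copy-a-join-column trick, sketched on the tables above (the dt_taken_copy name is just for illustration; the tables are re-created here so the snippet stands alone):

```r
library(data.table)

admissions   <- data.table(patient_id = "patient01", admission_id = "admission01",
                           startdate  = as.POSIXct("2020-04-01 00:04:20", tz = "UTC"),
                           enddate    = as.POSIXct("2020-05-01 00:23:59", tz = "UTC"))
observations <- data.table(dt_taken   = as.POSIXct("2020-04-13 00:00:00", tz = "UTC"),
                           patient_id = "patient01", observation = "Heart rate", value = 69L)

# Join on a disposable copy of dt_taken; the join "consumes" the copy,
# so the original dt_taken survives in the output with its true values
observations[, dt_taken_copy := dt_taken]
out <- admissions[observations,
                  on = .(patient_id, startdate <= dt_taken_copy, enddate >= dt_taken_copy)]
out$dt_taken                             # intact
observations[, dt_taken_copy := NULL]    # tidy up the helper column
```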

sqldf

This might be a little more direct if SQL seems more intuitive.

sqldf::sqldf(
  "select ob.*, ad.admission_id
   from observations ob
   left join admissions ad on ob.patient_id = ad.patient_id
     and ob.dt_taken between ad.startdate and ad.enddate")
#     dt_taken patient_id observation value admission_id
# 1 2020-04-13  patient01  Heart rate    69  admission01

Data (already data.table with POSIXt columns; regular data.frames work just as well with sqldf, too):

admissions <- setDT(structure(list(
  patient_id = "patient01", admission_id = "admission01",
  startdate = structure(1585713860, class = c("POSIXct", "POSIXt"), tzone = ""),
  enddate = structure(1588307039, class = c("POSIXct", "POSIXt"), tzone = "")),
  class = c("data.table", "data.frame"), row.names = c(NA, -1L)))
observations <- setDT(structure(list(
  dt_taken = structure(1586750400, class = c("POSIXct", "POSIXt"), tzone = ""),
  patient_id = "patient01", observation = "Heart rate", value = 69L),
  class = c("data.table", "data.frame"), row.names = c(NA, -1L)))

(I use setDT to repair the fact that we can't pass the .internal.selfref attribute here.)

data.table conditional join overwrites columns used for condition- R

I think this is what you need...

df1[df2,
    `:=`(Date_2A = i.Date_2A, Date_2B = i.Date_2B, Date_2B_EXTENDED = i.Date_2B_EXTENDED),
    on = .(ID, date_1 >= Date_2A, date_1 <= Date_2B_EXTENDED)][]

output

#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED
#  1:  1                     1 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  2:  1                     2 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  3:  1                     3 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  4:  1                     4 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  5:  1                     5 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  6:  1                     6 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  7:  1                     7 2016-11-14 2016-11-14 2016-11-14       2016-11-20
#  8:  1                     8 2017-11-02       <NA>       <NA>             <NA>
#  9:  1                     9 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 10:  1                    10 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 11:  1                    11 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 12:  1                    12 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 13:  1                    13 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 14:  1                    14 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 15:  1                    15 2017-11-17 2017-11-17 2017-11-17       2017-11-23
# 16:  1                    16 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 17:  1                    17 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 18:  1                    18 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 19:  1                    19 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 20:  1                    20 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 21:  1                    21 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 22:  1                    22 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 23:  1                    23 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 24:  1                    24 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 25:  1                    25 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 26:  1                    26 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 27:  1                    27 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 28:  1                    28 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 29:  1                    29 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 30:  1                    30 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 31:  1                    31 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 32:  1                    32 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 33:  1                    33 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 34:  1                    34 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 35:  1                    35 2018-12-06 2018-12-06 2018-12-07       2018-12-13
# 36:  1                    36 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 37:  1                    37 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 38:  1                    38 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 39:  1                    39 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 40:  1                    40 2018-12-07 2018-12-06 2018-12-07       2018-12-13
# 41:  1                    41 2018-12-07 2018-12-06 2018-12-07       2018-12-13
#     ID row_unique_identifier     date_1    Date_2A    Date_2B Date_2B_EXTENDED

Conditional join in data.table?

I suggest using non-equi joins combined with mult = "last" (in order to capture only the most recent EndDate):

dtgrouped2[, c("Amount1", "Amount2") :=        # assign the result to new columns in dtgrouped2
             dt2[dtgrouped2,                   # join
                 .(Amount1, Amount2),          # get the columns you need
                 on = .(Unique1 = Unique,      # join conditions
                        StartDate < MonthNo,
                        EndDate >= MonthNo),
                 mult = "last"]]               # always get the latest EndDate
dtgrouped2

#     MonthNo Unique Total Amount1 Amount2
#  1:       1    AAA    10       7       0
#  2:       1    BBB     0      NA      NA
#  3:       2    CCC     3      NA      NA
#  4:       2    DDD     0      NA      NA
#  5:       3    AAA     0       3       2
#  6:       3    BBB    35      NA      NA
#  7:       4    CCC    15      NA      NA
#  8:       4    AAA     0       3       2
#  9:       5    BBB    60      NA      NA
# 10:       5    CCC     0      NA      NA
# 11:       6    DDD   100      NA      NA
# 12:       6    AAA     0      NA      NA

The reason you need to join as dt2[dtgrouped2] (and not the other way around) is that you want to look up dt2 for each row in dtgrouped2, hence allowing multiple rows of dt2 to be matched against dtgrouped2.
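The direction of a data.table join can be sketched with two tiny made-up tables:

```r
library(data.table)

# In a[b], each row of b is looked up in a; when a has multiple matches,
# the matching row of b is repeated (cf. mult= and allow.cartesian=)
a <- data.table(k = c("x", "x"), v = 1:2)   # two matches for "x"
b <- data.table(k = "x")

a[b, on = "k"]   # 2 rows: both matches from a, for b's single row
b[a, on = "k"]   # 2 rows: one lookup per row of a
```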

Permutations with data.table join

You can eliminate the cases where product_1 and its joined counterpart (i.product_1) are equal after joining:

df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]

   customer_id product_1 i.product_1
1:           1         a           b
2:           1         a           c
3:           1         b           a
4:           1         b           c
5:           1         c           a
6:           1         c           b
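The input table is not shown; a reconstruction consistent with the output above:

```r
library(data.table)

# Assumed input: one customer with three products
df <- data.table(customer_id = 1, product_1 = c("a", "b", "c"))

# Self-join on customer_id gives all ordered pairs; drop the diagonal
df[df, on = .(customer_id = customer_id), allow.cartesian = TRUE
   ][product_1 != i.product_1
   ][order(product_1)]
```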

Roll join gives NA's in data.table

When performing an X[Y] join in data.table, you are basically looking up a value in X for each row of Y, so the resulting join has as many rows as Y. In your case, you are trying to find a value in Limits for each value in Usage, to get a vector of length 7. Hence, you probably should join the other way around and then store it back into Limits:

Limits[Usage,
       oldLimit,
       on = .(costcenter = cc, featureId = feature, vendorId = vendor, date = startDate),
       roll = TRUE]
## [1] 6 6 6 6 5 5 5

As a side note, for very (and some times not so) simple cases you could just use findInterval.

setorder(Limits, date)[findInterval(Usage$startDate, date), oldLimit]
## [1] 6 6 6 6 5 5 5

It is a very efficient function, though it has some caveats:

  • You need to sort the intervals vector first.
  • You can't set rolling windows as easily as in data.table (e.g. roll = 2 instead of just roll = TRUE).
  • Probably the biggest disadvantage: it is tricky to perform a rolling join on several variables at once (without involving loops), which data.table does easily.
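A base-R sketch of what findInterval does, with made-up vectors:

```r
# For each value of x, findInterval returns the index of the last
# breakpoint that is <= the value -- the essence of a rolling lookup
breaks <- c(1, 5, 10)     # must be sorted first (caveat above)
x      <- c(2, 5, 7, 12)
findInterval(x, breaks)
# [1] 1 2 2 3
```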

