Rolling Joins Data.Table in R

That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)?" Did you find the following in ?data.table, and does it help?

roll Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.

rolltolast Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.
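
To make the quoted description of roll concrete, here is a minimal sketch on made-up data. The table DT, its columns and the dates are assumptions for illustration only, not taken from the question. It uses the CJ idiom with an explicit on clause. The last call uses rollends, which to my knowledge is what current data.table versions offer in place of rolltolast; check ?data.table for your installed version.

```r
library(data.table)

# Hypothetical toy data: irregular observations per id
DT <- data.table(
  id    = c("a", "a", "b"),
  date  = as.IDate(c("2020-01-01", "2020-01-05", "2020-01-03")),
  value = c(1, 2, 3)
)
setkey(DT, id, date)

ids <- c("a", "b")
dts <- as.IDate(c("2020-01-02", "2020-01-06"))

# LOCF: each (id, date) pair picks up the prevailing value in DT.
# "b" on 2020-01-02 precedes b's first observation, so its value is NA.
locf <- DT[CJ(id = ids, date = dts), on = .(id, date), roll = TRUE]

# Assumed counterpart of rolltolast: do not roll past the last observation
no_ends <- DT[CJ(id = ids, date = dts), on = .(id, date),
              roll = TRUE, rollends = FALSE]

print(locf)
print(no_ends)
```

In this sketch only the value that falls in a gap ("a" on 2020-01-02) survives when rollends = FALSE; values after the last observation of a group become NA.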

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base?" That's quite a long answer so I won't paste it here.

How to do a data.table rolling join?

Instead of a rolling join, you may want to use an overlap join with the foverlaps function of data.table:

# create an interval in the 'companies' datatable
companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
# create a second date in the 'dividends' datatable
dividends[, Date2 := divDate]

# set the keys for the two data.tables
setkey(companies, Sedol, start, end)
setkey(dividends, Sedol, divDate, Date2)

# create a vector of column names which can be removed afterwards
deletecols <- c("Date2","start","end")

# perform the overlap join and remove the helper columns
res <- foverlaps(companies, dividends)[, (deletecols) := NULL]

The result:

> res
     Sedol DivID    divDate   DivAmnt companyID   compDate    MktCap
 1: 7A662B    NA       <NA>        NA         6 2005-03-31  61.21061
 2: 7A662B     5 2005-06-29 0.7772631         7 2005-06-30  66.92951
 3: 7A662B     6 2005-06-30 1.1815343         7 2005-06-30  66.92951
 4: 7A662B    NA       <NA>        NA         8 2005-09-30  78.33914
 5: 7A662B    NA       <NA>        NA         9 2005-12-31  88.92473
 6: 7A662B    NA       <NA>        NA        10 2006-03-31  87.85067
 7: 91772E     2 2005-01-13 0.2964291         1 2005-03-31 105.19249
 8: 91772E     3 2005-01-29 0.8472649         1 2005-03-31 105.19249
 9: 91772E    NA       <NA>        NA         2 2005-06-30 108.74579
10: 91772E     4 2005-10-01 1.2467408         3 2005-09-30 113.42261
11: 91772E    NA       <NA>        NA         4 2005-12-31 120.04491
12: 91772E    NA       <NA>        NA         5 2006-03-31 124.35588

Since then, the data.table authors have introduced non-equi joins (v1.9.8), which can also solve this problem. Using a non-equi join you just need:

companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
dividends[companies, on = .(Sedol, divDate >= start, divDate <= end)]

to get the intended result.
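
One thing to watch with the non-equi join: in the default result, the columns named after divDate hold the interval bounds from companies rather than the actual dividend dates. The sketch below uses made-up toy tables (same column names as above, but the values are assumptions) and plain integer day arithmetic instead of lubridate, and pulls the real dividend date with the x. prefix in j.

```r
library(data.table)

# Hypothetical toy data, not the data from the question
companies <- data.table(
  Sedol    = c("A", "A"),
  compDate = as.Date(c("2005-03-31", "2005-06-30"))
)
dividends <- data.table(
  Sedol   = c("A", "A"),
  divDate = as.Date(c("2005-06-29", "2005-12-01")),
  DivAmnt = c(0.5, 0.7)
)

# Window of 90 days before to 15 days after each company date
companies[, `:=`(start = compDate - 90, end = compDate + 15)]

# x.divDate refers to the divDate column of dividends (the x table),
# so the result shows the dividend date itself, not the window bounds
res <- dividends[companies,
                 .(Sedol, compDate, divDate = x.divDate, DivAmnt),
                 on = .(Sedol, divDate >= start, divDate <= end)]
print(res)
```

The first company date has no dividend inside its window and therefore gets NA; the second picks up the dividend of 2005-06-29.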

Used data (the same as in the question, but without the creation of the keys):

library(data.table)
library(lubridate)  # provides days() and months() used in the date arithmetic

set.seed(1337)
companies <- data.table(companyID = 1:10, Sedol = rep(c("91772E", "7A662B"), each = 5),
                        compDate = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
                        MktCap = c(100 + cumsum(rnorm(5,5)), 50 + cumsum(rnorm(5,1,5))))
dividends <- data.table(DivID = 1:7, Sedol = c(rep('91772E', each = 4), rep('7A662B', each = 3)),
                        divDate = as.Date(c('2004-11-19','2005-01-13','2005-01-29','2005-10-01','2005-06-29','2005-06-30','2006-04-17')),
                        DivAmnt = rnorm(7, .8, .3))

Rolling join two data.tables with date in R

We may use a non-equi join:

dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]

rolling join two time series in data.table

You could use merge with all = TRUE and setnafill with type = "locf":

setnafill(merge(dt1, dt2, all = TRUE), type = "locf")[]

Key: <k>
       k   v.x   v.y
   <num> <num> <num>
1:     2    NA    11
2:     3    10    11
3:     4    10    13
4:     5    15    13
5:     6    15     6
6:     7     9     6
7:     9     7     6
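
Since the input tables are not shown here, the following is a minimal sketch with made-up dt1 and dt2 (the values are assumptions, so the output differs from the one above). It shows the two steps separately: a full outer merge on k, then filling the NAs by carrying the last observation forward. Note that setnafill modifies the table by reference and, as far as I know, only handles numeric columns.

```r
library(data.table)

# Hypothetical inputs: two series observed at different k
dt1 <- data.table(k = c(1, 3, 5), v = c(10, 20, 30), key = "k")
dt2 <- data.table(k = c(2, 3, 6), v = c(1, 2, 3), key = "k")

# Full outer join: every k from either table, with NA where a series is missing
merged <- merge(dt1, dt2, by = "k", all = TRUE)

# Carry the last observation forward in every column (by reference)
setnafill(merged, type = "locf")
print(merged)
```

The leading NA in v.y stays NA because there is no earlier observation to carry forward.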

Rolling join grouped by a second variable in data.table

You were almost there.

You can join on multiple columns simultaneously. So, in addition to "Date", you can include "Field" in the on clause. But please note the description of the roll argument in ?data.table:

Rolling joins apply to the last join column

Thus, for "Date" to be used for the rolling join, specify it as the last variable in on:

library(data.table)
d1[d2, roll = "nearest", on = .(Field, Date)]

To make the result easier to verify, it can be ordered:

d1[d2, roll = "nearest", on = .(Field, Date)][order(Field, Date)]

    Field       Date  NlbsAcre      TotN
 1:   12S 2016-05-24        NA 208.62194
 2:   12S 2016-05-27        NA 172.57658
 3:   12S 2016-07-31        NA 318.97092
 4:   12S 2016-08-18        NA 428.54011
 5:   12S 2016-08-29        NA 393.81545
 6:   12S 2017-03-13 44728.184 145.15091
 7:   12S 2017-03-16 44728.184 128.14334
 8:   12S 2017-08-01 12621.083 132.72365
 9:   12S 2017-08-04 12621.083 422.63032
10:   12S 2017-08-14 12621.083 337.91388
11:   12S 2017-10-04 22162.203 692.15276
12:  19-1 2016-05-01 12630.923 476.17492
13:  19-1 2016-08-15 12630.923 110.70600
14:  19-1 2016-09-10 12630.923 215.88105
15:  19-1 2016-09-19 12630.923 224.68906
16:  19-1 2016-12-16 12630.923 338.59349
17:  19-1 2017-01-13 12630.923 305.35394
18:  19-1 2017-03-27 12630.923 435.04925
19:  19-1 2017-05-30 12630.923 818.80997
20:     6 2016-05-05        NA 102.53240
21:     6 2016-06-14        NA 149.06045
22:     6 2016-06-29        NA 125.82803
23:     6 2016-06-29        NA 125.82803
24:     6 2016-07-11        NA  79.24480
25:     6 2016-07-25        NA  62.24449
26:     6 2016-08-25        NA  75.77014
27:     6 2017-01-03  2014.772  47.49660
28:     6 2017-01-12  2014.772  45.53730
29:     6 2017-01-17  2014.772  43.92222
30:     6 2017-02-11  3082.318  21.96791
31:     6 2017-03-19  2477.083  21.39367
32:     6 2017-04-17  2427.536  79.03807
33:     6 2017-07-12        NA 103.52417
34:     6 2017-07-17        NA  65.53112
35:     6 2017-09-06        NA  47.40618
36:     7 2016-06-02        NA 147.49353
37:     7 2016-07-11        NA  59.26973
38:     7 2016-08-04        NA  72.62146
39:     7 2016-08-30        NA  58.27003
40:     7 2016-08-30        NA  58.27003
41:     7 2016-10-30        NA  73.88811
42:     7 2017-02-11  2279.609  21.07551
43:     7 2017-02-22  2279.609  19.92023
44:     7 2017-03-19 15842.916  31.71433
45:     7 2017-05-17        NA  44.96872
46:     7 2017-07-17        NA  58.53364
47:   W62 2016-05-05 16764.975  96.72854
48:   W62 2016-05-31 16764.975  72.96954
49:   W62 2016-08-31 16764.975  86.33588
50:   W62 2016-12-05 16764.975  94.19370
51:   W62 2017-01-02 18874.656 119.39040
52:   W62 2017-02-22 18874.656  75.46591
    Field       Date  NlbsAcre      TotN
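
For a version that can be run without the original data, here is a small sketch with made-up d1 and d2 (column names as above, values are assumptions). Field is matched exactly and Date, as the last column in on, is rolled to the nearest date within each Field.

```r
library(data.table)

# Hypothetical toy data
d1 <- data.table(
  Field    = c("A", "A", "B"),
  Date     = as.IDate(c("2017-01-01", "2017-03-01", "2017-02-01")),
  NlbsAcre = c(100, 200, 300)
)
d2 <- data.table(
  Field = c("A", "A", "B", "C"),
  Date  = as.IDate(c("2017-01-10", "2017-02-25", "2017-06-01", "2017-01-01")),
  TotN  = c(1, 2, 3, 4)
)

# Exact match on Field, nearest match on Date within each Field
res <- d1[d2, roll = "nearest", on = .(Field, Date)]
print(res)
```

Field "C" does not occur in d1, so its NlbsAcre is NA; the Date column of the result holds the dates from d2.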

Rolling join in Data.Table by two variables without creating duplicates

First of all, you could use unique instead of distinct
(the latter presumably from dplyr; you don't specify)
to avoid coercing the data table to a data frame.

You were pretty close,
but you need to switch the tables in the join,
i.e. something like df2[df1],
so that the rows from df1 are used as search keys,
and then you can use mult to remove duplicates.

Here's one way to do what you want with a non-equi join:

setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed.
The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns.
Also note that the syntax used is taken from this answer
(the version with .SD instead of df1 doesn't work in this case;
I don't know why).
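
A runnable sketch of the approach above, on made-up data (the timestamps and the single stop_id are assumptions, not the asker's data), could look like this. Two observed departures fall within one hour of the 10:00 schedule, and mult = "first" keeps only the earliest one, so no duplicate rows are created.

```r
library(data.table)

# Hypothetical scheduled (df1) and observed (df2) departures
df1 <- data.table(
  stop_id   = 1L,
  departure = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 12:00:00"),
                         tz = "UTC")
)
df2 <- data.table(
  stop_id   = 1L,
  departure = as.POSIXct(c("2020-01-01 10:05:00", "2020-01-01 10:20:00",
                           "2020-01-01 14:00:00"), tz = "UTC")
)
setkey(df1, departure)
setkey(df2, departure)

# Upper bound of the allowed window: scheduled time plus one hour
df1[, max_departure := departure + as.difftime(1, units = "hours")]

# For each scheduled row, take the first observed departure in the window
df1[, observed_departure := df2[df1, x.departure,
                                on = .(stop_id, departure >= departure,
                                       departure <= max_departure),
                                mult = "first"]]
df1[, max_departure := NULL]
print(df1)
```

The 12:00 schedule has no observation within its hour and gets NA.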

EDIT: based on the comments,
it occurs to me that when you say "duplicated",
you might be referring to something different.
Say you have planned departures at 10 and 10:30,
but the one at 10 never takes place,
and an observed departure is 10:31.
Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30,
and cannot be used for the one at 10?
If that's the case,
perhaps this will work:

setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")

  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
                          type = "lead")

  invalid_max <- max_departure >= next_departure

  max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
      ][, max_departure := NULL]

The max_dep helper checks,
for each stop and scheduled departure,
what would be the next scheduled departure,
and sets max_departure as "next minus 1 second" if the next departure is within one hour.

The other solution wouldn't work for this because,
as long as an observed departure falls within one hour of the scheduled one,
it is a valid option.
In my example that means 10:31 would be valid for both 10:30 and 10.
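
The 10:00 / 10:30 / 10:31 scenario can be checked with a small sketch. The data below is made up to match that description; max_dep is repeated so the block is self-contained.

```r
library(data.table)

# Hypothetical data: two scheduled departures, one observed at 10:31
df1 <- data.table(
  stop_id   = 1L,
  departure = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:30:00"),
                         tz = "UTC")
)
df2 <- data.table(
  stop_id   = 1L,
  departure = as.POSIXct("2020-01-01 10:31:00", tz = "UTC")
)
setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")
  # Next scheduled departure; the last row gets a value just past its own window
  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] +
                            as.difftime(1, units = "secs"),
                          type = "lead")
  # Cap the window one second before the next scheduled departure
  invalid_max <- max_departure >= next_departure
  max_departure[invalid_max] <- next_departure[invalid_max] -
    as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"]
df1[, observed_departure := df2[df1, x.departure,
                                on = .(stop_id, departure >= departure,
                                       departure <= max_departure),
                                mult = "first"]]
df1[, max_departure := NULL]
print(df1)
```

Here the window for the 10:00 schedule is capped at 10:29:59, so the 10:31 observation is assigned only to the 10:30 schedule.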

R rolling join two data.tables with error margin on join

A data.table answer has been given here by user Uwe:

https://stackoverflow.com/a/62321710/12079387

How to perform an inner roll join with data.table?

To replicate an inner join you can use the nomatch argument.

df_test_2[df_test_1, roll = -Inf, nomatch=0]
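
As a self-contained illustration on made-up tables (the names x_dt and lookup and their values are assumptions), roll = -Inf carries the next observation backward, and nomatch = 0 drops the rows of the lookup table that still have no match. The data.table documentation also describes nomatch = NULL for the same purpose.

```r
library(data.table)

# Hypothetical data: observations at time 5 and 10
x_dt   <- data.table(time = c(5, 10), vx = c("a", "b"))
lookup <- data.table(time = c(3, 7, 12), vi = 1:3)

# Next observation carried backward; unmatched lookup rows are dropped
res <- x_dt[lookup, on = "time", roll = -Inf, nomatch = 0]
print(res)
```

The lookup row at time 12 lies after the last observation, so nothing can be rolled backward onto it and it is removed from the result.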
