How to Do a Data.Table Rolling Join

rolling joins data.table in R

That quote from the documentation appears to be from FAQ 1.12, "What is the difference between X[Y] and merge(X,Y)?". Did you find the following in ?data.table, and does it help?

roll Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.

rolltolast Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.

In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14, "Can you explain further why data.table is inspired by A[B] syntax in base?". That's quite a long answer, so I won't paste it here.

How to do a data.table rolling join?

Instead of a rolling join, you may want to use an overlap join with the foverlaps function of data.table:

# create an interval in the 'companies' data.table
companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
# create a second date column in the 'dividends' data.table
dividends[, Date2 := divDate]

# set the keys for the two data.tables
setkey(companies, Sedol, start, end)
setkey(dividends, Sedol, divDate, Date2)

# create a vector of column names which can be removed afterwards
deletecols <- c("Date2", "start", "end")

# perform the overlap join and remove the helper columns
res <- foverlaps(companies, dividends)[, (deletecols) := NULL]

The result:

> res
     Sedol DivID    divDate   DivAmnt companyID   compDate    MktCap
 1: 7A662B    NA       <NA>        NA         6 2005-03-31  61.21061
 2: 7A662B     5 2005-06-29 0.7772631         7 2005-06-30  66.92951
 3: 7A662B     6 2005-06-30 1.1815343         7 2005-06-30  66.92951
 4: 7A662B    NA       <NA>        NA         8 2005-09-30  78.33914
 5: 7A662B    NA       <NA>        NA         9 2005-12-31  88.92473
 6: 7A662B    NA       <NA>        NA        10 2006-03-31  87.85067
 7: 91772E     2 2005-01-13 0.2964291         1 2005-03-31 105.19249
 8: 91772E     3 2005-01-29 0.8472649         1 2005-03-31 105.19249
 9: 91772E    NA       <NA>        NA         2 2005-06-30 108.74579
10: 91772E     4 2005-10-01 1.2467408         3 2005-09-30 113.42261
11: 91772E    NA       <NA>        NA         4 2005-12-31 120.04491
12: 91772E    NA       <NA>        NA         5 2006-03-31 124.35588

In the meantime the data.table authors have introduced non-equi joins (v1.9.8), which you can also use to solve this problem. With a non-equi join you just need:

companies[, `:=` (start = compDate - days(90), end = compDate + days(15))]
dividends[companies, on = .(Sedol, divDate >= start, divDate <= end)]

to get the intended result.


Used data (the same as in the question, but without the creation of the keys):

library(data.table)
library(lubridate)  # days() and months() used here and above come from lubridate

set.seed(1337)
companies <- data.table(companyID = 1:10, Sedol = rep(c("91772E", "7A662B"), each = 5),
                        compDate = (as.Date("2005-04-01") + months(seq(0, 12, 3))) - days(1),
                        MktCap = c(100 + cumsum(rnorm(5, 5)), 50 + cumsum(rnorm(5, 1, 5))))
dividends <- data.table(DivID = 1:7, Sedol = c(rep("91772E", 4), rep("7A662B", 3)),
                        divDate = as.Date(c("2004-11-19", "2005-01-13", "2005-01-29", "2005-10-01",
                                            "2005-06-29", "2005-06-30", "2006-04-17")),
                        DivAmnt = rnorm(7, .8, .3))

How to perform an inner roll join with data.table?

To replicate an inner join, you can use the nomatch argument:

df_test_2[df_test_1, roll = -Inf, nomatch=0]
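Since df_test_1 and df_test_2 aren't shown here, a self-contained sketch with hypothetical tables:

library(data.table)

# hypothetical data, keyed on date
df_test_1 <- data.table(date = as.Date(c("2021-01-02", "2021-01-10", "2021-02-01")),
                        x = 1:3, key = "date")
df_test_2 <- data.table(date = as.Date(c("2021-01-05", "2021-01-20")),
                        y = c("a", "b"), key = "date")

# roll = -Inf carries the next observation backward (NOCB);
# nomatch = 0 drops the unmatched 2021-02-01 row, mimicking an inner join
df_test_2[df_test_1, roll = -Inf, nomatch = 0]
#          date y x
# 1: 2021-01-02 a 1
# 2: 2021-01-10 b 2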

See also:

rolling joins data.table in R

R - Data.Table Rolling Joins

R rolling join two data.tables with error margin on join

A data.table answer has been given here by user Uwe:

https://stackoverflow.com/a/62321710/12079387
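For flavour, here is one minimal way such a tolerance join can be written (a sketch with hypothetical data, not the method from the linked answer): take the nearest match, then invalidate matches farther away than the allowed margin.

library(data.table)

A <- data.table(t = c(1.00, 2.00, 5.00), a = 1:3)
B <- data.table(t = c(1.02, 4.20), b = c("x", "y"))
margin <- 0.1

# nearest B row for each A row, keeping the distance between the two times
res <- B[A, on = "t", roll = "nearest",
         .(t = i.t, a, b, dist = abs(x.t - i.t))]
res[dist > margin, b := NA_character_]  # too far away: no valid match
res
#       t a    b dist
# 1: 1.00 1    x 0.02
# 2: 2.00 2 <NA> 0.98
# 3: 5.00 3 <NA> 0.80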

Merging two sets of data by data.table roll='nearest' function

Here is a step-by-step example based on the sample data you gave (reproduced under "Sample data" below):

# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)

# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]

unique(set_merged[order(ID_b)], by = "time")
#    ID_b   b1     b2     source   time_b     time ID_a    a1     a2       a3
# 1:    2 34.2 15.114 set1.csv.1 20.35750 20.01000    8 85640 5274.1 301.6041
# 2:    7 67.2 16.114 set1.csv.2 21.35778 21.00972    7 85697 5345.2 301.6043
# 3:   12 12.2 33.114 set1.csv.3 22.35806 22.00972    4 65694 9375.2 301.6049
# 4:   17 73.2 67.114 set2.csv.1 23.35833 23.00972    3 85694 9278.9 301.6051
# 5:   23 88.2 42.114 set2.csv.2 19.35861 19.00972    5 85653 4375.5 301.6047
# 6:   28 90.2 52.114 set3.csv.1  0.35889  0.00944    2 35694 5245.2 301.6053
#      time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6:  0.00944

Two comments:

  1. We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
  2. We use unique to remove duplicated time rows by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged", but in case you do want to retain specific rows, you might need to adjust this line of code.

Update (thanks to @Henrik)

As @Henrik pointed out, what you're after is actually a rolling join of set_A with respect to set_B, in which case you don't need to deal with the duplicate rows.

That translates to

library(data.table)
setDT(set_A)
setDT(set_B)

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]

set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
#    ID_a    a1     a2       a3   time_a     time ID_b   b1     b2     source
# 1:    2 35694 5245.2 301.6053  0.00944  0.35889   28 90.2 52.114 set3.csv.1
# 2:    3 85694 9278.9 301.6051 23.00972 23.35833   17 73.2 67.114 set2.csv.1
# 3:    5 85653 4375.5 301.6047 19.00972 19.35861   23 88.2 42.114 set2.csv.2
# 4:    6 12694 5236.3 301.6045 22.00972 22.35806   12 12.2 33.114 set1.csv.3
# 5:    7 85697 5345.2 301.6043 21.00972 21.35778    7 67.2 16.114 set1.csv.2
# 6:    9 30694 5279.0 301.6039 20.01000 20.35750    2 34.2 15.114 set1.csv.1
#      time_b
# 1:  0.35889
# 2: 23.35833
# 3: 19.35861
# 4: 22.35806
# 5: 21.35778
# 6: 20.35750

Sample data

set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)

set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)

Rolling join in Data.Table by two variables without creating duplicates

First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data.table to a data frame.

You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.
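For concreteness, hypothetical tables consistent with the discussion (scheduled departures in df1, observed departures in df2) might look like this:

library(data.table)

df1 <- data.table(stop_id = c(1L, 1L, 2L),
                  departure = as.POSIXct(c("2020-06-01 10:00:00",
                                           "2020-06-01 10:30:00",
                                           "2020-06-01 10:05:00"), tz = "UTC"))
df2 <- data.table(stop_id = c(1L, 2L),
                  departure = as.POSIXct(c("2020-06-01 10:31:00",
                                           "2020-06-01 10:10:00"), tz = "UTC"))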

Here's one way to do what you want with a non-equi join:

setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
    ][, max_departure := NULL]

We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed. The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns. Also note that the syntax used is taken from this answer (the version with .SD instead of df1 doesn't work in this case; I don't know why).


EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10:00 and 10:30, but the one at 10:00 never takes place, and an observed departure is at 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10:00? If that's the case, perhaps this will work:

setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")

  # next scheduled departure in the same group
  # (fill: one second past the last max_departure, i.e. never binding)
  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
                          type = "lead")

  invalid_max <- max_departure >= next_departure

  max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
    ][, max_departure := NULL]

The max_dep helper checks, for each stop and scheduled departure, what the next scheduled departure would be, and sets max_departure to "next minus 1 second" whenever the next departure falls within the one-hour window.

The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option; in my example that means 10:31 would be valid for both 10:30 and 10:00.

rolling join two time series in data.table

You could use merge with all = TRUE and setnafill with type = "locf":

setnafill(merge(dt1, dt2, all = TRUE), type = "locf")[]

Key: <k>
       k   v.x   v.y
   <num> <num> <num>
1:     2    NA    11
2:     3    10    11
3:     4    10    13
4:     5    15    13
5:     6    15     6
6:     7     9     6
7:     9     7     6
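dt1 and dt2 aren't reproduced here; a pair of keyed tables consistent with this output would be (a reconstruction, not necessarily the question's data):

library(data.table)
dt1 <- data.table(k = c(3, 5, 7, 9), v = c(10, 15, 9, 7), key = "k")
dt2 <- data.table(k = c(2, 4, 6), v = c(11, 13, 6), key = "k")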

Data.Table rolling join by group

Another option is to use the last function:

dt[, last(value[time < test.day]), by = .(loc.x, loc.y)]

which gives:

   loc.x loc.y V1
1:     1     1  a
2:     1     2  f
3:     3     1  c
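The question's dt and test.day aren't shown above; hypothetical data consistent with this output would be the following (note that value must be ordered by time within each group for last to pick the most recent one):

library(data.table)

test.day <- as.Date("2020-01-10")
dt <- data.table(loc.x = c(1, 1, 1, 3, 3, 3),
                 loc.y = c(1, 2, 2, 1, 1, 1),
                 time  = as.Date(c("2020-01-05", "2020-01-04", "2020-01-08",
                                   "2020-01-03", "2020-01-09", "2020-01-12")),
                 value = c("a", "e", "f", "b", "c", "d"))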

Cartesian Rolling Join using Data.table

If I understand your intent correctly, you want to roll the records over for 90 days. I used a cross join and then subset using the roll-over criterion.

Your original tables:

library(data.table)

dates = structure(list(date = structure(c(17562, 17590, 17621, 17651,
17682, 17712, 17743, 17774, 17804, 17835, 17865, 17896), class = "Date")),
row.names = c(NA, -12L), class = "data.frame")

dat = structure(list(date = structure(c(17546, 17743, 17778, 17901,
17536, 17806, 17901, 17981, 17532, 17722, 17969, 18234), class = "Date"),
country = structure(c(1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 3L,
3L, 3L, 3L), .Label = c("AAA", "BBB", "CCC"), class = "factor"),
state = structure(c(1L, 1L, 2L, 3L, 4L, 1L, 2L, 5L, 6L, 1L,
2L, 2L), .Label = c("S1", "S2", "S3", "S4", "S5", "S6"), class = "factor"),
item = structure(c(1L, 2L, 4L, 6L, 3L, 5L, 3L, 2L, 2L, 4L,
5L, 7L), .Label = c("M1", "M2", "M3", "M4", "M5", "M6", "M7"
), class = "factor"), value = c(67L, 10L, 50L, 52L, 93L,
50L, 62L, 46L, 6L, 30L, 30L, 14L)), row.names = c(NA, -12L
), class = "data.frame")

dates = data.table(dates)
dat = data.table(dat)

Note that I haven't set keys on the tables.

I am using a cross join function from the reference: How to do cross join in R?

# cross join helper: add a constant dummy key k to both tables, join on it
# (the full cross product, hence allow.cartesian), then drop the helper column
CJ.table.1 <- function(X, Y)
  setkey(X[, c(k = 1, .SD)], k)[Y[, c(k = 1, .SD)], allow.cartesian = TRUE][, k := NULL]

Then I cross join, subset for the roll join, rename columns, and sort:

dsn1 <- CJ.table.1(dat, dates)[
  i.date - date <= 90 & i.date - date >= 0
][, .(date = i.date, country, state, item, value)
][order(country, state, item, value, date)]
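As a side note, the same 90-day window can be written as a non-equi join (introduced earlier on this page), which avoids materialising the full cross product. A sketch, where win_start and win_end are helper columns I'm introducing, not part of the original answer:

library(data.table)

# build the window in 'dates', then join 'dat' rows whose date falls inside it
dates[, `:=`(win_start = date - 90, win_end = date)]
dsn2 <- dat[dates,
            .(date = i.date, country, state, item, value),
            on = .(date >= win_start, date <= win_end),
            nomatch = 0L][order(country, state, item, value, date)]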

