R Data.Table Rolling Join "Mult" Not Working as Expected

When running dt2[dt1, roll = "nearest"] you are basically saying "return the rows from dt2 according to the nearest join to each row in dt1 using the key". So

  • dates2 in row one in dt2 is the nearest to dates1 in row one in dt1
  • dates2 in row one in dt2 is the nearest to dates1 in row two in dt1
  • dates2 in row two in dt2 is the nearest to dates1 in row three in dt1

Hence,

dt2[dt1, roll = "nearest"]
#                 dates2 values2 values1
# 1: 2015-10-26 12:00:00       A       a
# 2: 2015-10-26 13:00:00       A       b
# 3: 2015-10-26 14:00:00       C       c

These are all the rows from dt1 with the joined values2 from dt2.


Instead, we want to join the other way around: "extract values2 from dt2 according to the nearest join for each row in dt2 using the key, and update the matched rows in dt1". That is,

dt1[dt2, roll = "nearest", values2 := i.values2] 
dt1
#                 dates1 values1 values2
# 1: 2015-10-26 12:00:00       a       A
# 2: 2015-10-26 13:00:00       b      NA
# 3: 2015-10-26 14:00:00       c       C

Some additional notes

  • You don't need to wrap first in data.frame and then in data.table; you can just do dt1 <- data.table(dates1, values1), etc.
  • While you're at it, you can set the key on the fly using the key parameter of data.table, namely dt1 <- data.table(dates1, values1, key = "dates1"), etc.
  • Or you can skip setting keys altogether and use on instead (v1.9.6+), namely dt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
  • Finally, please refrain from making unnecessary copies, e.g., instead of <- and data.table(df) use := and setDT(df), see here for more information. These points are combined in the sketch below.
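
Putting those notes together, here is a minimal sketch (assuming the vectors dates1, values1, dates2, values2 from the question already exist):

library(data.table)

# Build the tables directly as keyed data.tables -- no data.frame detour
dt1 <- data.table(dates1, values1, key = "dates1")
dt2 <- data.table(dates2, values2, key = "dates2")

# Update join by reference: pull the nearest values2 from dt2 into dt1
dt1[dt2, roll = "nearest", values2 := i.values2]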

R rolling join two data.tables with error margin on join

A data.table answer has been given here by user Uwe:

https://stackoverflow.com/a/62321710/12079387
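
The linked answer is not reproduced here. As a general illustration of joining with an error margin (hypothetical data; the linked answer may take a different approach), one common pattern is a non-equi join on a tolerance window:

library(data.table)

# Hypothetical data: match each observation to a reference time within +/- 5 seconds
ref <- data.table(ref_id = 1:3,
                  t = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + c(0, 60, 120))
obs <- data.table(obs_id = 1:3,
                  t_obs = as.POSIXct("2020-01-01 00:00:00", tz = "UTC") + c(2, 59, 200))

margin <- 5  # seconds
obs[, `:=`(t_lo = t_obs - margin, t_hi = t_obs + margin)]

# Observations with no reference inside the margin come back as NA
ref[obs, on = .(t >= t_lo, t <= t_hi)]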

Rolling join in Data.Table by two variables without creating duplicates

First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data table to a data frame.

You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.

Here's one way to do what you want with a non-equi join:

setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
    ][, max_departure := NULL]

We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed. The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns. Also note that the syntax used is taken from this answer (the version with .SD instead of df1 doesn't work in this case, I don't know why).


EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10 and 10:30, but the one at 10 never takes place, and an observed departure is 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10? If that's the case, perhaps this will work:

setkey(df1, departure)
setkey(df2, departure)

max_dep <- function(departure) {
  max_departure <- departure + as.difftime(1, units = "hours")

  next_departure <- shift(departure,
                          fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
                          type = "lead")

  invalid_max <- max_departure >= next_departure

  max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
  max_departure
}

df1[, max_departure := max_dep(departure), by = "stop_id"
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure, departure <= max_departure),
                                  mult = "first"]
    ][, max_departure := NULL]

The max_dep helper checks, for each stop and scheduled departure, what the next scheduled departure would be, and sets max_departure to "next minus 1 second" whenever the next departure falls within the one-hour window.

The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option. In my example that means 10:31 would be valid for both 10:30 and 10.
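
To make the difference concrete, here is a minimal sketch of that scenario with hypothetical data (reusing max_dep from above):

library(data.table)

# Hypothetical data: departures scheduled at 10:00 and 10:30, one observed at 10:31
df1 <- data.table(stop_id = 1L,
                  departure = as.POSIXct(c("2020-01-01 10:00:00", "2020-01-01 10:30:00")))
df2 <- data.table(stop_id = 1L,
                  departure = as.POSIXct("2020-01-01 10:31:00"))
setkey(df1, departure)
setkey(df2, departure)

# Plain one-hour window: 10:31 matches BOTH scheduled departures (the duplicate)
df1[, max_departure := departure + as.difftime(1, units = "hours")]
df2[df1, x.departure,
    on = .(stop_id, departure >= departure, departure <= max_departure),
    mult = "first"]

# Capped window via max_dep: 10:31 matches only 10:30; 10:00 gets NA
df1[, max_departure := max_dep(departure), by = "stop_id"]
df2[df1, x.departure,
    on = .(stop_id, departure >= departure, departure <= max_departure),
    mult = "first"]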

Rolling join two data.tables with date in R

We may use a non-equi join with mult = "first":

dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
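
A minimal sketch with hypothetical data (the tables below are assumptions, not the OP's): for each row of dt2, the first row of dt1 in the same group with a later date1 receives that row's date2.

library(data.table)

# Hypothetical data
dt1 <- data.table(group = c("a", "a", "b"),
                  date1 = as.IDate(c("2020-01-05", "2020-01-10", "2020-01-07")))
dt2 <- data.table(group = c("a", "b"),
                  date2 = as.IDate(c("2020-01-03", "2020-01-06")))

dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
# Row 1 of dt1 gets 2020-01-03 and row 3 gets 2020-01-06; row 2 stays NA because
# mult = "first" updates only the first matching row of dt1 per row of dt2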

rolling joins in data.table with multiple matches

First Row

For the first row of X, only the first row of Y has matching x and y, so the first row of Y will match iff Y$t[1] is between X$t[1] and X$t[1] + 0.005; but in fact Y$t[1] < X$t[1], as seen here:

> X$t[1] - Y$t[1]
Time difference of 0.0009999275 secs

One would need a positive roll= value whose magnitude is at least equal to the above difference in order to get a non-NA in row 1, e.g.:

> Y[X, roll=0.001]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00   3
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00  NA
4: 2 FALSE 2013-06-20 08:00:00  NA
5: 2 FALSE 2013-06-20 08:00:00  NA

Note that you could force it by using rollends= like this:

> Y[X, roll = -0.005, rollends = TRUE]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00   3
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

Last Row

For the last row of X, only row 5 of Y matches, not 5, 6 and 7, since only the nearest eligible row is a match. mult= only applies to multiple matches and usually does not apply with roll= (see the example at the bottom for an exception).

Note also that rows 5, 6 and 7 of Y do not have the same times. They have increasing times, so there is no possibility that they all match:

> dput(Y[["t"]])
structure(c(1371729600.407, 1371729600.285, 1371729600.791, 1371729600.887,
1371729600.941, 1371729600.942, 1371729600.945), class = c("POSIXct",
"POSIXt"))

Even if rows 5, 6 and 7 of Y had identical times, if those times were not the same as the time in the last row of X, one would still get only one row out.

> # times in rows 5, 6 and 7 of Y2 are same
> Y2 <- copy(Y)
> Y2[, t:= t[c(1:4, 5, 5, 5)]]
> setkey(Y2, x, y, t)
> Y2[X, roll = -0.005]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

It's only if rows 5, 6 and 7 of Y have the same times and the last row of X also has that time that one gets multiple rows out, and in that case mult= can apply:

> # time in row 5 of X2 same as the times in rows 5, 6 and 7 of Y2
> X2 <- copy(X)
> X2[, t:=c(t[1:4], Y2[["t"]][5])]
> Y2[X2, roll = -0.005]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6
6: 2 FALSE 2013-06-20 08:00:00   7
7: 2 FALSE 2013-06-20 08:00:00   5
>
> Y2[X2, roll = -0.005, mult = "first"]
   x     y                   t IDX
1: 1 FALSE 2013-06-20 08:00:00  NA
2: 1  TRUE 2013-06-20 08:00:00  NA
3: 1  TRUE 2013-06-20 08:00:00   2
4: 2 FALSE 2013-06-20 08:00:00   4
5: 2 FALSE 2013-06-20 08:00:00   6

How this works is not so clear from the documentation, and I had to use trial and error to discover how it functions. ?data.table does say that "Usually, there should be no duplicates in x's key" (in our example here, x is Y), so it may be that the developers wished to leave the behaviour undefined for this situation and open to future change.

Using mult= in the way you describe does seem like a very interesting idea, but it is not the way it currently works. Maybe it could be in the future.

using roll in data.table join, can i force strictly one match

If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column of shifted times:

library(lubridate)
library(data.table)

ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
          as.POSIXct("2017-01-02", tz = "UTC"),
          by = "30 min")

commercial <- data.table(
  c_row_number = 1:10,
  c_time = ts[1:10],
  c_next_time = shift(ts[1:10], type = "lead", fill = max(ts))
)

sale <- data.table(
  s_row_number = 1:4,
  s_time = ts[5:8] + minutes(5),
  s_time_join = ts[5:8] + minutes(5)
)

tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]

And if you want to use this idiom:

commercial[, s_time := sale[.SD,
                            .(s_time),
                            on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
    c_row_number              c_time         c_next_time              s_time
 1:            1 2017-01-01 00:00:00 2017-01-01 00:30:00                <NA>
 2:            2 2017-01-01 00:30:00 2017-01-01 01:00:00                <NA>
 3:            3 2017-01-01 01:00:00 2017-01-01 01:30:00                <NA>
 4:            4 2017-01-01 01:30:00 2017-01-01 02:00:00                <NA>
 5:            5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
 6:            6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
 7:            7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
 8:            8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
 9:            9 2017-01-01 04:00:00 2017-01-01 04:30:00                <NA>
10:           10 2017-01-01 04:30:00 2017-01-02 00:00:00                <NA>

Merging two sets of data by data.table roll='nearest' function

Here is a step-by-step example based on the sample data you give:

# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)

# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]

unique(set_merged[order(ID_b)], by = "time")
#    ID_b   b1     b2     source   time_b     time ID_a    a1     a2       a3
# 1:    2 34.2 15.114 set1.csv.1 20.35750 20.01000    8 85640 5274.1 301.6041
# 2:    7 67.2 16.114 set1.csv.2 21.35778 21.00972    7 85697 5345.2 301.6043
# 3:   12 12.2 33.114 set1.csv.3 22.35806 22.00972    4 65694 9375.2 301.6049
# 4:   17 73.2 67.114 set2.csv.1 23.35833 23.00972    3 85694 9278.9 301.6051
# 5:   23 88.2 42.114 set2.csv.2 19.35861 19.00972    5 85653 4375.5 301.6047
# 6:   28 90.2 52.114 set3.csv.1  0.35889  0.00944    2 35694 5245.2 301.6053
#      time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6:  0.00944

Two comments:

  1. We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
  2. We use unique to remove duplicated time rows by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged", but in case you do want to retain specific rows, you might need to adjust this line of code.

Update (thanks to @Henrik)

As @Henrik pointed out, what you're after is actually a rolling join of set_A with respect to set_B, in which case you don't need to deal with the duplicate rows.

That translates to

library(data.table)
setDT(set_A)
setDT(set_B)

# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]

set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
#    ID_a    a1     a2       a3   time_a     time ID_b   b1     b2     source
# 1:    2 35694 5245.2 301.6053  0.00944  0.35889   28 90.2 52.114 set3.csv.1
# 2:    3 85694 9278.9 301.6051 23.00972 23.35833   17 73.2 67.114 set2.csv.1
# 3:    5 85653 4375.5 301.6047 19.00972 19.35861   23 88.2 42.114 set2.csv.2
# 4:    6 12694 5236.3 301.6045 22.00972 22.35806   12 12.2 33.114 set1.csv.3
# 5:    7 85697 5345.2 301.6043 21.00972 21.35778    7 67.2 16.114 set1.csv.2
# 6:    9 30694 5279.0 301.6039 20.01000 20.35750    2 34.2 15.114 set1.csv.1
#      time_b
# 1:  0.35889
# 2: 23.35833
# 3: 19.35861
# 4: 22.35806
# 5: 21.35778
# 6: 20.35750

Sample data

set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)

set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)

Nearest n rolling join in R data table

Here is something very raw (we go row by row):

n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
  set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}

dt1[, .(id1, nearest = unlist(nearest)), by = x
    ][, id2 := dt2$id2[nearest]
    ][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]

#      x id1 nearest id2 roll
# 1:  15   x       1   a  nr1
# 2:  15   x       2   b  nr2
# 3: 101   y       2   b  nr2
# 4: 101   y       3   c  nr1

Slightly cleaner:

dt1[,
    {
      nrank <- frank(abs(x - dt2$x), ties.method = "first")
      nearest <- which(nrank %in% sen)
      .(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
    },
    by = id1] # assumes unique ids.

Data:

dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))

EDIT (as suggested/written by OP)
Joining with multiple keys:

dt1[,
    {
      g <- group
      dt_tmp <- dt2[dt2$group == g]
      nrank <- frank(abs(x - dt_tmp$x), ties.method = "first")
      nearest <- which(nrank %in% sen)
      .(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
    },
    by = id1]

data.table join + update with mult='first' gives unexpected result

Reading the documentation for data.table's mult more closely, it says that:

When i is a list (or data.frame or data.table) and multiple rows in x
match to the row in i, mult controls which are returned: "all"
(default), "first" or "last".

So if there are multiple rows in x ("users") that match a row in i ("transactions"), then mult = "first" will return the first row in x. However, in your case there aren't multiple rows in x that match a row in i; rather, there are multiple rows in i that match a row in x.

As @Arun suggested, the best option would be to change around your join so that mult = "first" is relevant:

users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]

users
#    UserID     User Gender Registered  Cancelled FirstTransactionDate
# 1:      1  Charles   male 2012-12-21       <NA>           2012-08-26
# 2:      2    Pedro   male 2010-08-01 2010-08-08           2013-06-06
# 3:      3 Caroline female 2012-10-23 2016-06-07           2011-05-26
# 4:      4  Brielle female 2013-07-17       <NA>                 <NA>
# 5:      5 Benjamin   male 2010-11-25       <NA>                 <NA>

Another option would be to change up your merge slightly:

users[transactions[, FirstTransactionDate := min(TransactionDate), by = UserID],
      FirstTransactionDate := FirstTransactionDate, on = "UserID"]

I just create the first transaction date within the transactions dataset. It gets merged on multiple times, but that's fine because it's always the same value for a given UserID.
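
Note that the := in the inner call adds the FirstTransactionDate column to transactions itself by reference. If you would rather leave transactions untouched, a variant (a sketch, not part of the original answer) is to aggregate into a temporary table first:

# Aggregate into a temporary table instead of modifying transactions by reference
first_dates <- transactions[, .(FirstTransactionDate = min(TransactionDate)), by = UserID]
users[first_dates, FirstTransactionDate := i.FirstTransactionDate, on = "UserID"]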


