R data.table rolling join mult not working as expected
When running dt2[dt1, roll = "nearest"] you are basically saying "return the rows from dt2 according to the nearest join to each row in dt1 using the key". So:
- dates2 in row one of dt2 is the nearest to dates1 in row one of dt1
- dates2 in row one of dt2 is the nearest to dates1 in row two of dt1
- dates2 in row two of dt2 is the nearest to dates1 in row three of dt1
Hence,
dt2[dt1, roll = "nearest"]
# dates2 values2 values1
# 1: 2015-10-26 12:00:00 A a
# 2: 2015-10-26 13:00:00 A b
# 3: 2015-10-26 14:00:00 C c
These are all the rows from dt1, with values2 joined from dt2.
Instead, we want to join the other way around, namely "extract values2 from dt2 according to the nearest join for each row in dt2 using the key, and update the matched rows in dt1", namely
dt1[dt2, roll = "nearest", values2 := i.values2]
dt1
# dates1 values1 values2
# 1: 2015-10-26 12:00:00 a A
# 2: 2015-10-26 13:00:00 b NA
# 3: 2015-10-26 14:00:00 c C
Some additional notes:
- You don't need to wrap first to data.frame and then to data.table; you can just do dt1 <- data.table(dates1, values1) and so on.
- While you're at it, you can also set the key on the fly using the key parameter of data.table, namely dt1 <- data.table(dates1, values1, key = "dates1") and so on.
- Or you can skip setting keys altogether and use on instead (v1.9.6+), namely dt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")].
- Finally, please refrain from making unnecessary copies: instead of <- and data.table(df), use := and setDT(df); see here for more information.
R rolling join two data.tables with error margin on join
A data.table answer has been given here by user Uwe: https://stackoverflow.com/a/62321710/12079387
Rolling join in Data.Table by two variables without creating duplicates
First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data table to a data frame.
You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.
Here's one way to do what you want with a non-equi join:
setkey(df1, departure)
setkey(df2, departure)
df1[, max_departure := departure + as.difftime(1, units = "hours")
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
We order by departure (via setkey) so that mult = "first" returns the closest match in the future within what's allowed. The intermediate column max_departure has to be assigned and subsequently removed because non-equi joins can only use existing columns. Also note that the syntax used takes from this answer (the version with .SD instead of df1 doesn't work in this case, I don't know why).
EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10 and 10:30, but the one at 10 never takes place, and an observed departure is 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10? If that's the case, perhaps this will work:
setkey(df1, departure)
setkey(df2, departure)
max_dep <- function(departure) {
max_departure <- departure + as.difftime(1, units = "hours")
next_departure <- shift(departure,
fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
type = "lead")
invalid_max <- max_departure >= next_departure
max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
max_departure
}
df1[, max_departure := max_dep(departure), by = "stop_id"
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
The max_dep helper checks, for each stop and scheduled departure, what would be the next scheduled departure, and sets max_departure as "next minus 1 second" if the next departure is within one hour. The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option; in my example that means 10:31 would be valid for both 10:30 and 10.
Rolling join two data.tables with date in R
We may use a non-equi join with mult = "first":
dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
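Since the question's tables aren't shown, here is a hedged, self-contained sketch with invented data. For each row of dt2, the join finds the dt1 rows in the same group with date1 > date2, and, because dt1 is sorted ascending by date1, mult = "first" updates only the earliest such row:

```r
library(data.table)

# Hypothetical data (the original question's tables aren't shown)
dt1 <- data.table(group = c("g1", "g1", "g2"),
                  date1 = as.IDate(c("2020-01-05", "2020-01-10", "2020-01-07")))
dt2 <- data.table(group = c("g1", "g2"),
                  date2 = as.IDate(c("2020-01-03", "2020-01-06")))

setorder(dt1, group, date1)  # so "first" means the earliest qualifying date1
dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
dt1  # the first g1 row and the g2 row get date_2; the later g1 row stays NA
```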
rolling joins in data.table with multiple matches
First Row
For the first row of X, only the first row of Y has matching x and y, so the first row of Y will match iff Y$t[1] is between X$t[1] and X$t[1] + 0.005. But in fact Y$t[1] < X$t[1], as seen here:
> X$t[1] - Y$t[1]
Time difference of 0.0009999275 secs
One would need a positive roll= value whose magnitude is at least equal to the above difference in order to get a non-NA in row 1, e.g.
> Y[X, roll=0.001]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 3
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 NA
4: 2 FALSE 2013-06-20 08:00:00 NA
5: 2 FALSE 2013-06-20 08:00:00 NA
Note that you could force it by using rollends= like this:
> Y[X, roll = -0.005, rollends = TRUE]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 3
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
Last Row
For the last row of X, only row 5 of Y matches, not 5, 6 and 7, since only the nearest eligible row is a match. mult= only applies to multiple matches and usually does not apply with roll= (see the example at the bottom for an exception).
Note also that rows 5, 6 and 7 of Y do not have the same times; they have increasing times, so there is no possibility that they all match:
> dput(Y[["t"]])
structure(c(1371729600.407, 1371729600.285, 1371729600.791, 1371729600.887,
1371729600.941, 1371729600.942, 1371729600.945), class = c("POSIXct",
"POSIXt"))
Even if rows 5, 6 and 7 of Y had identical times, if those times are not the same as the time in the last row of X, then one only gets one row out.
> # times in rows 5, 6 and 7 of Y2 are same
> Y2 <- copy(Y)
> Y2[, t:= t[c(1:4, 5, 5, 5)]]
> setkey(Y2, x, y, t)
> Y2[X, roll = -0.005]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
It's only if rows 5, 6 and 7 of Y have the same times, and the last row of X also has that time, that one gets multiple rows out, and in that case mult= can apply:
> # time in row 5 of X2 same as the times in rows 5, 6 and 7 of Y2
> X2 <- copy(X)
> X2[, t:=c(t[1:4], Y2[["t"]][5])]
> Y2[X2, roll = -0.005]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
6: 2 FALSE 2013-06-20 08:00:00 7
7: 2 FALSE 2013-06-20 08:00:00 5
>
> Y2[X, roll = -0.005, mult = "first"]
x y t IDX
1: 1 FALSE 2013-06-20 08:00:00 NA
2: 1 TRUE 2013-06-20 08:00:00 NA
3: 1 TRUE 2013-06-20 08:00:00 2
4: 2 FALSE 2013-06-20 08:00:00 4
5: 2 FALSE 2013-06-20 08:00:00 6
How this works is not so clear from the documentation, and I had to use trial and error to discover how it functions. ?data.table does say that "Usually, there should be no duplicates in x's key" (in our example here, x is Y), so it may be that the developers wished to leave this situation undefined and open to future change. The idea of using mult= as you describe does seem very interesting, but it is not the way it currently works. Maybe it could be in the future.
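To make that exception concrete, here is a hedged sketch (toy data, not from the question) showing that mult= kicks in only when x's key actually contains duplicates that match the search key exactly:

```r
library(data.table)

# Toy data: duplicated key value 1 in Y's key
Y <- data.table(k = c(1, 1, 2), v = c("a", "b", "c"), key = "k")
X <- data.table(k = c(1, 3), key = "k")

Y[X, roll = TRUE]                  # k = 1 matches both "a" and "b" exactly;
                                   # k = 3 rolls to the last observation, "c"
Y[X, roll = TRUE, mult = "first"]  # mult= collapses the duplicated k = 1 match to "a"
```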
using roll in data.table join, can i force strictly one match
If your commercial times are sorted, or you can sort them, then you can use a non-equi join with a helper column of shifted times:
library(lubridate)
library(data.table)
ts <- seq(as.POSIXct("2017-01-01", tz = "UTC"),
as.POSIXct("2017-01-02", tz = "UTC"),
by = "30 min")
commercial <-
data.table(
c_row_number = 1:10,
c_time = ts[1:10],
c_next_time = shift(ts[1:10], type = "lead", fill = max(ts))
)
sale <-
data.table(
s_row_number = 1:4,
s_time = ts[5:8] + minutes(5),
s_time_join = ts[5:8] + minutes(5)
)
tbl_joined <- sale[commercial, on = .(s_time_join >= c_time, s_time_join < c_next_time)]
And if you want to use this idiom:
commercial[, s_time := sale[.SD,
.(s_time),
on = .(s_time_join >= c_time, s_time_join < c_next_time)]]
print(commercial)
c_row_number c_time c_next_time s_time
1: 1 2017-01-01 00:00:00 2017-01-01 00:30:00 <NA>
2: 2 2017-01-01 00:30:00 2017-01-01 01:00:00 <NA>
3: 3 2017-01-01 01:00:00 2017-01-01 01:30:00 <NA>
4: 4 2017-01-01 01:30:00 2017-01-01 02:00:00 <NA>
5: 5 2017-01-01 02:00:00 2017-01-01 02:30:00 2017-01-01 02:05:00
6: 6 2017-01-01 02:30:00 2017-01-01 03:00:00 2017-01-01 02:35:00
7: 7 2017-01-01 03:00:00 2017-01-01 03:30:00 2017-01-01 03:05:00
8: 8 2017-01-01 03:30:00 2017-01-01 04:00:00 2017-01-01 03:35:00
9: 9 2017-01-01 04:00:00 2017-01-01 04:30:00 <NA>
10: 10 2017-01-01 04:30:00 2017-01-02 00:00:00 <NA>
Merging two sets of data by data.table roll='nearest' function
Here is a step-by-step example based on the sample data you give:
# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)
# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]
unique(set_merged[order(ID_b)], by = "time")
# ID_b b1 b2 source time_b time ID_a a1 a2 a3
# 1: 2 34.2 15.114 set1.csv.1 20.35750 20.01000 8 85640 5274.1 301.6041
# 2: 7 67.2 16.114 set1.csv.2 21.35778 21.00972 7 85697 5345.2 301.6043
# 3: 12 12.2 33.114 set1.csv.3 22.35806 22.00972 4 65694 9375.2 301.6049
# 4: 17 73.2 67.114 set2.csv.1 23.35833 23.00972 3 85694 9278.9 301.6051
# 5: 23 88.2 42.114 set2.csv.2 19.35861 19.00972 5 85653 4375.5 301.6047
# 6: 28 90.2 52.114 set3.csv.1 0.35889 0.00944 2 35694 5245.2 301.6053
# time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6: 0.00944
Two comments:
- We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
- We use unique to remove duplicated time rows, keeping the first by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged", but in case you do want to retain specific rows, you might need to adjust this line of code.
Update (thanks to @Henrik)
As @Henrik pointed out, what you're after is actually a rolling join of set_A with respect to set_B, in which case you don't need to deal with the duplicate rows. That translates to:
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
# ID_a a1 a2 a3 time_a time ID_b b1 b2 source
#1: 2 35694 5245.2 301.6053 0.00944 0.35889 28 90.2 52.114 set3.csv.1
#2: 3 85694 9278.9 301.6051 23.00972 23.35833 17 73.2 67.114 set2.csv.1
#3: 5 85653 4375.5 301.6047 19.00972 19.35861 23 88.2 42.114 set2.csv.2
#4: 6 12694 5236.3 301.6045 22.00972 22.35806 12 12.2 33.114 set1.csv.3
#5: 7 85697 5345.2 301.6043 21.00972 21.35778 7 67.2 16.114 set1.csv.2
#6: 9 30694 5279.0 301.6039 20.01000 20.35750 2 34.2 15.114 set1.csv.1
# time_b
#1: 0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
Sample data
set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)
set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)
Nearest n rolling join in R data table
Here is something very raw (we go row by row):
n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}
dt1[, .(id1, nearest = unlist(nearest)), by = x
][, id2 := dt2$id2[nearest]
][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]
# x id1 nearest id2 roll
# 1: 15 x 1 a nr1
# 2: 15 x 2 b nr2
# 3: 101 y 2 b nr2
# 4: 101 y 3 c nr1
Slightly cleaner:
dt1[,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1] # assumes unique ids.
Data:
dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
EDIT (as suggested/written by OP)
Joining with multiple keys:
dt1[,
{
g <- group
dt_tmp <- dt2[dt2$group == g]
nrank <- frank(abs(x - dt_tmp$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1]
data.table join + update with mult='first' gives unexpected result
Reading the documentation for data.table
's mult
more closely, it says that:
When i is a list (or data.frame or data.table) and multiple rows in x match to the row in i, mult controls which are returned: "all" (default), "first" or "last".
So if there are multiple rows in x ("users") that match to i ("transactions"), then mult will return the first row in x. However, in your case there aren't multiple rows in x that match to i; rather, there are multiple rows in i that match to x.
As @Arun suggested, the best option would be to change around your join so that mult = "first" is relevant:
users[, FirstTransactionDate := transactions[users, TransactionDate, on="UserID", mult = "first"]]
users
# UserID User Gender Registered Cancelled FirstTransactionDate
#1: 1 Charles male 2012-12-21 <NA> 2012-08-26
#2: 2 Pedro male 2010-08-01 2010-08-08 2013-06-06
#3: 3 Caroline female 2012-10-23 2016-06-07 2011-05-26
#4: 4 Brielle female 2013-07-17 <NA> <NA>
#5: 5 Benjamin male 2010-11-25 <NA> <NA>
Another option would be to change up your merge slightly:
users[transactions[,FirstTransactionDate := min(TransactionDate), by = UserID],
FirstTransactionDate := FirstTransactionDate, on="UserID"]
I just create the first transaction date within the transactions dataset. It gets merged on multiple times, but that should be fine because it's always the same value for a given UserID.