Rolling join on data.table with duplicate keys
The reason that t_id = 1 doesn't show up in the output is that a rolling join takes the row where the key combination occurs last. From the documentation (emphasis mine):
Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF).
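That LOCF behaviour is easy to see in a tiny self-contained sketch (toy data, not the question's):

```r
library(data.table)

# Prices observed at times 1 and 3; we look up times 2 and 4.
prices  <- data.table(time = c(1, 3), price = c(10, 20), key = "time")
queries <- data.table(time = c(2, 4), key = "time")

# roll = TRUE: a query time falling in a gap gets the last observed price (LOCF).
res <- prices[queries, roll = TRUE]
res
#    time price
# 1:    2    10
# 2:    4    20
```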
Let's consider somewhat larger datasets:
> DT
t_id airport thisTime
1: 1 a 5.1
2: 4 a 5.1
3: 3 a 5.1
4: 2 d 6.2
5: 5 d 6.2
> DT_LU
f_id airport thisTime
1: 1 a 6
2: 2 a 6
3: 2 a 8
4: 1 b 7
5: 1 c 8
6: 2 d 7
7: 1 d 9
When you perform a rolling join just like in your question:
DT[DT_LU, nomatch=0, roll=Inf]
you get:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 3 a 8 2
4: 5 d 7 2
5: 5 d 9 1
As you can see, for both key combinations (a, 5.1) and (d, 6.2) the last row is used for the joined data.table. Because you use Inf as the roll value, all future values are incorporated in the resulting data.table. When you use:
DT[DT_LU, nomatch=0, roll=1]
you see that only the first value in the future is included:
t_id airport thisTime f_id
1: 3 a 6 1
2: 3 a 6 2
3: 5 d 7 2
If you want the f_id's for all combinations of airport & thisTime where DT$thisTime is lower than DT_LU$thisTime, you can achieve that by creating a new variable (or replacing the existing thisTime) by means of the ceiling function. An example where I create a new variable thisTime2 and then do a normal join with DT_LU:
DT[, thisTime2 := ceiling(thisTime)]
setkey(DT, airport, thisTime2)[DT_LU, nomatch=0]
which gives:
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 4 a 5.1 6 1
3: 3 a 5.1 6 1
4: 1 a 5.1 6 2
5: 4 a 5.1 6 2
6: 3 a 5.1 6 2
7: 2 d 6.2 7 2
8: 5 d 6.2 7 2
Applied to the data you provided:
> dt[, thisTime2 := ceiling(thisTime)]
> setkey(dt, airport, thisTime2)[dt_lookup, nomatch=0]
t_id airport thisTime thisTime2 f_id
1: 1 a 5.1 6 1
2: 3 a 5.1 6 1
3: 1 a 5.1 6 2
4: 3 a 5.1 6 2
When you want to include all the future values instead of only the first one, you need a somewhat different approach, for which you will need the i.col functionality (which is not documented yet):
1: First set the key to only the airport column:
setkey(DT, airport)
setkey(DT_LU, airport)
2: Use the i.col functionality in j to get what you want as follows:
DT1 <- DT_LU[DT, .(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[i.thisTime < thisTime],
fid = f_id[i.thisTime < thisTime]),
by=.EACHI]
this gives you:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 1 5.1 8 2
4: a 4 5.1 6 1
5: a 4 5.1 6 2
6: a 4 5.1 8 2
7: a 3 5.1 6 1
8: a 3 5.1 6 2
9: a 3 5.1 8 2
10: d 2 6.2 7 2
11: d 2 6.2 9 1
12: d 5 6.2 7 2
13: d 5 6.2 9 1
Some explanation: when you join two data.tables that share column names, you can refer to the columns of the data.table in i by prefixing the column names with i.. Now it's possible to compare thisTime from DT with thisTime from DT_LU. With by = .EACHI you ensure that all combinations for which the condition holds are included in the resulting data.table.
Alternatively, you can achieve the same with:
DT2 <- DT_LU[DT, .(airport=i.airport,
tid=i.t_id,
tTime=i.thisTime,
fTime=thisTime[i.thisTime < thisTime],
fid=f_id[i.thisTime < thisTime]),
allow.cartesian=TRUE]
which gives the same result:
> identical(DT1, DT2)
[1] TRUE
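The i.-prefix plus by = .EACHI pattern is general; here is a minimal self-contained sketch with made-up data (grp, v and cutoff are illustrative names, not from the question):

```r
library(data.table)

lookup <- data.table(grp = c("a", "a", "b"), v = c(1, 3, 5), key = "grp")
keys   <- data.table(grp = c("a", "b"), cutoff = c(2, 4), key = "grp")

# For each row of `keys` (by = .EACHI), keep only the v values in `lookup`
# that exceed that row's cutoff; i.cutoff refers to the column from `keys`.
res <- lookup[keys, .(v = v[v > i.cutoff]), by = .EACHI]
res
#    grp v
# 1:   a 3
# 2:   b 5
```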
When you only want to include future values within a certain boundary, you can use:
DT1 <- DT_LU[DT,
{
idx = i.thisTime < thisTime & thisTime - i.thisTime < 2
.(tid = i.t_id,
tTime = i.thisTime,
fTime = thisTime[idx],
fid = f_id[idx])
},
by=.EACHI]
which gives:
> DT1
airport tid tTime fTime fid
1: a 1 5.1 6 1
2: a 1 5.1 6 2
3: a 4 5.1 6 1
4: a 4 5.1 6 2
5: a 3 5.1 6 1
6: a 3 5.1 6 2
7: d 2 6.2 7 2
8: d 5 6.2 7 2
When you compare that to the previous result, you see that rows 3, 6, 9, 11 and 13 have now been removed.
Data:
DT <- data.table(t_id = c(1,4,2,3,5),
airport = c("a","a","d","a","d"),
thisTime = c(5.1, 5.1, 6.2, 5.1, 6.2),
key=c("airport","thisTime"))
DT_LU <- data.table(f_id = c(rep(1,4),rep(2,3)),
airport = c("a","b","c","d","a","a","d"),
thisTime = c(6,7,8,9,6,8,7),
key=c("airport","thisTime"))
Rolling joins with duplicate keys
If you do:
slavedf[masterdf, roll = T]
This gives:
timeref mykey time var
1: 2019-01-29 2019-02-01 2019-02-01 1
2: 2019-01-29 2019-02-01 2019-02-01 2
3: 2019-01-29 2019-03-01 2019-03-01 3
Let's take df as our main data frame (X) and df1 as the secondary data frame (Y).
df <- data.frame(A = c("A", "A", "A", "B", "B"),
B = c(2, 2, 3, 4, 3)
)
df1 <- data.frame(A = c("A", "B", "C"),
val = c(1000, 100, 500)
)
Left join with dplyr
:
df %>% left_join(df1, by = "A")
A B val
1 A 2 1000
2 A 2 1000
3 A 3 1000
4 B 4 100
5 B 3 100
What is then the left join in data.table
? Perhaps:
setDT(df)
setDT(df1)
df[df1, on = "A"]
A B val
1: A 2 1000
2: A 2 1000
3: A 3 1000
4: B 4 100
5: B 3 100
6: C NA 500
Perhaps not - what about:
setDT(df)
setDT(df1)
df1[df, on = "A"]
A val B
1: A 1000 2
2: A 1000 2
3: A 1000 3
4: B 100 4
5: B 100 3
Yep! This is also explained in this vignette as X[DT, on="x"] # left join
.
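A quick self-contained check that df1[df, on = "A"] really behaves as a left join on df (same toy data as above), compared against base-R merge:

```r
library(data.table)

df  <- data.table(A = c("A", "A", "A", "B", "B"), B = c(2, 2, 3, 4, 3))
df1 <- data.table(A = c("A", "B", "C"), val = c(1000, 100, 500))

dt_left   <- df1[df, on = "A"]                        # data.table left join
base_left <- merge(as.data.frame(df), as.data.frame(df1),
                   by = "A", all.x = TRUE)            # base-R left join

# Every row of df is kept; the unmatched "C" row from df1 is dropped.
nrow(dt_left)   # 5
```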
Rolling join in Data.Table by two variables without creating duplicates
First of all, you could use unique instead of distinct (the latter presumably from dplyr; you don't specify) to avoid coercing the data table to a data frame. You were pretty close, but you need to switch the tables in the join, i.e. something like df2[df1], so that the rows from df1 are used as search keys, and then you can use mult to remove duplicates.
Here's one way to do what you want with a non-equi join:
setkey(df1, departure)
setkey(df2, departure)
df1[, max_departure := departure + as.difftime(1, units = "hours")
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
We order by departure
(via setkey
) so that mult = "first"
returns the closest match in the future within what's allowed.
The intermediate column max_departure
has to be assigned and subsequently removed because non-equi joins can only use existing columns.
Also note that the syntax used is taken from this answer (the version with .SD instead of df1 doesn't work in this case; I don't know why).
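Since the question's df1/df2 aren't reproduced here, the non-equi join above can be sketched self-contained with made-up departures (one stop, two scheduled and two observed times; column names follow the answer):

```r
library(data.table)

df1 <- data.table(stop_id = 1L,
                  departure = as.POSIXct(c("2020-01-01 10:00:00",
                                           "2020-01-01 12:00:00"), tz = "UTC"))
df2 <- data.table(stop_id = 1L,
                  departure = as.POSIXct(c("2020-01-01 10:05:00",
                                           "2020-01-01 10:20:00"), tz = "UTC"))
setkey(df1, departure)
setkey(df2, departure)

df1[, max_departure := departure + as.difftime(1, units = "hours")
    ][, observed_departure := df2[df1,
                                  x.departure,
                                  on = .(stop_id, departure >= departure,
                                         departure <= max_departure),
                                  mult = "first"]
    ][, max_departure := NULL]
# 10:00 picks up 10:05 (first observed within the hour); 12:00 has no match (NA).
```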
EDIT: based on the comments, it occurs to me that when you say "duplicated", you might be referring to something different. Say you have planned departures at 10 and 10:30, but the one at 10 never takes place, and an observed departure is 10:31. Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30, and cannot be used for the one at 10? If that's the case, perhaps this will work:
setkey(df1, departure)
setkey(df2, departure)
max_dep <- function(departure) {
max_departure <- departure + as.difftime(1, units = "hours")
next_departure <- shift(departure,
fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
type = "lead")
invalid_max <- max_departure >= next_departure
max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
max_departure
}
df1[, max_departure := max_dep(departure), by = "stop_id"
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
The max_dep helper checks, for each stop and scheduled departure, what the next scheduled departure would be, and sets max_departure to "next minus 1 second" if the next departure is within one hour. The other solution wouldn't work for this because, as long as an observed departure falls within one hour of the scheduled one, it is a valid option; in my example that means 10:31 would be valid for both 10:30 and 10.
R data.table rolling join mult not working as expected
When running dt2[dt1, roll = "nearest"] you are basically saying "return the rows from dt2 according to the nearest join to each row in dt1 using the key". So
- dates2 in row one in dt2 is the nearest to dates1 in row one in dt1
- dates2 in row one in dt2 is the nearest to dates1 in row two in dt1
- dates2 in row two in dt2 is the nearest to dates1 in row three in dt1
Hence,
dt2[dt1, roll = "nearest"]
# dates2 values2 values1
# 1: 2015-10-26 12:00:00 A a
# 2: 2015-10-26 13:00:00 A b
# 3: 2015-10-26 14:00:00 C c
Which are all the rows from dt1
with the joined values2
from dt2
.
Instead, we want to join the other way around, namely "extract values2 from dt2 according to the nearest join for each row in dt2 using the key, and update the matched rows in dt1", namely
dt1[dt2, roll = "nearest", values2 := i.values2]
dt1
# dates1 values1 values2
# 1: 2015-10-26 12:00:00 a A
# 2: 2015-10-26 13:00:00 b NA
# 3: 2015-10-26 14:00:00 c C
Some additional notes
- You don't need to wrap first to data.frame and then to data.table; you can just do dt1 <- data.table(dates1, values1), etc.
- While you are at it, you can also set the key on the fly using the key parameter of data.table, namely dt1 <- data.table(dates1, values1, key = "dates1"), etc.
- Or you can skip setting keys altogether and use on instead (v1.9.6+), namely dt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
- Finally, please refrain from making unnecessary copies: e.g., instead of <- and data.table(df), use := and setDT(df); see here for more information
Roll join with no duplicates in R
Here's what I ended up doing, I don't think it's awesome but as far as I can see, it is working as expected.
q1$joint_time <- q1$datetime
q2$joint_time <- q2$timepoint
# create a sample id using the key since the data is grouped
q2$sample_id <- paste0(q2$key, as.character(1:nrow(q2)))
# Join
res <- q2[q1, roll="nearest"]
# blank duplicated matches with NA (mutate_at comes from dplyr)
res %>% mutate_at(vars(y, timepoint), ~ifelse(duplicated(sample_id), NA, .))
Which produces
key timepoint y joint_time sample_id datetime x
1: A 10 10.1 0 A1 0 1.1
2: A 40 10.2 30 A2 30 1.2
3: A 110 10.4 90 A3 90 1.3
4: A NA NA 120 A3 120 1.4
5: B 30 -50.0 15 B4 15 -5.0
6: B NA NA 45 B4 45 -3.0
7: B 90 -30.0 75 B5 75 -3.5
8: C 5 0.0 10 C6 10 0.0
9: C 35 -10.4 40 C7 40 -1.4
10: C 76 -10.0 70 C8 70 -1.0
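The same idea works in pure data.table (so dplyr's mutate_at isn't needed). A tiny made-up example where one q2 row is the nearest match for two q1 rows:

```r
library(data.table)

q1 <- data.table(key = "A", datetime = c(0, 30), x = c(1.1, 1.2))
q2 <- data.table(key = "A", timepoint = 10, y = 10.1)
q1[, joint_time := datetime]
q2[, joint_time := timepoint]
setkey(q1, key, joint_time)
setkey(q2, key, joint_time)
# sample id per q2 row, as in the answer above
q2[, sample_id := paste0(key, seq_len(.N))]

res <- q2[q1, roll = "nearest"]
# Blank out repeated matches by reference instead of mutate_at:
res[duplicated(sample_id), c("y", "timepoint") := list(NA, NA)]
```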
Merging two sets of data by data.table roll='nearest' function
Here is a step-by-step example based on the sample data you give:
# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)
# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]
unique(set_merged[order(ID_b)], by = "time")
# ID_b b1 b2 source time_b time ID_a a1 a2 a3
# 1: 2 34.2 15.114 set1.csv.1 20.35750 20.01000 8 85640 5274.1 301.6041
# 2: 7 67.2 16.114 set1.csv.2 21.35778 21.00972 7 85697 5345.2 301.6043
# 3: 12 12.2 33.114 set1.csv.3 22.35806 22.00972 4 65694 9375.2 301.6049
# 4: 17 73.2 67.114 set2.csv.1 23.35833 23.00972 3 85694 9278.9 301.6051
# 5: 23 88.2 42.114 set2.csv.2 19.35861 19.00972 5 85653 4375.5 301.6047
# 6: 28 90.2 52.114 set3.csv.1 0.35889 0.00944 2 35694 5245.2 301.6053
# time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6: 0.00944
Two comments:
- We create a new time column to avoid losing one of the original time columns from set_A and set_B. You can always remove the time column after the join if required.
- We use unique to remove duplicated time rows by order of ID_b. You mention in your post that "it doesn't really matter which row will be merged", but in case you do want to retain specific rows, you might need to adjust this line of code.
Update (thanks to @Henrik)
As @Henrik pointed out, what you're after is actually a rolling join of set_A
with respect to set_B
, in which case you don't need to deal with the duplicate rows.
That translates to
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
# ID_a a1 a2 a3 time_a time ID_b b1 b2 source
#1: 2 35694 5245.2 301.6053 0.00944 0.35889 28 90.2 52.114 set3.csv.1
#2: 3 85694 9278.9 301.6051 23.00972 23.35833 17 73.2 67.114 set2.csv.1
#3: 5 85653 4375.5 301.6047 19.00972 19.35861 23 88.2 42.114 set2.csv.2
#4: 6 12694 5236.3 301.6045 22.00972 22.35806 12 12.2 33.114 set1.csv.3
#5: 7 85697 5345.2 301.6043 21.00972 21.35778 7 67.2 16.114 set1.csv.2
#6: 9 30694 5279.0 301.6039 20.01000 20.35750 2 34.2 15.114 set1.csv.1
# time_b
#1: 0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
Sample data
set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)
set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)
Rolling join with group ID in both tables
We can use a rolling join with data.table
library(data.table)
table_to_join[, date_ad := date_sale][original_table,
on = .(article, date_sale = date_ad), roll = -Inf]
# article date_sale date_ad
#1: A 2010-04-09 2010-12-15
#2: A 2011-07-12 2012-08-20
#3: B 2015-04-13 2016-01-05
#4: B 2016-08-12 2017-01-20
I think OP was confused with the naming of the output columns. Using an update by reference should be clearer:
table_to_join[, date_ad := date_sale]
original_table[, date_sale :=
table_to_join[.SD, on=.(article, date_ad), roll=-Inf, x.date_sale]
]
output:
article date_sale date_ad
1: A 2010-04-09 2010-12-15
2: A 2011-07-12 2012-08-20
3: A 2012-05-22 2012-08-20
4: B 2011-07-12 2013-12-01
5: B 2014-02-02 2016-01-05
6: B 2015-04-13 2016-01-05
7: B 2016-08-12 2017-01-20
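The general roll = -Inf behaviour (NOCB: the next observation in x is carried backward to match) can be sketched with toy data; the article/date column names below are illustrative, modeled on the answer's columns:

```r
library(data.table)

sales <- data.table(article = "A",
                    date_sale = as.IDate(c("2020-01-10", "2020-03-05")))
ads   <- data.table(article = "A",
                    date_ad  = as.IDate(c("2020-01-01", "2020-02-01")))

# Copy the join column so the original date_sale survives as x.date_sale.
sales[, date_match := date_sale]
res <- sales[ads, on = .(article, date_match = date_ad), roll = -Inf,
             .(article, date_ad = date_match, next_sale = x.date_sale)]
res
#    article    date_ad  next_sale
# 1:       A 2020-01-01 2020-01-10
# 2:       A 2020-02-01 2020-03-05
```

Each ad date is matched to the next sale on or after it, which is what roll = -Inf buys you over the LOCF default.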
Rolling join in data.table is using wrong time zone
I'm not sure what your exact problem is, but the rolling update join below seems to work fine when I look at your sample data sets...
NHflowDT <- setDT(NHflow)
NHchemDT <- setDT(NHchem)
NHchemDT[, c("nearest_flow_time", "Q.cms") :=
NHflowDT[NHchemDT, .(x.DateTime, x.Q.cms),
on = .(DateTime = SampleDateTime),
roll = "nearest"] ][]
output
SampleDateTime Parameter Value flow.EST flow.GMT nearest_flow_time Q.cms
1: 2012-03-07 15:15:00 HgD 11.83 1.734292 1.729082 2012-03-07 15:15:00 1.730031
2: 2012-03-07 07:57:00 HgD 9.09 1.730299 1.855115 2012-03-07 08:00:00 1.855263
3: 2012-03-07 09:00:00 HgD 8.98 1.711665 1.818062 2012-03-07 09:00:00 1.818062
4: 2012-03-07 08:31:00 HgD 10.91 1.721448 1.838623 2012-03-07 08:30:00 1.838623
5: 2012-03-07 08:00:00 HgD 10.91 1.729299 1.855263 2012-03-07 08:00:00 1.855263
> attributes( NHchemDT$SampleDateTime)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"
> attributes( NHchemDT$nearest_flow_time)
$class
[1] "POSIXct" "POSIXt"
$tzone
[1] "EST"
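One thing worth checking in such cases: POSIXct stores an absolute instant (seconds since the epoch), and the tzone attribute only affects printing, so a join on POSIXct columns compares instants regardless of display time zone. A quick sketch:

```r
t_est <- as.POSIXct("2012-03-07 08:00:00", tz = "EST")
t_gmt <- t_est
attr(t_gmt, "tzone") <- "GMT"   # same instant, different display

as.numeric(t_est) == as.numeric(t_gmt)   # TRUE: identical instant
format(t_est, "%H:%M")                   # "08:00"
format(t_gmt, "%H:%M")                   # "13:00" (EST is UTC-5)
```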