Rolling join grouped by a second variable in data.table
You were almost there.
You can join on multiple columns simultaneously. So, in addition to "Date", you can include "Field" in the on
clause. But please note the description of the roll
argument in ?data.table
:
Rolling joins apply to the last join column
Thus, for "Date" to be used for the rolling join, specify it as the last variable in on
:
library(data.table)
d1[d2, roll = "nearest", on = .(Field, Date)]
For better verification, the result can be ordered
d1[d2, roll = "nearest", on = .(Field, Date)][order(Field, Date)]
Field Date NlbsAcre TotN
1: 12S 2016-05-24 NA 208.62194
2: 12S 2016-05-27 NA 172.57658
3: 12S 2016-07-31 NA 318.97092
4: 12S 2016-08-18 NA 428.54011
5: 12S 2016-08-29 NA 393.81545
6: 12S 2017-03-13 44728.184 145.15091
7: 12S 2017-03-16 44728.184 128.14334
8: 12S 2017-08-01 12621.083 132.72365
9: 12S 2017-08-04 12621.083 422.63032
10: 12S 2017-08-14 12621.083 337.91388
11: 12S 2017-10-04 22162.203 692.15276
12: 19-1 2016-05-01 12630.923 476.17492
13: 19-1 2016-08-15 12630.923 110.70600
14: 19-1 2016-09-10 12630.923 215.88105
15: 19-1 2016-09-19 12630.923 224.68906
16: 19-1 2016-12-16 12630.923 338.59349
17: 19-1 2017-01-13 12630.923 305.35394
18: 19-1 2017-03-27 12630.923 435.04925
19: 19-1 2017-05-30 12630.923 818.80997
20: 6 2016-05-05 NA 102.53240
21: 6 2016-06-14 NA 149.06045
22: 6 2016-06-29 NA 125.82803
23: 6 2016-06-29 NA 125.82803
24: 6 2016-07-11 NA 79.24480
25: 6 2016-07-25 NA 62.24449
26: 6 2016-08-25 NA 75.77014
27: 6 2017-01-03 2014.772 47.49660
28: 6 2017-01-12 2014.772 45.53730
29: 6 2017-01-17 2014.772 43.92222
30: 6 2017-02-11 3082.318 21.96791
31: 6 2017-03-19 2477.083 21.39367
32: 6 2017-04-17 2427.536 79.03807
33: 6 2017-07-12 NA 103.52417
34: 6 2017-07-17 NA 65.53112
35: 6 2017-09-06 NA 47.40618
36: 7 2016-06-02 NA 147.49353
37: 7 2016-07-11 NA 59.26973
38: 7 2016-08-04 NA 72.62146
39: 7 2016-08-30 NA 58.27003
40: 7 2016-08-30 NA 58.27003
41: 7 2016-10-30 NA 73.88811
42: 7 2017-02-11 2279.609 21.07551
43: 7 2017-02-22 2279.609 19.92023
44: 7 2017-03-19 15842.916 31.71433
45: 7 2017-05-17 NA 44.96872
46: 7 2017-07-17 NA 58.53364
47: W62 2016-05-05 16764.975 96.72854
48: W62 2016-05-31 16764.975 72.96954
49: W62 2016-08-31 16764.975 86.33588
50: W62 2016-12-05 16764.975 94.19370
51: W62 2017-01-02 18874.656 119.39040
52: W62 2017-02-22 18874.656 75.46591
Field Date NlbsAcre TotN
Rolling join in Data.Table by two variables without creating duplicates
First of all, you could use unique
instead of distinct
(the latter presumably from dplyr
; you don't specify)
to avoid coercing the data table to a data frame.
You were pretty close,
but you need to switch the tables in the join,
i.e. something like df2[df1]
,
so that the rows from df1
are used as search keys,
and then you can use mult
to remove duplicates.
Here's one way to do what you want with a non-equi join:
setkey(df1, departure)
setkey(df2, departure)
df1[, max_departure := departure + as.difftime(1, units = "hours")
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
We order by departure
(via setkey
) so that mult = "first"
returns the closest match in the future within what's allowed.
The intermediate column max_departure
has to be assigned and subsequently removed because non-equi joins can only use existing columns.
Also note that the syntax used takes from this answer
(the version with .SD
instead of df1
doesn't work in this case,
I don't know why).
EDIT: based on the comments,
it occurs to me that when you say "duplicated",
you might be referring to something different.
Say you have planned departures at 10 and 10:30,
but the one at 10 never takes place,
and an observed departure is 10:31.
Perhaps you mean that 10:31 is the observed departure for the one scheduled at 10:30,
and cannot be used for the one at 10?
If that's the case,
perhaps this will work:
setkey(df1, departure)
setkey(df2, departure)
max_dep <- function(departure) {
max_departure <- departure + as.difftime(1, units = "hours")
next_departure <- shift(departure,
fill = max_departure[length(max_departure)] + as.difftime(1, units = "secs"),
type = "lead")
invalid_max <- max_departure >= next_departure
max_departure[invalid_max] <- next_departure[invalid_max] - as.difftime(1, units = "secs")
max_departure
}
df1[, max_departure := max_dep(departure), by = "stop_id"
][, observed_departure := df2[df1,
x.departure,
on = .(stop_id, departure >= departure, departure <= max_departure),
mult = "first"]
][, max_departure := NULL]
The max_dep
helper checks,
for each stop and scheduled departure,
what would be the next scheduled departure,
and sets max_departure
as "next minus 1 second" if the next departure is within one hour.
The other solution wouldn't work for this because,
as long as an observed departure falls within one hour of the scheduled one,
it is a valid option.
In my example that means 10:31 would be valid for both 10:30 and 10.
Rolling join two data.tables with date in R
We may use non-equi join
dt1[dt2, date_2 := date2, on = .(group, date1 > date2), mult = "first"]
R rolling join two data.tables with error margin on join
A data.table
answer has been given here by user Uwe:
https://stackoverflow.com/a/62321710/12079387
Merging two sets of data by data.table roll='nearest' function
Here is a step-by-step example based on the sample data you give:
# Sample data
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
setkey(set_A, time)
setkey(set_B, time)
# Rolling join by nearest time
set_merged <- set_B[set_A, roll = "nearest"]
unique(set_merged[order(ID_b)], by = "time")
# ID_b b1 b2 source time_b time ID_a a1 a2 a3
# 1: 2 34.2 15.114 set1.csv.1 20.35750 20.01000 8 85640 5274.1 301.6041
# 2: 7 67.2 16.114 set1.csv.2 21.35778 21.00972 7 85697 5345.2 301.6043
# 3: 12 12.2 33.114 set1.csv.3 22.35806 22.00972 4 65694 9375.2 301.6049
# 4: 17 73.2 67.114 set2.csv.1 23.35833 23.00972 3 85694 9278.9 301.6051
# 5: 23 88.2 42.114 set2.csv.2 19.35861 19.00972 5 85653 4375.5 301.6047
# 6: 28 90.2 52.114 set3.csv.1 0.35889 0.00944 2 35694 5245.2 301.6053
# time_a
# 1: 20.01000
# 2: 21.00972
# 3: 22.00972
# 4: 23.00972
# 5: 19.00972
# 6: 0.00944
Two comments:
- We create a new
time
column to avoid losing one of the original time columns fromset_A
andset_B
. You can always remove thetime
column after the join if required. - We use
unique
to remove duplicatedtime
rows by order ofID_b
. You mention in your post that "it doesn't really matter which row will be merged" but in case that you do want to retain specific rows, you might need to adjust this line of code.
Update (thanks to @Henrik)
As @Henrik pointed out, what you're after is actually a rolling join of set_A
with respect to set_B
, in which case you don't need to deal with the duplicate rows.
That translates to
library(data.table)
setDT(set_A)
setDT(set_B)
# Create time column by which to do a rolling join
set_A[, time := time_a]
set_B[, time := time_b]
set_A[set_B, on = "time", roll = "nearest"][order(ID_a)]
# ID_a a1 a2 a3 time_a time ID_b b1 b2 source
#1: 2 35694 5245.2 301.6053 0.00944 0.35889 28 90.2 52.114 set3.csv.1
#2: 3 85694 9278.9 301.6051 23.00972 23.35833 17 73.2 67.114 set2.csv.1
#3: 5 85653 4375.5 301.6047 19.00972 19.35861 23 88.2 42.114 set2.csv.2
#4: 6 12694 5236.3 301.6045 22.00972 22.35806 12 12.2 33.114 set1.csv.3
#5: 7 85697 5345.2 301.6043 21.00972 21.35778 7 67.2 16.114 set1.csv.2
#6: 9 30694 5279.0 301.6039 20.01000 20.35750 2 34.2 15.114 set1.csv.1
# time_b
#1: 0.35889
#2: 23.35833
#3: 19.35861
#4: 22.35806
#5: 21.35778
#6: 20.35750
Sample data
set_A <- read.table(text =
"ID_a a1 a2 a3 time_a
2 35694 5245.2 301.6053 00.00944
3 85694 9278.9 301.6051 23.00972
4 65694 9375.2 301.6049 22.00972
5 85653 4375.5 301.6047 19.00972
6 12694 5236.3 301.6045 22.00972
7 85697 5345.2 301.6043 21.00972
8 85640 5274.1 301.6041 20.01000
9 30694 5279.0 301.6039 20.01000", header = T)
set_B <- read.table(text =
"ID_b b1 b2 source time_b
2 34.20 15.114 set1.csv.1 20.35750
7 67.20 16.114 set1.csv.2 21.35778
12 12.20 33.114 set1.csv.3 22.35806
17 73.20 67.114 set2.csv.1 23.35833
23 88.20 42.114 set2.csv.2 19.35861
28 90.20 52.114 set3.csv.1 00.35889", header = T)
R data.table rolling join mult not working as expected
When running dt2[dt1, roll = "nearest"]
you are basically saying "return the rows from dt2
according to the nearest join to each row in dt1
using the key. So
dates2
in row one indt2
is the nearest todates1
in row one indt1
dates2
in row one indt2
is the nearest todates1
in row two indt1
dates2
in row two indt2
is the nearest todates1
in row three indt1
Hence,
dt2[dt1, roll = "nearest"]
# dates2 values2 values1
# 1: 2015-10-26 12:00:00 A a
# 2: 2015-10-26 13:00:00 A b
# 3: 2015-10-26 14:00:00 C c
Which are all the rows from dt1
with the joined values2
from dt2
.
Instead, we want to join the other way around, namely "extract values2
from dt2
according to the nearest join by each row in dt2
using the key and update the matched rows in dt1
", namely
dt1[dt2, roll = "nearest", values2 := i.values2]
dt1
# dates1 values1 values2
# 1: 2015-10-26 12:00:00 a A
# 2: 2015-10-26 13:00:00 b NA
# 3: 2015-10-26 14:00:00 c C
Some additional notes
- You don't need to wrap first to
data.frame
and then todata.table
, you can just dodt1 <- data.table(dates1, values1)
and etc. - While you at it, you can already set the key on the fly using
key
parameterdata.table
, namelydt1 <- data.table(dates1, values1, key = "dates1")
and etc. - Or you can skip setting keys all together and use
on
instead (V 1.9.6+), namelydt1[dt2, roll = "nearest", values2 := i.values2, on = c(dates1 = "dates2")]
- Finally, please refrain from making unnecessary copies, e.g., instead of
<-
anddata.table(df)
use:=
andsetDT(df)
, see here for more information
Nearest n rolling join in R data table
Here is something very raw (we go row by row):
n <- 2L
sen <- 1L:n
for (i in 1:nrow(dt1)) {
set(dt1, i, j = "nearest", list(which(frank(abs(dt1$x[i] - dt2$x)) %in% sen)))
}
dt1[, .(id1, nearest = unlist(nearest)), by = x
][, id2 := dt2$id2[nearest]
][, roll := paste0("nr", frank(abs(dt2$x[nearest] - x))), by = x][]
# x id1 nearest id2 roll
# 1: 15 x 1 a nr1
# 2: 15 x 2 b nr2
# 3: 101 y 2 b nr2
# 4: 101 y 3 c nr1
Slightly cleaner:
dt1[,
{
nrank <- frank(abs(x - dt2$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt2$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1] # assumes unique ids.
Data:
dt1 <- data.table(x = c(15, 101), id1 = c("x", "y"))
dt2 <- data.table(x = c(10, 50, 100, 200), id2 = c("a", "b", "c", "d"))
EDIT (as suggested/written by OP)
Joining with multiple keys:
dt1[,
{
g <- group
dt_tmp <- dt2[dt2$group == g]
nrank <- frank(abs(x - dt_tmp$x), ties.method="first")
nearest <- which(nrank %in% sen)
.(x = x, id2 = dt_tmp$id2[nearest], roll = paste0("nr", nrank[nearest]))
},
by = id1]
Unable to perform rolling join on time and another variable in data.table
This seemed to solve the problem
test = df2[df1, on=.(bike_id, time = time), roll="nearest",
.(bike_id, time, st_x, st_y)]
rolling joins data.table in R
That quote from the documentation appears to be from FAQ 1.12 What is the difference between X[Y] and merge(X,Y). Did you find the following in ?data.table
and does it help?
roll
Applies to the last join column, generally a date but can be any
ordered variable, irregular and including gaps. If roll=TRUE and i's
row matches to all but the last x join column, and its value in the
last i join column falls in a gap (including after the last
observation in x for that group), then the prevailing value in x is
rolled forward. This operation is particularly fast using a modified
binary search. The operation is also known as last observation carried
forward (LOCF). Usually, there should be no duplicates in x's key, the
last key column is a date (or time, or datetime) and all the columns
of x's key are joined to. A common idiom is to select a
contemporaneous regular time series (dts) across a set of identifiers
(ids): DT[CJ(ids,dts),roll=TRUE] where DT has a 2-column key (id,date)
and CJ stands for cross join.
rolltolast
Like roll but the data is not rolled forward past the last
observation within each group defined by the join columns. The value
of i must fall in a gap in x but not after the end of the data, for
that group defined by all but the last join column. roll and
rolltolast may not both be TRUE.
In terms of left/right analogies to SQL joins, I prefer to think about that in the context of FAQ 2.14 Can you explain further why data.table is inspired by A[B] syntax
in base. That's quite a long answer so I won't paste it here.
Related Topics
R Converting from Datetime to Date
Split or Separate Uneven/Unequal Strings with No Delimiter
Write.Table Writes Unwanted Leading Empty Column to Header When Has Rownames
How to Label Histogram Bars with Data Values or Percents in R
Can Not Connect Postgresql(Over Ssl) with Rpostgresql on Windows
How to Use Tidyr to Fill in Completed Rows Within Each Value of a Grouping Variable
Lm(): What Is Qraux Returned by Qr Decomposition in Linpack/Lapack
Indexing Integer Vector with Na
Ggplot2: Creating Themed Title, Subtitle with Cowplot
R Ggplot2: Using Stat_Summary (Mean) and Logarithmic Scale
Removing One Table from Another in R
Developing Shiny App as a Package and Deploying It to Shiny Server