Join R Data.Tables Where Key Values Are Not Exactly Equal--Combine Rows with Closest Times

Join R data.tables where key values are not exactly equal--combine rows with closest times

Another option may be roll='nearest' (new in v1.8.8 on CRAN).
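
For reference, the example tables can be reconstructed like this (a minimal sketch; the original answer shows only the printed tables):

library(data.table)
# rebuild DT1 and DT2 to match the printed output below
DT1 <- data.table(x = rep(c("a", "b", "c"), each = 3),
                  time = rep(c(10, 30, 60), times = 3),
                  v = 1:9)
DT2 <- data.table(x = c("a", "b", "c"),
                  time = c(17, 54, 3))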

> setkey(DT1, x, time)
> DT1
   x time v
1: a   10 1
2: a   30 2
3: a   60 3
4: b   10 4
5: b   30 5
6: b   60 6
7: c   10 7
8: c   30 8
9: c   60 9
> DT2
   x time
1: a   17
2: b   54
3: c    3
> DT1[DT2, roll = "nearest"]
   x time v
1: a   17 1
2: b   54 6
3: c    3 7

Note that 17 is closer to 10 than to 30, hence v = 1 in the first row.

If you need to roll to the next observation instead (next observation carried backward):

> DT1[DT2, roll = -Inf]
   x time v
1: a   17 2
2: b   54 6
3: c    3 7
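
For comparison, a forward roll (roll = Inf, last observation carried forward) would take the closest earlier time instead; since c has no observation at or before time 3, v comes back NA there (expected output, assuming the tables above):

> DT1[DT2, roll = Inf]
   x time  v
1: a   17  1
2: b   54  5
3: c    3 NA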

Join two data.tables on date, with closest date in table 1 strictly less than date in second table

Rolling join with a -1L offset: joining onto date - 1L and rolling forward guarantees that the matched date is strictly less than the original date.

Update 2016-04-02: with this commit in current devel, v1.9.7, this can be done without creating a temporary column in Dt1 (Dt1 and Dt2 are defined in the original answer below). From NEWS:

x's columns can be referred to in j using the prefix x. at all times. This is particularly useful when it is necessary to refer to one of x's columns that is also a join column. This is a patch addressing #1615.

Dt2[, jndate := date - 1L]
Dt1[Dt2,
    .(date = i.date, orgdate = x.date, x),
    on = c("date" = "jndate"),
    roll = Inf]
#          date    orgdate  x
# 1: 2010-01-26 2010-01-25  9
# 2: 2010-01-23 2010-01-22  7
# 3: 2010-01-20 2010-01-19 11

Original answer, useful if you are on 1.9.6 or older.

library(data.table)

# data
Dt1 = fread("date, x
1/26/2010, 10
1/25/2010, 9
1/24/2010, 9
1/22/2010, 7
1/19/2010, 11")[, date := as.IDate(date, format = "%m/%d/%Y")][]
Dt2 = fread("date
1/26/2010
1/23/2010
1/20/2010")[, date := as.IDate(date, format = "%m/%d/%Y")][]

# solution
offset.roll.join = function(Dt1, Dt2){
  Dt2[, jndate := date - 1L]            # produce join column with -1 day offset
  Dt1[, orgdate := date]                # should not be needed after data.table#1615
  on.exit({Dt2[, jndate := NULL]; Dt1[, orgdate := NULL]})  # clean up helper columns on exit
  Dt1[Dt2, .(date = i.date, orgdate, x), on = c("date" = "jndate"), roll = Inf]  # do rolling join
}
offset.roll.join(Dt1, Dt2)
#          date    orgdate  x
# 1: 2010-01-26 2010-01-25  9
# 2: 2010-01-23 2010-01-22  7
# 3: 2010-01-20 2010-01-19 11

Data.table: Join on ID and Date key, but want closest date before (or equal to) date key in first table

It seems like you are looking for a simple rolling join here. First, we will convert to data.table objects (please note I'm using the latest version on CRAN for this solution, v1.9.6+).

library(data.table) # V 1.9.6+
setDT(customer.table)
setDT(activity.table)

Then, for each row in customer.table we will join the closest preceding value from activity.table by rolling forward (roll = Inf):

indx <- activity.table[customer.table,
                       on = c(CustomerID = "CustomerID",
                              ActivityDate = "AsOfDate"),
                       roll = Inf,
                       which = TRUE]

indx
# [1] 51 19 48 52 49 44 35 36 45 34 5

indx is a vector of row locations in activity.table whose dates are closest to (at or before) the AsOfDate in each row of customer.table.

Now all that is left is to join back to customer.table:

customer.table[, MostRecentDate := activity.table[indx, ActivityDate]]
customer.table
#     CustomerID   AsOfDate   distance      MostRecentDate
#  1:          1 2015-06-30   2.173805 2015-05-26 00:00:00
#  2:          2 2015-06-30  29.402483 2014-05-13 00:00:00
#  3:          3 2015-06-30   3.013533 2012-05-30 16:10:00
#  4:          4 2015-06-30  18.492314 2015-06-10 00:00:00
#  5:          5 2015-06-30 294.878607 2015-04-29 00:00:00
#  6:          6 2015-06-30  11.887021 2015-04-29 00:00:00
#  7:          7 2015-06-30   9.544386 2015-06-03 00:00:00
#  8:          8 2015-06-30  24.219203 2015-06-03 00:00:00
#  9:          9 2015-06-30  15.006934 2015-04-29 00:00:00
# 10:         10 2015-06-30  10.451366 2015-06-01 00:00:00
# 11:          4 2014-03-07  18.492314 2013-06-10 00:00:00
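
The same pattern on a self-contained toy example (hypothetical data, not the asker's tables):

library(data.table)
# hypothetical toy data illustrating the rolling-join-with-which pattern
activity.toy <- data.table(CustomerID   = c(1L, 1L, 2L),
                           ActivityDate = as.IDate(c("2015-01-05", "2015-03-01", "2015-02-10")))
customer.toy <- data.table(CustomerID = c(1L, 2L),
                           AsOfDate   = as.IDate(c("2015-02-01", "2015-06-30")))
idx <- activity.toy[customer.toy,
                    on = c(CustomerID = "CustomerID", ActivityDate = "AsOfDate"),
                    roll = Inf, which = TRUE]
idx
# [1] 1 3   -- the latest activity row on or before each AsOfDate
customer.toy[, MostRecentDate := activity.toy[idx, ActivityDate]]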

Left join two dataframes based on nearest timestamp in R

Here is a roundabout way using dplyr: join df2 to df1 by ID, keep for each (ID, date2) pair the df1 row with the smallest absolute time difference, then left-join that back onto df1 and flag unmatched rows as "No".

library(dplyr)
df1 %>%
  left_join(df2 %>%
              left_join(df1) %>%
              mutate(date_diff = abs(date2 - date1)) %>%
              group_by(ID, date2) %>%
              filter(date_diff == min(date_diff)) %>%
              ungroup() %>%
              select(-date2, -date_diff)) %>%
  mutate(Flag = case_when(is.na(Flag) ~ "No",
                          TRUE ~ Flag))
Joining, by = "ID"
Joining, by = c("ID", "date1")
# A tibble: 7 x 3
     ID date1               Flag
  <dbl> <dttm>              <chr>
1     1 2020-07-11 19:14:23 Yes
2     1 2020-07-21 13:11:10 Yes
3     1 2020-07-21 18:07:25 No
4     1 2020-07-28 18:18:11 No
5     2 2020-07-13 16:47:26 Yes
6     2 2020-07-18 17:11:37 Yes
7     3 2020-07-23 10:39:19 No

data.table: different behaviour with keys after merge and X[Y] joins

It looks to me like ?merge.data.table documents this for its sort argument:

If TRUE (default), the merged data.table is sorted by setting the key to the by / by.x columns. If FALSE, the result is not sorted.
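
A minimal sketch of that documented behaviour (hypothetical tables; the expected key() results follow from the documentation above):

library(data.table)
X <- data.table(id = c("b", "a"), v = 1:2)
Y <- data.table(id = c("b", "a"), w = 3:4)
key(merge(X, Y, by = "id"))                # "id" -- result sorted by id, key set
key(merge(X, Y, by = "id", sort = FALSE))  # NULL -- original row order kept, no key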

Conditional merging with many variables

We may use mget to do this: create the new column names as strings with paste0, join on 'ID' after subsetting 'DT2' by type, and assign the output of mget (a list of vectors, one per i.-prefixed column) to create new columns in 'DT1'.

merge_variables = c("height", "weight", "V1", "V2") 
parent_nm <- paste0(merge_variables, "_parents")
child_nm <- paste0(merge_variables, "_children")

DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
DT1[DT2[type == "Children"], (child_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]

Output

> DT1
income ID height_parents weight_parents V1_parents V2_parents height_children weight_children V1_children V2_children
1: 0.26550866 1 0.20597457 0.4820801 0.47761962 0.6781459 0.1765568 0.5995658 0.86120948 1.6788807
2: 0.37212390 2 0.68702285 0.4935413 0.43809711 0.6933670 0.3841037 0.1862176 0.24479728 0.6675499
3: 0.57285336 3 0.76984142 0.8273733 0.07067905 0.9527025 0.4976992 0.6684667 0.09946616 1.7843967
4: 0.90820779 4 0.71761851 0.7942399 0.31627171 1.7286789 0.9919061 0.1079436 0.51863426 0.7799791
5: 0.20168193 5 0.38003518 0.7237109 0.66200508 1.5546414 0.7774452 0.4112744 0.40683019 1.9212360
6: 0.89838968 6 0.93470523 0.8209463 0.91287592 0.8693190 0.2121425 0.6470602 0.29360337 1.4250294
7: 0.94467527 7 0.65167377 0.7829328 0.45906573 0.7999887 0.1255551 0.5530363 0.33239467 0.6507043
8: 0.66079779 8 0.26722067 0.5297196 0.65087047 1.5141743 0.3861141 0.7893562 0.25801678 0.4053845
9: 0.62911404 9 0.01339033 0.0233312 0.47854525 1.4222424 0.3823880 0.4772301 0.76631067 0.2433838
10: 0.06178627 10 0.86969085 0.7323137 0.08424691 0.4909770 0.3403490 0.6927316 0.87532133 0.2866088

Or another option is to dcast 'DT2' to wide format and do a join (but here we are not assigning with :=, so it won't update the original DT1; see the sketch after the output for a by-reference variant):

DT1[dcast(DT2, ID ~ type, value.var = merge_variables), on = .(ID)]
income ID height_Children height_Parents weight_Children weight_Parents V1_Children V1_Parents V2_Children V2_Parents
1: 0.26550866 1 0.1765568 0.20597457 0.5995658 0.4820801 0.86120948 0.47761962 1.6788807 0.6781459
2: 0.37212390 2 0.3841037 0.68702285 0.1862176 0.4935413 0.24479728 0.43809711 0.6675499 0.6933670
3: 0.57285336 3 0.4976992 0.76984142 0.6684667 0.8273733 0.09946616 0.07067905 1.7843967 0.9527025
4: 0.90820779 4 0.9919061 0.71761851 0.1079436 0.7942399 0.51863426 0.31627171 0.7799791 1.7286789
5: 0.20168193 5 0.7774452 0.38003518 0.4112744 0.7237109 0.40683019 0.66200508 1.9212360 1.5546414
6: 0.89838968 6 0.2121425 0.93470523 0.6470602 0.8209463 0.29360337 0.91287592 1.4250294 0.8693190
7: 0.94467527 7 0.1255551 0.65167377 0.5530363 0.7829328 0.33239467 0.45906573 0.6507043 0.7999887
8: 0.66079779 8 0.3861141 0.26722067 0.7893562 0.5297196 0.25801678 0.65087047 0.4053845 1.5141743
9: 0.62911404 9 0.3823880 0.01339033 0.4772301 0.0233312 0.76631067 0.47854525 0.2433838 1.4222424
10: 0.06178627 10 0.3403490 0.86969085 0.6927316 0.7323137 0.87532133 0.08424691 0.2866088 0.4909770
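
If you do want the dcast route to update DT1 by reference as well, one possible sketch (reusing the objects above; cols holds the column names the dcast produces):

wide <- dcast(DT2, ID ~ type, value.var = merge_variables)
cols <- setdiff(names(wide), "ID")  # height_Children, height_Parents, ...
DT1[wide, on = .(ID), (cols) := mget(paste0("i.", cols))]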

Join datatables based on 2 values

You can use roll = "nearest" in joining dt1 and dt2:

library(data.table)

setkey(dt1, type1, value)
setkey(dt2, type1, value)

dt1[dt2, roll = "nearest"]

Output

    type1 type2 value
 1:     A     K   0.3
 2:     A     K   1.1
 3:     A     I   2.6
 4:     A     I   3.2
 5:     A     I   4.0
 6:     B     K   2.4
 7:     B     K   5.1
 8:     B     K   5.5
 9:     B     I   6.7
10:     B     I   9.0

Mapping and assigning values between columns of different lengths using R

We could do it with fuzzyjoin to handle near matches:

library(fuzzyjoin)
library(dplyr)

fuzzy_left_join(df, df2, by = c("time" = "time2"), match_fun = list(`<=`)) %>%
  group_by(time2) %>%
  slice(n()) %>%
  ungroup() %>%
  group_by(time) %>%
  filter(row_number() == 1) %>%
  select(id3 = id, time3 = time, score3 = score2)
    id3 time3 score3
  <dbl> <dbl>  <dbl>
1  4375     0  0.028
2  4375    88  0.057
3  4375    96  0.057
4  4375   114  0.085

