Join R data.tables where key values are not exactly equal--combine rows with closest times
Another option may be roll = "nearest" (new in v1.8.8 on CRAN).
> setkey(DT1, x, time)
> DT1
   x time v
1: a   10 1
2: a   30 2
3: a   60 3
4: b   10 4
5: b   30 5
6: b   60 6
7: c   10 7
8: c   30 8
9: c   60 9
> DT2
   x time
1: a   17
2: b   54
3: c    3
> DT1[DT2, roll = "nearest"]
   x time v
1: a   17 1
2: b   54 6
3: c    3 7
Note that 17 is closer to 10 than to 30, hence v = 1 in the first row.
If you need to roll to the next observation instead (next observation carried backward):
> DT1[DT2, roll = -Inf]
   x time v
1: a   17 2
2: b   54 6
3: c    3 7
Join two data.tables on date, with closest date in table 1 strictly less than date in second table
A rolling join with a -1L offset.
Update 2016-04-02: with this commit in current devel, v1.9.7, this can be done without creating a temporary column. From NEWS:

x's columns can be referred to in j using the prefix x. at all times. This is particularly useful when it is necessary to refer to x's column that is also a join column. This is a patch addressing #1615.
Dt2[, jndate := date - 1L]
Dt1[Dt2,
    .(date = i.date, orgdate = x.date, x),
    on = c("date" = "jndate"),
    roll = Inf]
#          date    orgdate  x
# 1: 2010-01-26 2010-01-25  9
# 2: 2010-01-23 2010-01-22  7
# 3: 2010-01-20 2010-01-19 11
Original answer, useful if you are on 1.9.6 or older.
library(data.table)
# data
Dt1 = fread("date x
1/26/2010, 10
1/25/2010, 9
1/24/2010, 9
1/22/2010, 7
1/19/2010, 11")[, date := as.IDate(date, format=("%m/%d/%Y"))][]
Dt2 = fread("date
1/26/2010
1/23/2010
1/20/2010")[, date := as.IDate(date, format=("%m/%d/%Y"))][]
# solution
offset.roll.join = function(Dt1, Dt2){
  Dt2[, jndate := date - 1L]  # produce join column with offset
  Dt1[, orgdate := date]      # should not be needed after data.table#1615
  on.exit({Dt2[, jndate := NULL]; Dt1[, orgdate := NULL]})  # clean up on exit
  Dt1[Dt2, .(date = i.date, orgdate, x), on = c("date" = "jndate"), roll = Inf]  # do rolling join
}
offset.roll.join(Dt1, Dt2)
#          date    orgdate  x
# 1: 2010-01-26 2010-01-25  9
# 2: 2010-01-23 2010-01-22  7
# 3: 2010-01-20 2010-01-19 11
Data.table: Join on ID and Date key, but want closest date before (or equal to) date key in first table
It seems like you are looking for a simple rolling join here. First, we will convert to data.table objects (note that I'm using the latest version on CRAN for this solution, v1.9.6+):
library(data.table) # V 1.9.6+
setDT(customer.table)
setDT(activity.table)
Then, for each row in customer.table, we will join the closest value from activity.table while rolling to infinity:
indx <- activity.table[customer.table,
                       on = c(CustomerID = "CustomerID",
                              ActivityDate = "AsOfDate"),
                       roll = Inf,
                       which = TRUE]
indx
# [1] 51 19 48 52 49 44 35 36 45 34 5
indx is the vector of row locations of the dates in activity.table that are closest to each row in customer.table.
Now, all that is left is to join back to customer.table:
customer.table[, MostRecentDate := activity.table[indx,ActivityDate]]
customer.table
# CustomerID AsOfDate distance MostRecentDate
# 1: 1 2015-06-30 2.173805 2015-05-26 00:00:00
# 2: 2 2015-06-30 29.402483 2014-05-13 00:00:00
# 3: 3 2015-06-30 3.013533 2012-05-30 16:10:00
# 4: 4 2015-06-30 18.492314 2015-06-10 00:00:00
# 5: 5 2015-06-30 294.878607 2015-04-29 00:00:00
# 6: 6 2015-06-30 11.887021 2015-04-29 00:00:00
# 7: 7 2015-06-30 9.544386 2015-06-03 00:00:00
# 8: 8 2015-06-30 24.219203 2015-06-03 00:00:00
# 9: 9 2015-06-30 15.006934 2015-04-29 00:00:00
# 10: 10 2015-06-30 10.451366 2015-06-01 00:00:00
# 11: 4 2014-03-07 18.492314 2013-06-10 00:00:00
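The question's customer.table and activity.table are not reproduced in full above, but the which = TRUE pattern can be sketched with a made-up pair of tables (names and values below are illustrative only):

```r
library(data.table)

# Made-up data: for each customer row, find the row index of the latest
# activity on or before AsOfDate (rolling forward = last observation carried forward)
customer <- data.table(CustomerID = c(1L, 2L),
                       AsOfDate = as.IDate(c("2015-06-30", "2015-06-30")))
activity <- data.table(CustomerID = c(1L, 1L, 2L),
                       ActivityDate = as.IDate(c("2015-05-26", "2015-07-04",
                                                 "2014-05-13")))

indx <- activity[customer,
                 on = c(CustomerID = "CustomerID", ActivityDate = "AsOfDate"),
                 roll = Inf,
                 which = TRUE]
indx
# 1 3  -- row 2 (2015-07-04) is after AsOfDate, so it is skipped

customer[, MostRecentDate := activity[indx, ActivityDate]]
```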
Left join two dataframes based on nearest timestamp in R
Here is a roundabout way using dplyr
library(dplyr)
df1 %>%
  left_join(df2 %>%
              left_join(df1) %>%
              mutate(date_diff = abs(date2 - date1)) %>%
              group_by(ID, date2) %>%
              filter(date_diff == min(date_diff)) %>%
              ungroup() %>%
              select(-date2, -date_diff)) %>%
  mutate(Flag = case_when(is.na(Flag) ~ "No",
                          TRUE ~ Flag))
Joining, by = "ID"
Joining, by = c("ID", "date1")
# A tibble: 7 x 3
ID date1 Flag
<dbl> <dttm> <chr>
1 1 2020-07-11 19:14:23 Yes
2 1 2020-07-21 13:11:10 Yes
3 1 2020-07-21 18:07:25 No
4 1 2020-07-28 18:18:11 No
5 2 2020-07-13 16:47:26 Yes
6 2 2020-07-18 17:11:37 Yes
7 3 2020-07-23 10:39:19 No
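The df1 and df2 used above are not shown in the question; the pipeline can be exercised on made-up data (IDs and timestamps below are illustrative only):

```r
library(dplyr)

# Made-up data: flag each df1 timestamp whose ID-matched df2 timestamp is
# nearest to it; all other df1 rows get "No"
df1 <- tibble(ID = c(1, 1, 2),
              date1 = as.POSIXct(c("2020-07-11 19:00:00",
                                   "2020-07-21 13:00:00",
                                   "2020-07-13 16:00:00"), tz = "UTC"))
df2 <- tibble(ID = c(1, 2),
              date2 = as.POSIXct(c("2020-07-11 19:10:00",
                                   "2020-07-13 16:05:00"), tz = "UTC"),
              Flag = c("Yes", "Yes"))

res <- df1 %>%
  left_join(df2 %>%
              left_join(df1, by = "ID") %>%
              mutate(date_diff = abs(date2 - date1)) %>%
              group_by(ID, date2) %>%
              filter(date_diff == min(date_diff)) %>%   # keep nearest df1 row per df2 row
              ungroup() %>%
              select(-date2, -date_diff),
            by = c("ID", "date1")) %>%
  mutate(Flag = case_when(is.na(Flag) ~ "No",
                          TRUE ~ Flag))
res$Flag
# "Yes" "No" "Yes"
```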
data.table different behaviour with keys after merge and X[Y] joins
It looks to me like ?merge.data.table documents this for its sort argument:
If TRUE (default), the merged data.table is sorted by setting the key to the by / by.x columns. If FALSE, the result is not sorted.
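A quick illustration of the difference, with made-up tables:

```r
library(data.table)

# merge() sorts the result by the join columns by default (and keys it);
# an X[Y] join instead returns rows in Y's order
X <- data.table(id = c(3L, 1L, 2L), a = c("x", "y", "z"))
Y <- data.table(id = c(2L, 3L, 1L), b = c(20L, 30L, 10L))

merge(X, Y, by = "id")$id                # 1 2 3 (sorted; result keyed by id)
merge(X, Y, by = "id", sort = FALSE)$id  # unsorted
X[Y, on = "id"]$id                       # 2 3 1 (Y's row order)
```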
Conditional merging with many variables
We may use mget to do this, i.e. create the object names as strings with paste0, then do a join on the 'ID' column after subsetting 'DT2', and assign the output (mget returns a list of vectors) to create new columns in 'DT1':
merge_variables = c("height", "weight", "V1", "V2")
parent_nm <- paste0(merge_variables, "_parents")
child_nm <- paste0(merge_variables, "_children")
DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
DT1[DT2[type == "Children"], (child_nm) := mget(paste0("i.", merge_variables)), on = .(ID)]
Output:
> DT1
income ID height_parents weight_parents V1_parents V2_parents height_children weight_children V1_children V2_children
1: 0.26550866 1 0.20597457 0.4820801 0.47761962 0.6781459 0.1765568 0.5995658 0.86120948 1.6788807
2: 0.37212390 2 0.68702285 0.4935413 0.43809711 0.6933670 0.3841037 0.1862176 0.24479728 0.6675499
3: 0.57285336 3 0.76984142 0.8273733 0.07067905 0.9527025 0.4976992 0.6684667 0.09946616 1.7843967
4: 0.90820779 4 0.71761851 0.7942399 0.31627171 1.7286789 0.9919061 0.1079436 0.51863426 0.7799791
5: 0.20168193 5 0.38003518 0.7237109 0.66200508 1.5546414 0.7774452 0.4112744 0.40683019 1.9212360
6: 0.89838968 6 0.93470523 0.8209463 0.91287592 0.8693190 0.2121425 0.6470602 0.29360337 1.4250294
7: 0.94467527 7 0.65167377 0.7829328 0.45906573 0.7999887 0.1255551 0.5530363 0.33239467 0.6507043
8: 0.66079779 8 0.26722067 0.5297196 0.65087047 1.5141743 0.3861141 0.7893562 0.25801678 0.4053845
9: 0.62911404 9 0.01339033 0.0233312 0.47854525 1.4222424 0.3823880 0.4772301 0.76631067 0.2433838
10: 0.06178627 10 0.86969085 0.7323137 0.08424691 0.4909770 0.3403490 0.6927316 0.87532133 0.2866088
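The DT1/DT2 used above aren't shown in the question; here is the same mget pattern on a small made-up pair of tables:

```r
library(data.table)

# Made-up data illustrating the pattern: copy several i. columns from a
# subset of DT2 into new DT1 columns in one update join
DT1 <- data.table(ID = 1:3, income = c(10, 20, 30))
DT2 <- data.table(ID = rep(1:3, 2),
                  type = rep(c("Parents", "Children"), each = 3),
                  height = 1:6, weight = 7:12)

merge_variables <- c("height", "weight")
parent_nm <- paste0(merge_variables, "_parents")

# mget builds the list of i.height, i.weight vectors to assign
DT1[DT2[type == "Parents"], (parent_nm) := mget(paste0("i.", merge_variables)),
    on = .(ID)]
DT1
#    ID income height_parents weight_parents
# 1:  1     10              1              7
# 2:  2     20              2              8
# 3:  3     30              3              9
```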
Or another option is to dcast 'DT2' to wide format and do a join (but here we are not assigning with :=, so it wouldn't update the original DT1):
DT1[dcast(DT2, ID ~ type, value.var = merge_variables), on = .(ID)]
income ID height_Children height_Parents weight_Children weight_Parents V1_Children V1_Parents V2_Children V2_Parents
1: 0.26550866 1 0.1765568 0.20597457 0.5995658 0.4820801 0.86120948 0.47761962 1.6788807 0.6781459
2: 0.37212390 2 0.3841037 0.68702285 0.1862176 0.4935413 0.24479728 0.43809711 0.6675499 0.6933670
3: 0.57285336 3 0.4976992 0.76984142 0.6684667 0.8273733 0.09946616 0.07067905 1.7843967 0.9527025
4: 0.90820779 4 0.9919061 0.71761851 0.1079436 0.7942399 0.51863426 0.31627171 0.7799791 1.7286789
5: 0.20168193 5 0.7774452 0.38003518 0.4112744 0.7237109 0.40683019 0.66200508 1.9212360 1.5546414
6: 0.89838968 6 0.2121425 0.93470523 0.6470602 0.8209463 0.29360337 0.91287592 1.4250294 0.8693190
7: 0.94467527 7 0.1255551 0.65167377 0.5530363 0.7829328 0.33239467 0.45906573 0.6507043 0.7999887
8: 0.66079779 8 0.3861141 0.26722067 0.7893562 0.5297196 0.25801678 0.65087047 0.4053845 1.5141743
9: 0.62911404 9 0.3823880 0.01339033 0.4772301 0.0233312 0.76631067 0.47854525 0.2433838 1.4222424
10: 0.06178627 10 0.3403490 0.86969085 0.6927316 0.7323137 0.87532133 0.08424691 0.2866088 0.4909770
Join datatables based on 2 values
You can use roll = "nearest" when joining dt1 and dt2:
library(data.table)
setkey(dt1, type1, value)
setkey(dt2, type1, value)
dt1[dt2, roll = "nearest"]
Output
type1 type2 value
1: A K 0.3
2: A K 1.1
3: A I 2.6
4: A I 3.2
5: A I 4.0
6: B K 2.4
7: B K 5.1
8: B K 5.5
9: B I 6.7
10: B I 9.0
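dt1 and dt2 above come from the question; the same idea can be checked on a small made-up pair (values chosen so each nearest match is easy to verify by eye):

```r
library(data.table)

# Made-up data: label each value in dt2 with the type2 of the nearest
# dt1 value within the same type1 group
dt1 <- data.table(type1 = c("A", "A", "B", "B"),
                  type2 = c("K", "I", "K", "I"),
                  value = c(1.0, 3.0, 5.0, 8.0))
dt2 <- data.table(type1 = c("A", "A", "B", "B"),
                  value = c(0.3, 2.6, 5.5, 9.0))

setkey(dt1, type1, value)
setkey(dt2, type1, value)

dt1[dt2, roll = "nearest"]
#    type1 type2 value
# 1:     A     K   0.3
# 2:     A     I   2.6
# 3:     B     K   5.5
# 4:     B     I   9.0
```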
Mapping and assigning values between columns of different lengths using R
We could do it with fuzzyjoin to handle near matches:
library(fuzzyjoin)
library(dplyr)
fuzzy_left_join(df, df2, by = c("time" = "time2"), match_fun = list(`<=`)) %>%
  group_by(time2) %>%
  slice(n()) %>%
  ungroup() %>%
  group_by(time) %>%
  filter(row_number() == 1) %>%
  select(id3 = id, time3 = time, score3 = score2)
id3 time3 score3
<dbl> <dbl> <dbl>
1 4375 0 0.028
2 4375 88 0.057
3 4375 96 0.057
4 4375 114 0.085