How to Match by Nearest Date from Two Data Frames

Joining two data frames in R, by nearest date and one other variable

You could add region to on condition:

price[survey,.(region,price,x.price_date,survey_date),on=.(region,price_date=survey_date),roll="nearest"][]

region price x.price_date survey_date
<char> <num> <Date> <Date>
1: A 0.88010416 1999-10-26 1999-09-05
2: B 0.31026217 1999-12-25 1999-12-31
3: A -1.68732684 1999-04-27 1999-06-02
4: C 0.00500695 1999-08-09 1999-05-11
5: C 0.00500695 1999-08-09 1999-03-24
6: B -0.03763026 1999-08-12 1999-09-02
7: C -0.64701901 1999-12-24 2000-01-01
8: B -0.03763026 1999-08-12 1999-08-06
9: C 0.00500695 1999-08-09 1999-08-03
10: A 0.88010416 1999-10-26 1999-11-22
11: C 1.37001035 1999-09-15 1999-10-03
12: B 0.01831663 1999-07-01 1999-06-18
13: A -0.62743621 1999-03-20 1999-03-12
14: B 0.72397606 1999-02-18 1999-03-02
15: C -0.64701901 1999-12-24 1999-12-18

Note use of x. to display LHS date.

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", "", .x)),
rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))

##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally

The second approach involves first calculating the cartesian product of all the rows of df1 and df2 and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character()) to get all the combinations of the two dataframes :

mutate(df1, id = row_number()) %>%
inner_join(mutate(df2, id = row_number()),by = character()) |>
mutate(dist = abs(date.x - date.y)) |>
group_by(id.x) |>
filter(dist == min(dist)) |>
select(-id.x, -id.y, -dist)

##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…

How to match by nearest date from two data frames?

One way is to use the roll=Inf feature from the data.table package as follows:

require(data.table)   ## >= 1.9.2
setDT(df1) ## convert to data.table by reference
setDT(df2) ## same

df1[, date := date1] ## create a duplicate of 'date1'
setkey(df1, date1) ## set the column to perform the join on
setkey(df2, date2) ## same as above

ans = df1[df2, roll=Inf] ## perform rolling join

## change names and set column order as required, by reference
setnames(ans, c('date','date1'), c('date1','date2'))
setcolorder(ans, c('epi', 'date1', 'bmi', 'date2'))

> ans
# epi date1 bmi date2
#1: 1 2014-01-08 33.57532 2014-01-08
#2: 2 2014-01-15 22.63604 2014-01-15
#3: 3 2014-01-26 22.22079 2014-01-28
#4: 4 2014-02-01 15.16691 2014-02-05
#5: 5 2014-02-15 27.48925 2014-02-24

Joining two data frames on the closest date in R

You were almost there.

In the DT[i,on] syntax, i should be survey to join on all its rows

setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price

date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572

Matching nearest date between two data.frames

You could convert character dates to Date and use roll='nearest':

setDT(df)            ## convert to data.table by reference
setDT(ONS) ## same

df[, date := as.Date(date_of_sampling)] ## create a duplicate of 'df'
setkey(df, date) ## set the column to perform the join on
ONS[, date := as.Date(sample_date_midpoint)] ## create a duplicate of 'ONS'
setkey(ONS, date) ## same as above

ONS[df, roll='nearest'][
abs(difftime(sample_date_midpoint,date_of_sampling,unit='day'))<5]

# Key: <date>
# sample_date_midpoint prevalence date date_of_sampling operator num_passengers
# <char> <num> <Date> <char> <char> <num>
# 1: 2021-01-01 0.1964160 2021-01-01 2021-01-01 A 204
# 2: 2021-01-01 0.1964160 2021-01-03 2021-01-03 B 100
# 3: 2021-02-05 0.3906553 2021-02-03 2021-02-03 B 400
# 4: 2021-02-05 0.3906553 2021-02-04 2021-02-04 A 155

Merge nearest date, and related variables from a another dataframe by group

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
d1 <- subset(df1,ID==id)
d2 <- subset(df2,ID==id)

d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
d2$indices <- 1:nrow(d2)

merge(d1,d2,by=c('ID','indices'))
})

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

# ID dateTarget.x Value dateTarget.y ValueMatch
# 1 3 2015-11-14 47 2015-07-06 48
# 2 3 2015-12-08 98 2015-07-06 48
# 3 3 2015-02-22 52 2015-03-09 94
# 4 3 2014-11-17 68 2014-12-15 95
# 5 3 2013-05-30 91 2013-04-01 85
# 6 1 2013-11-04 70 2014-02-21 35
# 7 1 2014-12-29 18 2014-12-06 88
# 8 2 2013-01-14 52 2013-04-08 77
# 9 2 2015-07-29 97 2015-08-01 68
# 10 2 2015-06-15 98 2015-08-01 68

merging data frames based on multiple nearest matches in R

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.

In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.

library(data.table)
setDT(df1)
setDT(df2)

df1[
match(
melt(df1, id.vars="julian")[
melt(df2, measure.vars=names(df2)),
on=c("variable","value"), roll="nearest"]$julian,
julian),
]
# julian a b c d
#1: 9 12.02948 13.54714 7.659482 6.784113
#2: 20 28.74620 20.24871 18.523935 17.801711
#3: 10 13.00511 14.57352 8.296155 6.942622
#4: 24 30.26931 24.20554 20.253149 22.017714

If you want separate tables for each join instead you could do something like:

lapply(names(df2), \(var)  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )

Python - Join two dataframes based on closest date match and additional column

My output is a bit different in merge_asof method:

df1['alert_dt'] = pd.to_datetime(df1['alert_dt'], dayfirst=True)
df2['inv_dt'] = pd.to_datetime(df2['inv_dt'], dayfirst=True)


df = pd.merge_asof(df2.sort_values('inv_dt'),
df1.sort_values('alert_dt'),
left_on='inv_dt',
right_on='alert_dt',
by='pty')
print (df)
inv_dt pty alert_dt
0 2020-06-07 A 2020-06-01
1 2020-06-14 A 2020-06-12
2 2020-06-27 A 2020-06-27
3 2020-07-12 B 2020-06-12
4 2020-08-15 B 2020-07-15


Related Topics



Leave a reply



Submit