Merge Nearest Date, and Related Variables from a Another Dataframe by Group

Merge nearest date, and related variables from a another dataframe by group

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
d1 <- subset(df1,ID==id)
d2 <- subset(df2,ID==id)

d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
d2$indices <- 1:nrow(d2)

merge(d1,d2,by=c('ID','indices'))
})

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

# ID dateTarget.x Value dateTarget.y ValueMatch
# 1 3 2015-11-14 47 2015-07-06 48
# 2 3 2015-12-08 98 2015-07-06 48
# 3 3 2015-02-22 52 2015-03-09 94
# 4 3 2014-11-17 68 2014-12-15 95
# 5 3 2013-05-30 91 2013-04-01 85
# 6 1 2013-11-04 70 2014-02-21 35
# 7 1 2014-12-29 18 2014-12-06 88
# 8 2 2013-01-14 52 2013-04-08 77
# 9 2 2015-07-29 97 2015-08-01 68
# 10 2 2015-06-15 98 2015-08-01 68

R merge two dataframes by closest date

Left join dfB to dfA, take the difference between dates per row and choose the smallest diff per id.

left_join(dfA, dfB, by = "id") %>%
mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
group_by(id) %>%
filter(date_diff == min(date_diff)) %>%
select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)

Then output is:

# A tibble: 2 x 8
# Groups: id [2]
id Answer_Date.x Answer_Date.y x1 x2 x3 x4 date_diff
<fct> <date> <date> <dbl> <dbl> <dbl> <dbl> <drtn>
1 Apple 2013-12-07 2013-12-05 1 10 5 50 2 days
2 Banana 2014-12-07 2014-12-10 2 20 3 30 3 days

By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", "", .x)),
rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))

##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally

The second approach involves first calculating the cartesian product of all the rows of df1 and df2 and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character()) to get all the combinations of the two dataframes :

mutate(df1, id = row_number()) %>%
inner_join(mutate(df2, id = row_number()),by = character()) |>
mutate(dist = abs(date.x - date.y)) |>
group_by(id.x) |>
filter(dist == min(dist)) |>
select(-id.x, -id.y, -dist)

##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…

Joining two data frames on the closest date in R

You were almost there.

In the DT[i,on] syntax, i should be survey to join on all its rows

setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price

date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572

How to match by nearest date from two data frames?

One way is to use the roll=Inf feature from the data.table package as follows:

require(data.table)   ## >= 1.9.2
setDT(df1) ## convert to data.table by reference
setDT(df2) ## same

df1[, date := date1] ## create a duplicate of 'date1'
setkey(df1, date1) ## set the column to perform the join on
setkey(df2, date2) ## same as above

ans = df1[df2, roll=Inf] ## perform rolling join

## change names and set column order as required, by reference
setnames(ans, c('date','date1'), c('date1','date2'))
setcolorder(ans, c('epi', 'date1', 'bmi', 'date2'))

> ans
# epi date1 bmi date2
#1: 1 2014-01-08 33.57532 2014-01-08
#2: 2 2014-01-15 22.63604 2014-01-15
#3: 3 2014-01-26 22.22079 2014-01-28
#4: 4 2014-02-01 15.16691 2014-02-05
#5: 5 2014-02-15 27.48925 2014-02-24

Pandas Merge on Name and Closest Date

I'd also love to see the final solution you came up with to know how it shook out in the end.

One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin to retrieve the date with the smallest time delta.

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

look at data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

timepoint measure
0 2014-01-01 78
1 2014-01-02 29
2 2014-01-03 5
3 2014-01-04 73
4 2014-01-05 40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

timepoint measure
0 2014-01-01 78
1 2014-01-08 29
2 2014-01-15 5
3 2014-01-22 73
4 2014-01-29 40

Func to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
# takes a pd.Timestamp() instance and a pd.Series with dates in it
# calcs the delta between `timepoint` and each date in `time_series`
# returns the closest date and optionally the number of days in its time delta
deltas = np.abs(time_series - timepoint)
idx_closest_date = np.argmin(deltas)
res = {"closest_date": time_series.ix[idx_closest_date]}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = deltas[idx_closest_date]
idx.append('closest_delta')
return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
find_closest_date, args=[df2.timepoint])
df1.head(10)

timepoint measure closest days_bt_x_and_y
0 2014-01-01 78 2014-01-01 0 days
1 2014-01-02 29 2014-01-01 1 days
2 2014-01-03 5 2014-01-01 2 days
3 2014-01-04 73 2014-01-01 3 days
4 2014-01-05 40 2014-01-08 3 days
5 2014-01-06 45 2014-01-08 2 days
6 2014-01-07 48 2014-01-08 1 days
7 2014-01-08 2 2014-01-08 0 days
8 2014-01-09 96 2014-01-08 1 days
9 2014-01-10 82 2014-01-08 2 days

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
'timepoint_x',
'closest',
'timepoint_y',
'days_bt_x_and_y',
'measure_x',
'measure_y'
]

df3 = df3.ix[:, colorder]
df3

timepoint_x closest timepoint_y days_bt_x_and_y measure_x measure_y
0 2014-01-01 2014-01-01 2014-01-01 0 days 78 78
1 2014-01-02 2014-01-01 2014-01-01 1 days 29 78
2 2014-01-03 2014-01-01 2014-01-01 2 days 5 78
3 2014-01-04 2014-01-01 2014-01-01 3 days 73 78
4 2014-01-05 2014-01-08 2014-01-08 3 days 40 29
5 2014-01-06 2014-01-08 2014-01-08 2 days 45 29
6 2014-01-07 2014-01-08 2014-01-08 1 days 48 29
7 2014-01-08 2014-01-08 2014-01-08 0 days 2 29
8 2014-01-09 2014-01-08 2014-01-08 1 days 96 29
9 2014-01-10 2014-01-08 2014-01-08 2 days 82 29
10 2014-01-11 2014-01-08 2014-01-08 3 days 61 29
11 2014-01-12 2014-01-15 2014-01-15 3 days 68 5
12 2014-01-13 2014-01-15 2014-01-15 2 days 8 5
13 2014-01-14 2014-01-15 2014-01-15 1 days 94 5
14 2014-01-15 2014-01-15 2014-01-15 0 days 16 5
15 2014-01-16 2014-01-15 2014-01-15 1 days 31 5
16 2014-01-17 2014-01-15 2014-01-15 2 days 10 5
17 2014-01-18 2014-01-15 2014-01-15 3 days 34 5
18 2014-01-19 2014-01-22 2014-01-22 3 days 27 73
19 2014-01-20 2014-01-22 2014-01-22 2 days 58 73
20 2014-01-21 2014-01-22 2014-01-22 1 days 90 73
21 2014-01-22 2014-01-22 2014-01-22 0 days 41 73
22 2014-01-23 2014-01-22 2014-01-22 1 days 97 73
23 2014-01-24 2014-01-22 2014-01-22 2 days 7 73
24 2014-01-25 2014-01-22 2014-01-22 3 days 86 73
25 2014-01-26 2014-01-29 2014-01-29 3 days 62 40
26 2014-01-27 2014-01-29 2014-01-29 2 days 91 40
27 2014-01-28 2014-01-29 2014-01-29 1 days 0 40
28 2014-01-29 2014-01-29 2014-01-29 0 days 73 40
29 2014-01-30 2014-01-29 2014-01-29 1 days 22 40
30 2014-01-31 2014-01-29 2014-01-29 2 days 43 40
31 2014-02-01 2014-01-29 2014-01-29 3 days 87 40
32 2014-02-02 2014-02-05 2014-02-05 3 days 56 45
33 2014-02-03 2014-02-05 2014-02-05 2 days 45 45
34 2014-02-04 2014-02-05 2014-02-05 1 days 25 45
35 2014-02-05 2014-02-05 2014-02-05 0 days 92 45
36 2014-02-06 2014-02-05 2014-02-05 1 days 83 45
37 2014-02-07 2014-02-05 2014-02-05 2 days 13 45
38 2014-02-08 2014-02-05 2014-02-05 3 days 50 45
39 2014-02-09 2014-02-12 2014-02-12 3 days 48 48
40 2014-02-10 2014-02-12 2014-02-12 2 days 78 48

merging data frames based on multiple nearest matches in R

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.

In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.

library(data.table)
setDT(df1)
setDT(df2)

df1[
match(
melt(df1, id.vars="julian")[
melt(df2, measure.vars=names(df2)),
on=c("variable","value"), roll="nearest"]$julian,
julian),
]
# julian a b c d
#1: 9 12.02948 13.54714 7.659482 6.784113
#2: 20 28.74620 20.24871 18.523935 17.801711
#3: 10 13.00511 14.57352 8.296155 6.942622
#4: 24 30.26931 24.20554 20.253149 22.017714

If you want separate tables for each join instead you could do something like:

lapply(names(df2), \(var)  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )


Related Topics



Leave a reply



Submit