Merge Nearest Date, and Related Variables from a Another Dataframe by Group

Merge nearest date, and related variables from a another dataframe by group

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
   d1 <- subset(df1,ID==id)
   d2 <- subset(df2,ID==id)

   d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
   d2$indices <- 1:nrow(d2)

   merge(d1,d2,by=c('ID','indices'))
  })

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

#    ID dateTarget.x Value dateTarget.y ValueMatch
# 1   3   2015-11-14    47   2015-07-06         48
# 2   3   2015-12-08    98   2015-07-06         48
# 3   3   2015-02-22    52   2015-03-09         94
# 4   3   2014-11-17    68   2014-12-15         95
# 5   3   2013-05-30    91   2013-04-01         85
# 6   1   2013-11-04    70   2014-02-21         35
# 7   1   2014-12-29    18   2014-12-06         88
# 8   2   2013-01-14    52   2013-04-08         77
# 9   2   2015-07-29    97   2015-08-01         68
# 10  2   2015-06-15    98   2015-08-01         68

R merge two dataframes by closest date

Left join dfB to dfA, take the difference between dates per row and choose the smallest diff per id.

left_join(dfA, dfB, by = "id") %>%
  mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
  group_by(id) %>%
  filter(date_diff == min(date_diff)) %>%
  select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)

Then output is:

# A tibble: 2 x 8
# Groups:   id [2]
  id     Answer_Date.x Answer_Date.y    x1    x2    x3    x4 date_diff
  <fct>  <date>        <date>        <dbl> <dbl> <dbl> <dbl> <drtn>   
1 Apple  2013-12-07    2013-12-05        1    10     5    50 2 days   
2 Banana 2014-12-07    2014-12-10        2    20     3    30 3 days

By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
    abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", "", .x)),
      rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))

##>+                 df1.date df1.value            df2.date  df2.value
##>1    2021-11-23 20:56:06       500 2021-11-23 20:55:47  Ship Emma
##>1.1  2021-11-23 20:56:07       900 2021-11-23 20:55:47  Ship Emma
##>1.2  2021-11-23 20:56:08      1000 2021-11-23 20:55:47  Ship Emma
##>1.3  2021-11-23 20:56:09       200 2021-11-23 20:55:47  Ship Emma
##>1.4  2021-11-23 20:56:10       300 2021-11-23 20:55:47  Ship Emma
##>1.5  2021-11-23 20:56:11        10 2021-11-23 20:55:47  Ship Emma
##>5    2021-11-23 22:13:56      1000 2021-11-23 22:16:01   Ship Amy
##>5.1  2021-11-23 22:13:57       450 2021-11-23 22:16:01   Ship Amy
##>5.2  2021-11-23 22:13:58       950 2021-11-23 22:16:01   Ship Amy
##>5.3  2021-11-23 22:13:59       600 2021-11-23 22:16:01   Ship Amy
##>12   2021-11-24 03:23:21       100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22       750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23       150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24       200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25       300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34       400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35       900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36      1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37       800 2021-11-24 03:23:37 Ship Sally

The second approach involves first calculating the cartesian product of all the rows of df1 and df2 and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character()) to get all the combinations of the two dataframes :

mutate(df1, id = row_number()) %>%
    inner_join(mutate(df2, id = row_number()),by = character()) |>
    mutate(dist = abs(date.x - date.y)) |>
    group_by(id.x) |>
    filter(dist == min(dist)) |>
    select(-id.x, -id.y, -dist)

  ##>+ # A tibble: 19 × 7
  ##># Groups:   id.x [19]
  ##>   date.x              value.x  id.x date.y              value.y     id.y dist  
  ##>   <dttm>                <dbl> <int> <dttm>              <chr>      <int> <drtn>
  ##> 1 2021-11-23 20:56:06     500     1 2021-11-23 20:55:47 Ship Emma      1  19 s…
  ##> 2 2021-11-23 20:56:07     900     2 2021-11-23 20:55:47 Ship Emma      1  20 s…
  ##> 3 2021-11-23 20:56:08    1000     3 2021-11-23 20:55:47 Ship Emma      1  21 s…
  ##> 4 2021-11-23 20:56:09     200     4 2021-11-23 20:55:47 Ship Emma      1  22 s…
  ##> 5 2021-11-23 20:56:10     300     5 2021-11-23 20:55:47 Ship Emma      1  23 s…
  ##> 6 2021-11-23 20:56:11      10     6 2021-11-23 20:55:47 Ship Emma      1  24 s…
  ##> 7 2021-11-23 22:13:56    1000     7 2021-11-23 22:16:01 Ship Amy       5 125 s…
  ##> 8 2021-11-23 22:13:57     450     8 2021-11-23 22:16:01 Ship Amy       5 124 s…
  ##> 9 2021-11-23 22:13:58     950     9 2021-11-23 22:16:01 Ship Amy       5 123 s…
  ##>10 2021-11-23 22:13:59     600    10 2021-11-23 22:16:01 Ship Amy       5 122 s…
  ##>11 2021-11-24 03:23:21     100    11 2021-11-24 03:23:37 Ship Sally    12  16 s…
  ##>12 2021-11-24 03:23:22     750    12 2021-11-24 03:23:37 Ship Sally    12  15 s…
  ##>13 2021-11-24 03:23:23     150    13 2021-11-24 03:23:37 Ship Sally    12  14 s…
  ##>14 2021-11-24 03:23:24     200    14 2021-11-24 03:23:37 Ship Sally    12  13 s…
  ##>15 2021-11-24 03:23:25     300    15 2021-11-24 03:23:37 Ship Sally    12  12 s…
  ##>16 2021-11-24 03:24:34     400    16 2021-11-24 03:23:37 Ship Sally    12  57 s…
  ##>17 2021-11-24 03:24:35     900    17 2021-11-24 03:23:37 Ship Sally    12  58 s…
  ##>18 2021-11-24 03:24:36    1020    18 2021-11-24 03:23:37 Ship Sally    12  59 s…
  ##>19 2021-11-24 03:24:37     800    19 2021-11-24 03:23:37 Ship Sally    12  60 s…

Joining two data frames on the closest date in R

You were almost there.

In the DT[i,on] syntax, i should be survey to join on all its rows

setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price

         date price.var1 price.var2         ID
       <IDat>      <num>      <num>      <int>
1: 2012-09-26   4.100958   4.147176   20120377
2: 2020-11-23   2.747339   2.739948 2020455822
3: 2012-10-26   4.100958   4.147176   20126758
4: 2012-10-25   4.100958   4.147176   20124241
5: 2020-11-28   2.747339   2.739948 2020426572

How to match by nearest date from two data frames?

One way is to use the roll=Inf feature from the data.table package as follows:

require(data.table)   ## >= 1.9.2
setDT(df1)            ## convert to data.table by reference
setDT(df2)            ## same

df1[, date := date1]  ## create a duplicate of 'date1'
setkey(df1, date1)    ## set the column to perform the join on
setkey(df2, date2)    ## same as above

ans = df1[df2, roll=Inf] ## perform rolling join

## change names and set column order as required, by reference
setnames(ans, c('date','date1'), c('date1','date2'))
setcolorder(ans, c('epi', 'date1', 'bmi', 'date2'))

> ans
#   epi      date1      bmi      date2
#1:   1 2014-01-08 33.57532 2014-01-08
#2:   2 2014-01-15 22.63604 2014-01-15
#3:   3 2014-01-26 22.22079 2014-01-28
#4:   4 2014-02-01 15.16691 2014-02-05
#5:   5 2014-02-15 27.48925 2014-02-24

Pandas Merge on Name and Closest Date

I'd also love to see the final solution you came up with to know how it shook out in the end.

One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin to retrieve the date with the smallest time delta.

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*- 
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

look at data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-02       29
2 2014-01-03        5
3 2014-01-04       73
4 2014-01-05       40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

   timepoint  measure
0 2014-01-01       78
1 2014-01-08       29
2 2014-01-15        5
3 2014-01-22       73
4 2014-01-29       40

Func to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series with dates in it
    # calcs the delta between `timepoint` and each date in `time_series`
    # returns the closest date and optionally the number of days in its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.ix[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
                                          find_closest_date, args=[df2.timepoint])
df1.head(10)

   timepoint  measure    closest  days_bt_x_and_y
0 2014-01-01       78 2014-01-01           0 days
1 2014-01-02       29 2014-01-01           1 days
2 2014-01-03        5 2014-01-01           2 days
3 2014-01-04       73 2014-01-01           3 days
4 2014-01-05       40 2014-01-08           3 days
5 2014-01-06       45 2014-01-08           2 days
6 2014-01-07       48 2014-01-08           1 days
7 2014-01-08        2 2014-01-08           0 days
8 2014-01-09       96 2014-01-08           1 days
9 2014-01-10       82 2014-01-08           2 days

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y'
]

df3 = df3.ix[:, colorder]
df3

   timepoint_x    closest timepoint_y  days_bt_x_and_y  measure_x  measure_y
0   2014-01-01 2014-01-01  2014-01-01           0 days         78         78
1   2014-01-02 2014-01-01  2014-01-01           1 days         29         78
2   2014-01-03 2014-01-01  2014-01-01           2 days          5         78
3   2014-01-04 2014-01-01  2014-01-01           3 days         73         78
4   2014-01-05 2014-01-08  2014-01-08           3 days         40         29
5   2014-01-06 2014-01-08  2014-01-08           2 days         45         29
6   2014-01-07 2014-01-08  2014-01-08           1 days         48         29
7   2014-01-08 2014-01-08  2014-01-08           0 days          2         29
8   2014-01-09 2014-01-08  2014-01-08           1 days         96         29
9   2014-01-10 2014-01-08  2014-01-08           2 days         82         29
10  2014-01-11 2014-01-08  2014-01-08           3 days         61         29
11  2014-01-12 2014-01-15  2014-01-15           3 days         68          5
12  2014-01-13 2014-01-15  2014-01-15           2 days          8          5
13  2014-01-14 2014-01-15  2014-01-15           1 days         94          5
14  2014-01-15 2014-01-15  2014-01-15           0 days         16          5
15  2014-01-16 2014-01-15  2014-01-15           1 days         31          5
16  2014-01-17 2014-01-15  2014-01-15           2 days         10          5
17  2014-01-18 2014-01-15  2014-01-15           3 days         34          5
18  2014-01-19 2014-01-22  2014-01-22           3 days         27         73
19  2014-01-20 2014-01-22  2014-01-22           2 days         58         73
20  2014-01-21 2014-01-22  2014-01-22           1 days         90         73
21  2014-01-22 2014-01-22  2014-01-22           0 days         41         73
22  2014-01-23 2014-01-22  2014-01-22           1 days         97         73
23  2014-01-24 2014-01-22  2014-01-22           2 days          7         73
24  2014-01-25 2014-01-22  2014-01-22           3 days         86         73
25  2014-01-26 2014-01-29  2014-01-29           3 days         62         40
26  2014-01-27 2014-01-29  2014-01-29           2 days         91         40
27  2014-01-28 2014-01-29  2014-01-29           1 days          0         40
28  2014-01-29 2014-01-29  2014-01-29           0 days         73         40
29  2014-01-30 2014-01-29  2014-01-29           1 days         22         40
30  2014-01-31 2014-01-29  2014-01-29           2 days         43         40
31  2014-02-01 2014-01-29  2014-01-29           3 days         87         40
32  2014-02-02 2014-02-05  2014-02-05           3 days         56         45
33  2014-02-03 2014-02-05  2014-02-05           2 days         45         45
34  2014-02-04 2014-02-05  2014-02-05           1 days         25         45
35  2014-02-05 2014-02-05  2014-02-05           0 days         92         45
36  2014-02-06 2014-02-05  2014-02-05           1 days         83         45
37  2014-02-07 2014-02-05  2014-02-05           2 days         13         45
38  2014-02-08 2014-02-05  2014-02-05           3 days         50         45
39  2014-02-09 2014-02-12  2014-02-12           3 days         48         48
40  2014-02-10 2014-02-12  2014-02-12           2 days         78         48

merging data frames based on multiple nearest matches in R

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with roll="nearest" that you mentioned.

In this case I've melted both sets of data to long datasets so that the matching can be done in a single join.

library(data.table)
setDT(df1)
setDT(df2)

df1[
    match(
        melt(df1, id.vars="julian")[
            melt(df2, measure.vars=names(df2)),
            on=c("variable","value"), roll="nearest"]$julian,
        julian),
]
#   julian        a        b         c         d
#1:      9 12.02948 13.54714  7.659482  6.784113
#2:     20 28.74620 20.24871 18.523935 17.801711
#3:     10 13.00511 14.57352  8.296155  6.942622
#4:     24 30.26931 24.20554 20.253149 22.017714

If you want separate tables for each join instead you could do something like:

lapply(names(df2), \(var)  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)] )

Merge Nearest Date, and Related Variables from a Another Dataframe by Group