# How to Match by Nearest Date from Two Data Frames

## Joining two data frames in R, by nearest date and one other variable

You could add `region` to the `on` condition:

```r
price[survey, .(region, price, x.price_date, survey_date),
      on = .(region, price_date = survey_date), roll = "nearest"][]
```

```
    region       price x.price_date survey_date
    <char>       <num>       <Date>      <Date>
 1:      A  0.88010416   1999-10-26  1999-09-05
 2:      B  0.31026217   1999-12-25  1999-12-31
 3:      A -1.68732684   1999-04-27  1999-06-02
 4:      C  0.00500695   1999-08-09  1999-05-11
 5:      C  0.00500695   1999-08-09  1999-03-24
 6:      B -0.03763026   1999-08-12  1999-09-02
 7:      C -0.64701901   1999-12-24  2000-01-01
 8:      B -0.03763026   1999-08-12  1999-08-06
 9:      C  0.00500695   1999-08-09  1999-08-03
10:      A  0.88010416   1999-10-26  1999-11-22
11:      C  1.37001035   1999-09-15  1999-10-03
12:      B  0.01831663   1999-07-01  1999-06-18
13:      A -0.62743621   1999-03-20  1999-03-12
14:      B  0.72397606   1999-02-18  1999-03-02
15:      C -0.64701901   1999-12-24  1999-12-18
```

Note the use of the `x.` prefix to display the date from `price` (the left-hand table) rather than the date from `survey`.

## Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and performs a left join of `df1` to `df2`. The distance matrix is given by:

```r
library(dplyr)  # provides pull(), rename_with() and %>%

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
    abs()
```

Next, for each row of `df1`, the index of the closest row of `df2` is given by:

```r
closest.df1 <- apply(dateDist, 1, which.min)
```

Finally, the merge is performed manually:

```r
cbind(rename_with(df1, ~paste0("df1.", "", .x)),
      rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))
##>                 df1.date df1.value            df2.date  df2.value
##>1    2021-11-23 20:56:06       500 2021-11-23 20:55:47  Ship Emma
##>1.1  2021-11-23 20:56:07       900 2021-11-23 20:55:47  Ship Emma
##>1.2  2021-11-23 20:56:08      1000 2021-11-23 20:55:47  Ship Emma
##>1.3  2021-11-23 20:56:09       200 2021-11-23 20:55:47  Ship Emma
##>1.4  2021-11-23 20:56:10       300 2021-11-23 20:55:47  Ship Emma
##>1.5  2021-11-23 20:56:11        10 2021-11-23 20:55:47  Ship Emma
##>5    2021-11-23 22:13:56      1000 2021-11-23 22:16:01   Ship Amy
##>5.1  2021-11-23 22:13:57       450 2021-11-23 22:16:01   Ship Amy
##>5.2  2021-11-23 22:13:58       950 2021-11-23 22:16:01   Ship Amy
##>5.3  2021-11-23 22:13:59       600 2021-11-23 22:16:01   Ship Amy
##>12   2021-11-24 03:23:21       100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22       750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23       150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24       200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25       300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34       400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35       900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36      1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37       800 2021-11-24 03:23:37 Ship Sally
```

The second approach first computes the Cartesian product of all rows of `df1` and `df2`, then keeps only the rows with the minimum distance. The trick here is to use `inner_join(..., by = character())` to get all combinations of the two data frames (in dplyr 1.1.0 and later, `cross_join()` is the preferred way to do this):

```r
mutate(df1, id = row_number()) %>%
    inner_join(mutate(df2, id = row_number()), by = character()) |>
    mutate(dist = abs(date.x - date.y)) |>
    group_by(id.x) |>
    filter(dist == min(dist)) |>
    select(-id.x, -id.y, -dist)
##> # A tibble: 19 × 7
##> # Groups:   id.x [19]
##>   date.x              value.x  id.x date.y              value.y     id.y dist
##>   <dttm>                <dbl> <int> <dttm>              <chr>      <int> <drtn>
##> 1 2021-11-23 20:56:06     500     1 2021-11-23 20:55:47 Ship Emma      1  19 s…
##> 2 2021-11-23 20:56:07     900     2 2021-11-23 20:55:47 Ship Emma      1  20 s…
##> 3 2021-11-23 20:56:08    1000     3 2021-11-23 20:55:47 Ship Emma      1  21 s…
##> 4 2021-11-23 20:56:09     200     4 2021-11-23 20:55:47 Ship Emma      1  22 s…
##> 5 2021-11-23 20:56:10     300     5 2021-11-23 20:55:47 Ship Emma      1  23 s…
##> 6 2021-11-23 20:56:11      10     6 2021-11-23 20:55:47 Ship Emma      1  24 s…
##> 7 2021-11-23 22:13:56    1000     7 2021-11-23 22:16:01 Ship Amy       5 125 s…
##> 8 2021-11-23 22:13:57     450     8 2021-11-23 22:16:01 Ship Amy       5 124 s…
##> 9 2021-11-23 22:13:58     950     9 2021-11-23 22:16:01 Ship Amy       5 123 s…
##>10 2021-11-23 22:13:59     600    10 2021-11-23 22:16:01 Ship Amy       5 122 s…
##>11 2021-11-24 03:23:21     100    11 2021-11-24 03:23:37 Ship Sally    12  16 s…
##>12 2021-11-24 03:23:22     750    12 2021-11-24 03:23:37 Ship Sally    12  15 s…
##>13 2021-11-24 03:23:23     150    13 2021-11-24 03:23:37 Ship Sally    12  14 s…
##>14 2021-11-24 03:23:24     200    14 2021-11-24 03:23:37 Ship Sally    12  13 s…
##>15 2021-11-24 03:23:25     300    15 2021-11-24 03:23:37 Ship Sally    12  12 s…
##>16 2021-11-24 03:24:34     400    16 2021-11-24 03:23:37 Ship Sally    12  57 s…
##>17 2021-11-24 03:24:35     900    17 2021-11-24 03:23:37 Ship Sally    12  58 s…
##>18 2021-11-24 03:24:36    1020    18 2021-11-24 03:23:37 Ship Sally    12  59 s…
##>19 2021-11-24 03:24:37     800    19 2021-11-24 03:23:37 Ship Sally    12  60 s…
```
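Both approaches above compare every row on one side with every row on the other and keep the smallest absolute difference. For readers who work in Python (the language of the last section on this page), here is a minimal standard-library sketch of that idea. The function name `nearest_by_cross_product` and the sample lists are hypothetical, chosen only for illustration.

```python
from datetime import datetime


def nearest_by_cross_product(left, right):
    """For each left timestamp, return the index of the closest right timestamp.

    Every left value is compared with every right value, so the cost is
    O(len(left) * len(right)). Ties go to the first minimum found.
    """
    matches = []
    for l in left:
        distances = [abs((l - r).total_seconds()) for r in right]
        matches.append(distances.index(min(distances)))
    return matches


# Illustrative values only (assumed, not a full copy of the data above).
readings = [datetime(2021, 11, 23, 20, 56, 6), datetime(2021, 11, 23, 22, 13, 56)]
ships = [datetime(2021, 11, 23, 20, 55, 47), datetime(2021, 11, 23, 22, 16, 1)]

print(nearest_by_cross_product(readings, ships))  # [0, 1]
```

The quadratic cost is fine for small inputs; for large ones, a sorted search such as a rolling join is likely the better choice.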

## How to match by nearest date from two data frames?

One way is to use the `roll=Inf` feature from the `data.table` package. Note that `roll=Inf` rolls the last observation forward, so each `date2` is matched to the most recent `date1` on or before it rather than to the closest date in either direction (for that, use `roll="nearest"`):

```r
require(data.table)   ## >= 1.9.2
setDT(df1)            ## convert to data.table by reference
setDT(df2)            ## same
df1[, date := date1]  ## create a duplicate of 'date1'
setkey(df1, date1)    ## set the column to perform the join on
setkey(df2, date2)    ## same as above
ans = df1[df2, roll=Inf] ## perform rolling join
## change names and set column order as required, by reference
setnames(ans, c('date','date1'), c('date1','date2'))
setcolorder(ans, c('epi', 'date1', 'bmi', 'date2'))
ans
#   epi      date1      bmi      date2
#1:   1 2014-01-08 33.57532 2014-01-08
#2:   2 2014-01-15 22.63604 2014-01-15
#3:   3 2014-01-26 22.22079 2014-01-28
#4:   4 2014-02-01 15.16691 2014-02-05
#5:   5 2014-02-15 27.48925 2014-02-24
```
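To make the difference between rolling forward and matching the nearest date concrete, here is a small standard-library Python sketch. It is not how `data.table` is implemented; the helper names `match_previous` and `match_nearest` and the lookup date are assumptions made for illustration, and the tie-breaking rule is this sketch's own choice.

```python
from bisect import bisect_left, bisect_right
from datetime import date


def match_previous(sorted_dates, target):
    """Latest date on or before target (last observation carried forward), or None."""
    i = bisect_right(sorted_dates, target)
    return sorted_dates[i - 1] if i else None


def match_nearest(sorted_dates, target):
    """Date closest to target in either direction; ties go to the earlier date.

    Assumes sorted_dates is non-empty and sorted ascending.
    """
    i = bisect_left(sorted_dates, target)
    # Only the neighbours on either side of the insertion point can be closest.
    candidates = sorted_dates[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda d: abs(d - target))


date1 = [date(2014, 1, 8), date(2014, 1, 15), date(2014, 1, 26),
         date(2014, 2, 1), date(2014, 2, 15)]
lookup = date(2014, 1, 31)  # hypothetical lookup date

print(match_previous(date1, lookup))  # 2014-01-26
print(match_nearest(date1, lookup))   # 2014-02-01
```

The two functions disagree whenever the closest date lies after the lookup date, which is the case to watch for when choosing between the two rolling modes.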

## Joining two data frames on the closest date in R

You were almost there.

In the `DT[i,on]` syntax, `i` should be `survey` so that the join keeps all of its rows:

```r
setDT(survey)
setDT(price)
survey_price <- price[survey, on=.(date=actual.date), roll="nearest"]
survey_price
```

```
         date price.var1 price.var2         ID
       <IDat>      <num>      <num>      <int>
1: 2012-09-26   4.100958   4.147176   20120377
2: 2020-11-23   2.747339   2.739948 2020455822
3: 2012-10-26   4.100958   4.147176   20126758
4: 2012-10-25   4.100958   4.147176   20124241
5: 2020-11-28   2.747339   2.739948 2020426572
```

## Matching nearest date between two data.frames

You could convert the character dates to `Date`, use `roll='nearest'`, and then keep only the matches that are less than 5 days apart:

```r
library(data.table)
setDT(df)             ## convert to data.table by reference
setDT(ONS)            ## same
df[, date := as.Date(date_of_sampling)]       ## add a Date column parsed from 'date_of_sampling'
setkey(df, date)      ## set the column to perform the join on
ONS[, date := as.Date(sample_date_midpoint)]  ## add a Date column parsed from 'sample_date_midpoint'
setkey(ONS, date)     ## same as above
ONS[df, roll='nearest'][
    abs(difftime(sample_date_midpoint, date_of_sampling, units='days')) < 5]
# Key: <date>
#    sample_date_midpoint prevalence       date date_of_sampling operator num_passengers
#                  <char>      <num>     <Date>           <char>   <char>          <num>
# 1:           2021-01-01  0.1964160 2021-01-01       2021-01-01        A            204
# 2:           2021-01-01  0.1964160 2021-01-03       2021-01-03        B            100
# 3:           2021-02-05  0.3906553 2021-02-03       2021-02-03        B            400
# 4:           2021-02-05  0.3906553 2021-02-04       2021-02-04        A            155
```
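The pattern here is "match the nearest date, then drop matches that are too far away". A minimal standard-library Python sketch of that pattern follows; the function name `nearest_within` and the second lookup date are hypothetical, and the strict `<` comparison mirrors the `< 5` filter in the R code.

```python
from bisect import bisect_left
from datetime import date, timedelta


def nearest_within(sorted_dates, target, max_gap):
    """Nearest date to target, or None if the gap is max_gap or larger."""
    i = bisect_left(sorted_dates, target)
    # Only the neighbours around the insertion point can be the nearest date.
    candidates = sorted_dates[max(i - 1, 0):i + 1]
    if not candidates:
        return None
    best = min(candidates, key=lambda d: abs(d - target))
    return best if abs(best - target) < max_gap else None


midpoints = [date(2021, 1, 1), date(2021, 2, 5)]
five_days = timedelta(days=5)

print(nearest_within(midpoints, date(2021, 2, 3), five_days))   # 2021-02-05
print(nearest_within(midpoints, date(2021, 1, 20), five_days))  # None
```

Returning `None` for out-of-range matches keeps the unmatched rows visible, whereas the R filter above drops them; which behaviour is preferable depends on the analysis.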

## Merge nearest date, and related variables from a another dataframe by group

Here is a solution that uses only base R:

```r
z <- lapply(intersect(df1$ID, df2$ID), function(id) {
   d1 <- subset(df1, ID == id)
   d2 <- subset(df2, ID == id)
   d1$indices <- sapply(d1$dateTarget, function(d) which.min(abs(d2$dateTarget - d)))
   d2$indices <- 1:nrow(d2)
   merge(d1, d2, by = c('ID', 'indices'))
})
z2 <- do.call(rbind, z)
z2$indices <- NULL
print(z2)
#    ID dateTarget.x Value dateTarget.y ValueMatch
# 1   3   2015-11-14    47   2015-07-06         48
# 2   3   2015-12-08    98   2015-07-06         48
# 3   3   2015-02-22    52   2015-03-09         94
# 4   3   2014-11-17    68   2014-12-15         95
# 5   3   2013-05-30    91   2013-04-01         85
# 6   1   2013-11-04    70   2014-02-21         35
# 7   1   2014-12-29    18   2014-12-06         88
# 8   2   2013-01-14    52   2013-04-08         77
# 9   2   2015-07-29    97   2015-08-01         68
# 10  2   2015-06-15    98   2015-08-01         68
```

## Merging data frames based on multiple nearest matches in R

Without knowing exactly how you want the result formatted, you can do this with the data.table rolling join with `roll="nearest"` that you mentioned.

In this case I've `melt`ed both data sets into long format so that the matching can be done in a single join.

```r
library(data.table)
setDT(df1)
setDT(df2)
df1[
    match(
        melt(df1, id.vars="julian")[
            melt(df2, measure.vars=names(df2)),
            on=c("variable","value"), roll="nearest"]$julian,
        julian),
]
#   julian        a        b         c         d
#1:      9 12.02948 13.54714  7.659482  6.784113
#2:     20 28.74620 20.24871 18.523935 17.801711
#3:     10 13.00511 14.57352  8.296155  6.942622
#4:     24 30.26931 24.20554 20.253149 22.017714
```

If you want a separate table for each join instead, you could do something like:

```r
lapply(names(df2), \(var)
  df1[df2, on=var, roll="nearest", .SD, .SDcols=names(df1)]
)
```

## Python - Join two dataframes based on closest date match and additional column

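My output is a bit different using the `merge_asof` method. Note that `merge_asof` searches backward by default, so each `inv_dt` is paired with the latest `alert_dt` on or before it within the same `pty`; pass `direction='nearest'` to match in either direction:

```python
import pandas as pd

df1['alert_dt'] = pd.to_datetime(df1['alert_dt'], dayfirst=True)
df2['inv_dt'] = pd.to_datetime(df2['inv_dt'], dayfirst=True)

df = pd.merge_asof(df2.sort_values('inv_dt'),
                   df1.sort_values('alert_dt'),
                   left_on='inv_dt',
                   right_on='alert_dt',
                   by='pty')
print(df)
```

```
      inv_dt pty   alert_dt
0 2020-06-07   A 2020-06-01
1 2020-06-14   A 2020-06-12
2 2020-06-27   A 2020-06-27
3 2020-07-12   B 2020-06-12
4 2020-08-15   B 2020-07-15
```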
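To show what a grouped as-of match does without relying on pandas, here is a minimal standard-library sketch. It is not pandas' implementation: the function name `asof_by_group` and the sample rows are hypothetical, and it only covers the backward direction (latest date on or before the lookup date, within the same group).

```python
from bisect import bisect_right
from collections import defaultdict
from datetime import date


def asof_by_group(left, right):
    """left and right are lists of (group, date) pairs.

    For each left row, return (group, date, matched), where matched is the
    latest right date in the same group on or before the left date, or None.
    """
    by_group = defaultdict(list)
    for group, d in right:
        by_group[group].append(d)
    for dates in by_group.values():
        dates.sort()

    result = []
    for group, d in left:
        dates = by_group.get(group, [])
        i = bisect_right(dates, d)
        result.append((group, d, dates[i - 1] if i else None))
    return result


# Hypothetical rows, for illustration only.
alerts = [("A", date(2020, 6, 1)), ("A", date(2020, 6, 12)), ("B", date(2020, 6, 12))]
invoices = [("A", date(2020, 6, 14)), ("B", date(2020, 6, 1))]

for row in asof_by_group(invoices, alerts):
    print(row)
```

The second invoice has no alert on or before its date in group `B`, so its match is `None`; a backward as-of join leaves such rows unmatched rather than reaching forward in time.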