Merge Data Based on Nearest Date R

Merge nearest date, and related variables from a another dataframe by group

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
   d1 <- subset(df1,ID==id)
   d2 <- subset(df2,ID==id)

   d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
   d2$indices <- 1:nrow(d2)

   merge(d1,d2,by=c('ID','indices'))
  })

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

#    ID dateTarget.x Value dateTarget.y ValueMatch
# 1   3   2015-11-14    47   2015-07-06         48
# 2   3   2015-12-08    98   2015-07-06         48
# 3   3   2015-02-22    52   2015-03-09         94
# 4   3   2014-11-17    68   2014-12-15         95
# 5   3   2013-05-30    91   2013-04-01         85
# 6   1   2013-11-04    70   2014-02-21         35
# 7   1   2014-12-29    18   2014-12-06         88
# 8   2   2013-01-14    52   2013-04-08         77
# 9   2   2015-07-29    97   2015-08-01         68
# 10  2   2015-06-15    98   2015-08-01         68

R merge two dataframes by closest date

Left join dfB to dfA, take the difference between dates per row and choose the smallest diff per id.

left_join(dfA, dfB, by = "id") %>%
  mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
  group_by(id) %>%
  filter(date_diff == min(date_diff)) %>%
  select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)

Then output is:

# A tibble: 2 x 8
# Groups:   id [2]
  id     Answer_Date.x Answer_Date.y    x1    x2    x3    x4 date_diff
  <fct>  <date>        <date>        <dbl> <dbl> <dbl> <dbl> <drtn>   
1 Apple  2013-12-07    2013-12-05        1    10     5    50 2 days   
2 Banana 2014-12-07    2014-12-10        2    20     3    30 3 days

By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
    abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", "", .x)),
      rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))

##>+                 df1.date df1.value            df2.date  df2.value
##>1    2021-11-23 20:56:06       500 2021-11-23 20:55:47  Ship Emma
##>1.1  2021-11-23 20:56:07       900 2021-11-23 20:55:47  Ship Emma
##>1.2  2021-11-23 20:56:08      1000 2021-11-23 20:55:47  Ship Emma
##>1.3  2021-11-23 20:56:09       200 2021-11-23 20:55:47  Ship Emma
##>1.4  2021-11-23 20:56:10       300 2021-11-23 20:55:47  Ship Emma
##>1.5  2021-11-23 20:56:11        10 2021-11-23 20:55:47  Ship Emma
##>5    2021-11-23 22:13:56      1000 2021-11-23 22:16:01   Ship Amy
##>5.1  2021-11-23 22:13:57       450 2021-11-23 22:16:01   Ship Amy
##>5.2  2021-11-23 22:13:58       950 2021-11-23 22:16:01   Ship Amy
##>5.3  2021-11-23 22:13:59       600 2021-11-23 22:16:01   Ship Amy
##>12   2021-11-24 03:23:21       100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22       750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23       150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24       200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25       300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34       400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35       900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36      1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37       800 2021-11-24 03:23:37 Ship Sally

The second approach involves first calculating the cartesian product of all the rows of df1 and df2 and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character()) to get all the combinations of the two dataframes :

mutate(df1, id = row_number()) %>%
    inner_join(mutate(df2, id = row_number()),by = character()) |>
    mutate(dist = abs(date.x - date.y)) |>
    group_by(id.x) |>
    filter(dist == min(dist)) |>
    select(-id.x, -id.y, -dist)

  ##>+ # A tibble: 19 × 7
  ##># Groups:   id.x [19]
  ##>   date.x              value.x  id.x date.y              value.y     id.y dist  
  ##>   <dttm>                <dbl> <int> <dttm>              <chr>      <int> <drtn>
  ##> 1 2021-11-23 20:56:06     500     1 2021-11-23 20:55:47 Ship Emma      1  19 s…
  ##> 2 2021-11-23 20:56:07     900     2 2021-11-23 20:55:47 Ship Emma      1  20 s…
  ##> 3 2021-11-23 20:56:08    1000     3 2021-11-23 20:55:47 Ship Emma      1  21 s…
  ##> 4 2021-11-23 20:56:09     200     4 2021-11-23 20:55:47 Ship Emma      1  22 s…
  ##> 5 2021-11-23 20:56:10     300     5 2021-11-23 20:55:47 Ship Emma      1  23 s…
  ##> 6 2021-11-23 20:56:11      10     6 2021-11-23 20:55:47 Ship Emma      1  24 s…
  ##> 7 2021-11-23 22:13:56    1000     7 2021-11-23 22:16:01 Ship Amy       5 125 s…
  ##> 8 2021-11-23 22:13:57     450     8 2021-11-23 22:16:01 Ship Amy       5 124 s…
  ##> 9 2021-11-23 22:13:58     950     9 2021-11-23 22:16:01 Ship Amy       5 123 s…
  ##>10 2021-11-23 22:13:59     600    10 2021-11-23 22:16:01 Ship Amy       5 122 s…
  ##>11 2021-11-24 03:23:21     100    11 2021-11-24 03:23:37 Ship Sally    12  16 s…
  ##>12 2021-11-24 03:23:22     750    12 2021-11-24 03:23:37 Ship Sally    12  15 s…
  ##>13 2021-11-24 03:23:23     150    13 2021-11-24 03:23:37 Ship Sally    12  14 s…
  ##>14 2021-11-24 03:23:24     200    14 2021-11-24 03:23:37 Ship Sally    12  13 s…
  ##>15 2021-11-24 03:23:25     300    15 2021-11-24 03:23:37 Ship Sally    12  12 s…
  ##>16 2021-11-24 03:24:34     400    16 2021-11-24 03:23:37 Ship Sally    12  57 s…
  ##>17 2021-11-24 03:24:35     900    17 2021-11-24 03:23:37 Ship Sally    12  58 s…
  ##>18 2021-11-24 03:24:36    1020    18 2021-11-24 03:23:37 Ship Sally    12  59 s…
  ##>19 2021-11-24 03:24:37     800    19 2021-11-24 03:23:37 Ship Sally    12  60 s…

merge with nearest dates in R

A data.table option with roll = "nearest":

setDT(dataset1)[, c("Date", "Date1") := as.Date(Date)]
setDT(dataset2)[, c("Date", "nearest") := as.Date(Date)]
dataset2[dataset1, on = .(ID, Date), roll = "nearest"][, Date := NULL][]

   ID    nearest      Date1
1:  A 2021-05-02 2021-03-18
2:  A 2021-05-02 2021-04-27
3:  A 2021-05-02 2021-04-05
4:  A 2021-05-02 2021-05-02
5:  A 2021-01-01 2021-02-08
6:  A 2021-06-16 2021-06-02
7:  A 2021-06-16 2021-05-29

Other option to match number of rows:

dataset1[dataset2, on = .(ID, Date), roll = "nearest"][, Date := NULL][]
   ID      Date1    nearest
1:  A 2021-02-08 2021-01-01
2:  A 2021-02-08 2021-01-01
3:  A 2021-05-02 2021-05-02
4:  A 2021-05-02 2021-05-09
5:  A 2021-05-02 2021-05-09
6:  A 2021-05-02 2021-05-09
7:  A 2021-05-02 2021-05-09
8:  A 2021-06-02 2021-06-16
9:  A 2021-06-02 2021-06-27

Merge two data frames based on nearest date if within a certain proximity of each other

There is probably a more succinct way of doing this, but the below answers my question. This was informed by the answer from crestor.

library(lubridate)
library(tidyverse)

# dates in date format
df1$Date.df1 <- as.Date(df1$Date.df1, "%d/%m/%Y")
df2$Date.df2 <- as.Date(df2$Date.df2, "%d/%m/%Y")

#join rows in df1 and df2 that are nearest in submission date and within two months of each other
df1 <- df1 %>%
  as_tibble() %>%
  mutate(across(starts_with("Date."), ymd))

df2 <- df2 %>%
  as_tibble() %>%
  mutate(across(starts_with("Date."), ymd))

df_join <- df1 %>%
  inner_join(df2, by = "ID") %>%
  mutate(timediffvar = abs(time_length(difftime(Date.df1, Date.df2),"months"))) %>%
  filter(
    (timediffvar <= 3)
  ) %>%
  arrange(timediffvar) %>%
  group_by(ID, Date.df1) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  arrange(timediffvar) %>%
  group_by(ID, Date.df2) %>%
  filter(row_number() == 1) %>%
  ungroup() 

# identify entries not in the joined above
df1_notjoined <- anti_join(df1, df_join, by=c("ID", "Date.df1"))
df2_notjoined <- anti_join(df2, df_join, by=c("ID", "Date.df2"))

# join all entries together
mergevars_df1 <- names(df1)
mergevars_df2 <- names(df2)

df3  <- df_join %>%
                  full_join(df1_notjoined,  by = mergevars_df1) %>%
                  full_join(df2_notjoined,  by = mergevars_df2) %>%
                  arrange(ID, Date.df1, Date.df2) %>%
                  select("ID", "Date.df1", "Date.df2", "V1_df1","V2_df1","V1_df2","V2_df2")

Join Two Data Frames By Closest Date Without Going Over In R

This can be done in SQL with the default SQLite backend using left join on ticker and on df2 date being less than or equal to the df1 date and then grouping over df1 and taking the max date from df2 of those joined to df1.

library(sqldf)
sqldf("select df1.*, max(df2.date), df2.randomVar from df1 
  left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
  group by df1.rowid
  order by df1.rowid")[-3]

giving:

  ticker       date  randomVar
1   AAPL 2019-01-06 -0.5321493
2   AAPL 2019-02-06  0.2121993
3   MSFT 2019-01-06  1.2336315
4   MSFT 2019-05-02 -0.5349596

Note

Inputs in reproducible form:

Lines1 <- "ticker       date
1   AAPL 2019-01-06
2   AAPL 2019-02-06
3   MSFT 2019-01-06
4   MSFT 2019-05-02"

Lines2 <- "ticker       date  randomVar
1   AAPL 2019-01-03 -0.5321493
2   AAPL 2019-01-07 -0.7909461
3   AAPL 2019-02-06  0.2121993
4   MSFT 2019-01-05  1.2336315
5   MSFT 2019-01-07 -0.2729354
6   MSFT 2019-05-02 -0.5349596"

df1 <- read.table(text = Lines1, as.is = TRUE)
df2 <- read.table(text = Lines2, as.is = TRUE)

How to join two dataframes by nearest time-date?

data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.

(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )

                  date data.new
1: 2016-03-02 12:20:00        t
2: 2016-03-07 12:20:00        u

old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )

                  date data.old
1: 2016-03-02 12:20:00        a
2: 2016-03-07 12:20:00        b

setkey( new, date )
setkey( old, date )

combined <- new[ old, roll = "nearest" ]
combined

                  date data.new data.old
1: 2015-03-02 12:20:00        t        d
2: 2016-03-02 12:20:00        t        a
3: 2016-03-07 12:20:00        u        b
4: 2016-04-02 12:20:00        v        c

I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:

combined <- old[ new, roll = "nearest" ]
combined

                  date data.old data.new
1: 2016-03-02 12:20:00        a        t
2: 2016-03-07 12:20:00        b        u
3: 2016-04-02 12:20:00        c        v