Merge Data Based on Nearest Date R

Merge nearest date, and related variables from a another dataframe by group

Here is the solution based on the base package:

z <- lapply(intersect(df1$ID,df2$ID),function(id) {
d1 <- subset(df1,ID==id)
d2 <- subset(df2,ID==id)

d1$indices <- sapply(d1$dateTarget,function(d) which.min(abs(d2$dateTarget - d)))
d2$indices <- 1:nrow(d2)

merge(d1,d2,by=c('ID','indices'))
})

z2 <- do.call(rbind,z)
z2$indices <- NULL

print(z2)

# ID dateTarget.x Value dateTarget.y ValueMatch
# 1 3 2015-11-14 47 2015-07-06 48
# 2 3 2015-12-08 98 2015-07-06 48
# 3 3 2015-02-22 52 2015-03-09 94
# 4 3 2014-11-17 68 2014-12-15 95
# 5 3 2013-05-30 91 2013-04-01 85
# 6 1 2013-11-04 70 2014-02-21 35
# 7 1 2014-12-29 18 2014-12-06 88
# 8 2 2013-01-14 52 2013-04-08 77
# 9 2 2015-07-29 97 2015-08-01 68
# 10 2 2015-06-15 98 2015-08-01 68

R merge two dataframes by closest date

Left join dfB to dfA, take the difference between dates per row and choose the smallest diff per id.

left_join(dfA, dfB, by = "id") %>%
mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
group_by(id) %>%
filter(date_diff == min(date_diff)) %>%
select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)

Then output is:

# A tibble: 2 x 8
# Groups: id [2]
id Answer_Date.x Answer_Date.y x1 x2 x3 x4 date_diff
<fct> <date> <date> <dbl> <dbl> <dbl> <dbl> <drtn>
1 Apple 2013-12-07 2013-12-05 1 10 5 50 2 days
2 Banana 2014-12-07 2014-12-10 2 20 3 30 3 days

By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".

Merge two dataframes by nearest date in R

I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()

Next, for each row of df1, the row of df2 with closest distance is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~paste0("df1.", "", .x)),
rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))

##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally

The second approach involves first calculating the cartesian product of all the rows of df1 and df2 and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character()) to get all the combinations of the two dataframes :

mutate(df1, id = row_number()) %>%
inner_join(mutate(df2, id = row_number()),by = character()) |>
mutate(dist = abs(date.x - date.y)) |>
group_by(id.x) |>
filter(dist == min(dist)) |>
select(-id.x, -id.y, -dist)

##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…

merge with nearest dates in R

A data.table option with roll = "nearest":

setDT(dataset1)[, c("Date", "Date1") := as.Date(Date)]
setDT(dataset2)[, c("Date", "nearest") := as.Date(Date)]
dataset2[dataset1, on = .(ID, Date), roll = "nearest"][, Date := NULL][]

ID nearest Date1
1: A 2021-05-02 2021-03-18
2: A 2021-05-02 2021-04-27
3: A 2021-05-02 2021-04-05
4: A 2021-05-02 2021-05-02
5: A 2021-01-01 2021-02-08
6: A 2021-06-16 2021-06-02
7: A 2021-06-16 2021-05-29

Other option to match number of rows:

dataset1[dataset2, on = .(ID, Date), roll = "nearest"][, Date := NULL][]
ID Date1 nearest
1: A 2021-02-08 2021-01-01
2: A 2021-02-08 2021-01-01
3: A 2021-05-02 2021-05-02
4: A 2021-05-02 2021-05-09
5: A 2021-05-02 2021-05-09
6: A 2021-05-02 2021-05-09
7: A 2021-05-02 2021-05-09
8: A 2021-06-02 2021-06-16
9: A 2021-06-02 2021-06-27

Merge two data frames based on nearest date if within a certain proximity of each other

There is probably a more succinct way of doing this, but the below answers my question. This was informed by the answer from crestor.

library(lubridate)
library(tidyverse)

# dates in date format
df1$Date.df1 <- as.Date(df1$Date.df1, "%d/%m/%Y")
df2$Date.df2 <- as.Date(df2$Date.df2, "%d/%m/%Y")

#join rows in df1 and df2 that are nearest in submission date and within two months of each other
df1 <- df1 %>%
as_tibble() %>%
mutate(across(starts_with("Date."), ymd))

df2 <- df2 %>%
as_tibble() %>%
mutate(across(starts_with("Date."), ymd))

df_join <- df1 %>%
inner_join(df2, by = "ID") %>%
mutate(timediffvar = abs(time_length(difftime(Date.df1, Date.df2),"months"))) %>%
filter(
(timediffvar <= 3)
) %>%
arrange(timediffvar) %>%
group_by(ID, Date.df1) %>%
filter(row_number() == 1) %>%
ungroup() %>%
arrange(timediffvar) %>%
group_by(ID, Date.df2) %>%
filter(row_number() == 1) %>%
ungroup()

# identify entries not in the joined above
df1_notjoined <- anti_join(df1, df_join, by=c("ID", "Date.df1"))
df2_notjoined <- anti_join(df2, df_join, by=c("ID", "Date.df2"))

# join all entries together
mergevars_df1 <- names(df1)
mergevars_df2 <- names(df2)

df3 <- df_join %>%
full_join(df1_notjoined, by = mergevars_df1) %>%
full_join(df2_notjoined, by = mergevars_df2) %>%
arrange(ID, Date.df1, Date.df2) %>%
select("ID", "Date.df1", "Date.df2", "V1_df1","V2_df1","V1_df2","V2_df2")

Join Two Data Frames By Closest Date Without Going Over In R

This can be done in SQL with the default SQLite backend using left join on ticker and on df2 date being less than or equal to the df1 date and then grouping over df1 and taking the max date from df2 of those joined to df1.

library(sqldf)
sqldf("select df1.*, max(df2.date), df2.randomVar from df1
left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
group by df1.rowid
order by df1.rowid")[-3]

giving:

  ticker       date  randomVar
1 AAPL 2019-01-06 -0.5321493
2 AAPL 2019-02-06 0.2121993
3 MSFT 2019-01-06 1.2336315
4 MSFT 2019-05-02 -0.5349596

Note

Inputs in reproducible form:

Lines1 <- "ticker       date
1 AAPL 2019-01-06
2 AAPL 2019-02-06
3 MSFT 2019-01-06
4 MSFT 2019-05-02"

Lines2 <- "ticker date randomVar
1 AAPL 2019-01-03 -0.5321493
2 AAPL 2019-01-07 -0.7909461
3 AAPL 2019-02-06 0.2121993
4 MSFT 2019-01-05 1.2336315
5 MSFT 2019-01-07 -0.2729354
6 MSFT 2019-05-02 -0.5349596"

df1 <- read.table(text = Lines1, as.is = TRUE)
df2 <- read.table(text = Lines2, as.is = TRUE)

How to join two dataframes by nearest time-date?

data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.

(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )

date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u

old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )

date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b

setkey( new, date )
setkey( old, date )

combined <- new[ old, roll = "nearest" ]
combined

date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c

I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:

combined <- old[ new, roll = "nearest" ]
combined

date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v


Related Topics



Leave a reply



Submit