Merge nearest date, and related variables, from another dataframe by group
Here is a solution that uses only base R:
z <- lapply(intersect(df1$ID, df2$ID), function(id) {
  d1 <- subset(df1, ID == id)
  d2 <- subset(df2, ID == id)
  # For each date in d1, find the row of d2 with the closest date
  d1$indices <- sapply(d1$dateTarget, function(d) which.min(abs(d2$dateTarget - d)))
  d2$indices <- seq_len(nrow(d2))
  merge(d1, d2, by = c('ID', 'indices'))
})
z2 <- do.call(rbind, z)
z2$indices <- NULL
print(z2)
# ID dateTarget.x Value dateTarget.y ValueMatch
# 1 3 2015-11-14 47 2015-07-06 48
# 2 3 2015-12-08 98 2015-07-06 48
# 3 3 2015-02-22 52 2015-03-09 94
# 4 3 2014-11-17 68 2014-12-15 95
# 5 3 2013-05-30 91 2013-04-01 85
# 6 1 2013-11-04 70 2014-02-21 35
# 7 1 2014-12-29 18 2014-12-06 88
# 8 2 2013-01-14 52 2013-04-08 77
# 9 2 2015-07-29 97 2015-08-01 68
# 10 2 2015-06-15 98 2015-08-01 68
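One detail of the approach above is how which.min() handles ties. The following is a minimal sketch on made-up data (the dates and ValueMatch values are hypothetical, not the asker's), showing that when two candidate dates are equally close, the first one wins:

```r
# Hypothetical toy data for a single ID
d1 <- data.frame(ID = 1,
                 dateTarget = as.Date(c("2020-01-01", "2020-01-10", "2020-01-20")))
d2 <- data.frame(ID = 1,
                 dateTarget = as.Date(c("2020-01-05", "2020-01-15")),
                 ValueMatch = c(10, 20))

# Index of the closest d2 date for every d1 date
idx <- sapply(d1$dateTarget, function(d) {
  which.min(abs(as.numeric(d2$dateTarget - d)))
})

# 2020-01-10 is 5 days from both candidates; which.min() keeps the first
idx
d2$ValueMatch[idx]
```

If ties should be resolved differently (for example, always preferring the later date), the which.min() call is the place to change.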
R merge two dataframes by closest date
Left join dfB to dfA, take the difference between dates per row, and choose the smallest diff per id.
left_join(dfA, dfB, by = "id") %>%
  mutate(date_diff = abs(Answer_Date.x - Answer_Date.y)) %>%
  group_by(id) %>%
  filter(date_diff == min(date_diff)) %>%
  select(id, Answer_Date.x, Answer_Date.y, starts_with("x"), date_diff)
The output is:
# A tibble: 2 x 8
# Groups: id [2]
id Answer_Date.x Answer_Date.y x1 x2 x3 x4 date_diff
<fct> <date> <date> <dbl> <dbl> <dbl> <dbl> <drtn>
1 Apple 2013-12-07 2013-12-05 1 10 5 50 2 days
2 Banana 2014-12-07 2014-12-10 2 20 3 30 3 days
By the way, in your sample code the third Answer_Date in the definition of dfB should be "2014-12-10" instead of "2015-12-10".
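For readers without dplyr, the same join-then-filter idea can be sketched in base R with merge() and ave(). The data below is invented for illustration (only the column names follow the question), so treat it as a sketch rather than the asker's actual tables:

```r
# Hypothetical inputs; values are made up
dfA <- data.frame(id = c("Apple", "Banana"),
                  Answer_Date = as.Date(c("2013-12-07", "2014-12-07")))
dfB <- data.frame(id = c("Apple", "Apple", "Banana"),
                  Answer_Date = as.Date(c("2013-12-05", "2012-12-05", "2014-12-10")),
                  x1 = c(1, 9, 2))

# Join every dfB row to its dfA row by id
joined <- merge(dfA, dfB, by = "id", suffixes = c(".x", ".y"))

# Absolute gap in days between the two dates
joined$date_diff <- abs(as.numeric(joined$Answer_Date.x - joined$Answer_Date.y))

# Keep the rows whose gap equals the per-id minimum
closest <- joined[joined$date_diff == ave(joined$date_diff, joined$id, FUN = min), ]
closest
```

As with the filter() version, rows that tie for the minimum are all kept.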
Merge two dataframes by nearest date in R
I suggest two approaches. The first uses a distance matrix and performs the left join of df1 to df2 by hand. The distance matrix is given by:
dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()
Next, for each row of df1, the row of df2 with the closest date is given by:
closest.df1 <- apply(dateDist, 1, which.min)
Finally, the merge is performed manually:
cbind(rename_with(df1, ~paste0("df1.", "", .x)),
rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))
##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally
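The distance-matrix idea also works without dplyr. Here is a small base R sketch using three of the timestamps from the output above; converting to numeric seconds and fixing the time zone to UTC are assumptions made to keep the example self-contained:

```r
t1 <- as.POSIXct(c("2021-11-23 20:56:06", "2021-11-23 22:13:56",
                   "2021-11-24 03:23:21"), tz = "UTC")
t2 <- as.POSIXct(c("2021-11-23 20:55:47", "2021-11-23 22:16:01",
                   "2021-11-24 03:23:37"), tz = "UTC")

# length(t1) x length(t2) matrix of absolute gaps in seconds
dist_sec <- abs(outer(as.numeric(t1), as.numeric(t2), "-"))

# For each element of t1, the column index of the closest element of t2
closest <- apply(dist_sec, 1, which.min)

# The gap that was actually selected for each row
gap <- dist_sec[cbind(seq_along(t1), closest)]
data.frame(t1 = t1, nearest_t2 = t2[closest], gap_sec = gap)
```

Note that the matrix has one cell per pair of rows, so memory use grows with the product of the two table sizes.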
The second approach first computes the Cartesian product of all the rows of df1 and df2 and then keeps only the rows with the minimum distance. The trick here is to use inner_join(..., by = character()) to get all the combinations of the two dataframes (in dplyr 1.1.0 and later, cross_join() is the recommended way to do this):
mutate(df1, id = row_number()) |>
  inner_join(mutate(df2, id = row_number()), by = character()) |>
  mutate(dist = abs(date.x - date.y)) |>
  group_by(id.x) |>
  filter(dist == min(dist)) |>
  ungroup() |>
  # the output shown below was printed before this final select()
  select(-id.x, -id.y, -dist)
##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…
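The Cartesian-product approach can also be sketched in base R, where merge() with by = NULL returns every combination of rows. This is a stand-in for the dplyr cross join, not the answer's own code, and the data and column names (date1, date2, ship, row) are hypothetical:

```r
df1 <- data.frame(date1 = as.Date(c("2021-01-01", "2021-01-10")),
                  value = c(500, 900))
df2 <- data.frame(date2 = as.Date(c("2021-01-02", "2021-01-08", "2021-01-20")),
                  ship = c("Emma", "Amy", "Sally"))

# Tag each df1 row so the minimum can be taken per original row
df1$row <- seq_len(nrow(df1))

# by = NULL gives the Cartesian product: 2 x 3 = 6 rows
pairs <- merge(df1, df2, by = NULL)
pairs$dist <- as.numeric(abs(pairs$date1 - pairs$date2))

# Keep, for every df1 row, the pairing with the smallest distance
nearest <- pairs[pairs$dist == ave(pairs$dist, pairs$row, FUN = min), ]
nearest[, c("date1", "value", "date2", "ship")]
```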
Merge with nearest dates in R
A data.table option with roll = "nearest":
setDT(dataset1)[, c("Date", "Date1") := as.Date(Date)]
setDT(dataset2)[, c("Date", "nearest") := as.Date(Date)]
dataset2[dataset1, on = .(ID, Date), roll = "nearest"][, Date := NULL][]
ID nearest Date1
1: A 2021-05-02 2021-03-18
2: A 2021-05-02 2021-04-27
3: A 2021-05-02 2021-04-05
4: A 2021-05-02 2021-05-02
5: A 2021-01-01 2021-02-08
6: A 2021-06-16 2021-06-02
7: A 2021-06-16 2021-05-29
Alternatively, swap the two tables so that the result has one row per row of dataset2:
dataset1[dataset2, on = .(ID, Date), roll = "nearest"][, Date := NULL][]
ID Date1 nearest
1: A 2021-02-08 2021-01-01
2: A 2021-02-08 2021-01-01
3: A 2021-05-02 2021-05-02
4: A 2021-05-02 2021-05-09
5: A 2021-05-02 2021-05-09
6: A 2021-05-02 2021-05-09
7: A 2021-05-02 2021-05-09
8: A 2021-06-02 2021-06-16
9: A 2021-06-02 2021-06-27
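The reason both tables get a copy of their date column before the join is that, in a rolling join, the join column of the result carries the values from the table passed inside the brackets. A minimal sketch, assuming the data.table package is installed and using invented table names (obs, ref) with a few of the dates shown above:

```r
library(data.table)

obs <- data.table(ID = "A", Date = as.Date(c("2021-03-18", "2021-06-02")))
ref <- data.table(ID = "A",
                  Date = as.Date(c("2021-01-01", "2021-05-02", "2021-06-16")))

# Keep a copy of each date, since the join column itself is overwritten
obs[, obs_date := Date]
ref[, ref_date := Date]

# For every row of obs, roll to the nearest Date in ref within the same ID
res <- ref[obs, on = .(ID, Date), roll = "nearest"]
res[, .(ID, obs_date, ref_date)]
```

Only the last column listed in on is rolled; the columns before it (here ID) must match exactly.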
Merge two data frames based on nearest date if within a certain proximity of each other
There is probably a more succinct way of doing this, but the code below answers my question. It was informed by the answer from crestor.
library(lubridate)
library(tidyverse)
# dates in date format
df1$Date.df1 <- as.Date(df1$Date.df1, "%d/%m/%Y")
df2$Date.df2 <- as.Date(df2$Date.df2, "%d/%m/%Y")
# join rows in df1 and df2 that are nearest in submission date and
# within the threshold set in filter() below (3 months)
df1 <- df1 %>%
  as_tibble() %>%
  mutate(across(starts_with("Date."), ymd))
df2 <- df2 %>%
  as_tibble() %>%
  mutate(across(starts_with("Date."), ymd))
df_join <- df1 %>%
  inner_join(df2, by = "ID") %>%
  mutate(timediffvar = abs(time_length(difftime(Date.df1, Date.df2), "months"))) %>%
  filter(timediffvar <= 3) %>%
  # keep the closest df2 row for each df1 row
  arrange(timediffvar) %>%
  group_by(ID, Date.df1) %>%
  filter(row_number() == 1) %>%
  ungroup() %>%
  # then make sure each df2 row is used at most once
  arrange(timediffvar) %>%
  group_by(ID, Date.df2) %>%
  filter(row_number() == 1) %>%
  ungroup()
# identify entries not in the joined above
df1_notjoined <- anti_join(df1, df_join, by=c("ID", "Date.df1"))
df2_notjoined <- anti_join(df2, df_join, by=c("ID", "Date.df2"))
# join all entries together
mergevars_df1 <- names(df1)
mergevars_df2 <- names(df2)
df3 <- df_join %>%
  full_join(df1_notjoined, by = mergevars_df1) %>%
  full_join(df2_notjoined, by = mergevars_df2) %>%
  arrange(ID, Date.df1, Date.df2) %>%
  select("ID", "Date.df1", "Date.df2", "V1_df1", "V2_df1", "V1_df2", "V2_df2")
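The core of the proximity rule (take the nearest date, but only if it is close enough) can be illustrated in base R on plain vectors. This sketch deliberately simplifies the code above: the threshold is expressed in days rather than months, the dates and the max_gap_days value are hypothetical, and it does not enforce that each df2 row is used only once:

```r
d1 <- as.Date(c("2020-01-01", "2020-06-01"))
d2 <- as.Date(c("2020-01-20", "2021-01-01"))
max_gap_days <- 60  # hypothetical tolerance

# Pairwise absolute gaps in days
gap <- abs(outer(as.numeric(d1), as.numeric(d2), "-"))

# Nearest candidate for each element of d1, and how far away it is
idx <- apply(gap, 1, which.min)
best <- gap[cbind(seq_along(d1), idx)]

# Discard matches that fall outside the tolerance
idx[best > max_gap_days] <- NA
data.frame(d1 = d1, matched_d2 = d2[idx])
```

Unmatched rows end up with NA, which plays the role of the anti_join() and full_join() steps in the tidyverse version.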
Join Two Data Frames By Closest Date Without Going Over In R
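This can be done in SQL with sqldf and its default SQLite backend. Left join df2 to df1 on matching ticker and on the df2 date being less than or equal to the df1 date, then group by df1 row and take the maximum df2 date among the joined rows. SQLite takes the bare column df2.randomVar from the same row that supplies max(df2.date), and the trailing [-3] drops the max(df2.date) column from the result.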
library(sqldf)
sqldf("select df1.*, max(df2.date), df2.randomVar from df1
left join df2 on df1.ticker = df2.ticker and df1.date >= df2.date
group by df1.rowid
order by df1.rowid")[-3]
giving:
ticker date randomVar
1 AAPL 2019-01-06 -0.5321493
2 AAPL 2019-02-06 0.2121993
3 MSFT 2019-01-06 1.2336315
4 MSFT 2019-05-02 -0.5349596
Note
Inputs in reproducible form:
Lines1 <- "ticker date
1 AAPL 2019-01-06
2 AAPL 2019-02-06
3 MSFT 2019-01-06
4 MSFT 2019-05-02"
Lines2 <- "ticker date randomVar
1 AAPL 2019-01-03 -0.5321493
2 AAPL 2019-01-07 -0.7909461
3 AAPL 2019-02-06 0.2121993
4 MSFT 2019-01-05 1.2336315
5 MSFT 2019-01-07 -0.2729354
6 MSFT 2019-05-02 -0.5349596"
df1 <- read.table(text = Lines1, as.is = TRUE)
df2 <- read.table(text = Lines2, as.is = TRUE)
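The "closest without going over" rule in the SQL query above can also be sketched in base R with findInterval(), which returns the position of the last reference value that is less than or equal to each query value. The example below uses the AAPL rows from the inputs plus one extra, hypothetical query date (2019-01-01) to show what happens when nothing qualifies:

```r
# AAPL reference dates and values; must be sorted in increasing order
ref_dates <- as.Date(c("2019-01-03", "2019-01-07", "2019-02-06"))
ref_vals <- c(-0.5321493, -0.7909461, 0.2121993)

# Query dates; the last one is earlier than every reference date
query <- as.Date(c("2019-01-06", "2019-02-06", "2019-01-01"))

# Position of the latest reference date that is <= each query date
pos <- findInterval(as.numeric(query), as.numeric(ref_dates))

# findInterval() returns 0 when no reference date qualifies
pos[pos == 0] <- NA
data.frame(date = query, randomVar = ref_vals[pos])
```

To handle several tickers, this would need to be applied per ticker, for example with split() and lapply().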
How to join two dataframes by nearest time-date?
data.table should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you're setting the key column before using roll.
(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):
library(data.table)
new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )
date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u
old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )
date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b
setkey( new, date )
setkey( old, date )
combined <- new[ old, roll = "nearest" ]
combined
date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c
I've intentionally given the two tables different numbers of rows, in order to show how the rolling join deals with multiple matches: the result has one row for each row of the table inside the brackets. You can switch the direction of the join with:
combined <- old[ new, roll = "nearest" ]
combined
date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v
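The manual POSIXlt to POSIXct conversion mentioned earlier is a one-liner in base R. A minimal sketch, using one timestamp from the example tables and assuming UTC as the time zone:

```r
# POSIXlt stores broken-down fields (year, month, ...);
# POSIXct stores a single number of seconds since the epoch
lt <- as.POSIXlt("2016-03-02 12:20:00", tz = "UTC")
ct <- as.POSIXct(lt)

class(lt)
class(ct)

# The instant is unchanged by the conversion
format(ct, "%Y-%m-%d %H:%M:%S")
```

Doing the conversion on the datetime column before building the table avoids relying on any automatic coercion.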