Merge two dataframes by nearest date in R
I suggest two approaches. The first uses a distance matrix and perform a left_join of df1 to df2. Namely the distance matrix is given by:
dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
abs()
Next, for each row of df1
, the row of df2
with closest distance is given by:
closest.df1 <- apply(dateDist, 1, which.min)
Finally, the merge is performed manually:
cbind(rename_with(df1, ~paste0("df1.", "", .x)),
rename_with(df2[closest.df1,], ~paste0("df2.", "", .x)))
##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally
The second approach involves first calculating the cartesian product of all the rows of df1
and df2
and then selecting only the rows with the minimum distance. The trick here is to use inner_join(..., by =character())
to get all the combinations of the two dataframes :
mutate(df1, id = row_number()) %>%
inner_join(mutate(df2, id = row_number()),by = character()) |>
mutate(dist = abs(date.x - date.y)) |>
group_by(id.x) |>
filter(dist == min(dist)) |>
select(-id.x, -id.y, -dist)
##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…
Join two dataframes by the closest datetime
Pandas has a handy merge_asof()
function for these types of problems (https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html)
It requires a single key to merge on, so you can create a single date-time column in each dataframe and perform the merge:
df_A['date_time'] = pd.to_datetime(df_A.date_A + " " + df_A.time_A)
df_B['date_time'] = pd.to_datetime(df_B.date_B + " " + df_B.time_B)
# Sort the two dataframes by the new key, as required by merge_asof function
df_A.sort_values(by="date_time", inplace=True, ignore_index=True)
df_B.sort_values(by="date_time", inplace=True, ignore_index=True)
result_df = pd.merge_asof(df_A, df_B, on="date_time", direction="nearest")
Note the direction argument's value is "nearest" as you requested. There are other values you can choose, like "backward" and "forward".
Python: Join two dataframes based on hour and nearest minute of date index
Try pd.merge_asof()
(assuming the index of DateTime type and sorted):
print(
pd.merge_asof(
df1,
df2,
left_index=True,
right_index=True,
direction="nearest",
)
)
Prints:
price_x price_y
date
2022-01-01 00:37:57 10 3000
2022-01-01 03:49:12 13 3027
2022-01-01 09:30:11 12 3021
Joining two data frames on the closest date in R
You were almost there.
In the DT[i,on]
syntax, i
should be survey
to join on all its rows
setDT(survey)
setDT(price)
survey_price <- price[survey,on=.(date=actual.date),roll="nearest"]
survey_price
date price.var1 price.var2 ID
<IDat> <num> <num> <int>
1: 2012-09-26 4.100958 4.147176 20120377
2: 2020-11-23 2.747339 2.739948 2020455822
3: 2012-10-26 4.100958 4.147176 20126758
4: 2012-10-25 4.100958 4.147176 20124241
5: 2020-11-28 2.747339 2.739948 2020426572
Pandas Merge on Name and Closest Date
I'd also love to see the final solution you came up with to know how it shook out in the end.
One thing you can do to find the closest date might be something to calc the number of days between each date in the first DataFrame and the dates in the second DataFrame. Then you can use np.argmin
to retrieve the date with the smallest time delta.
For example:
Setup
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from pandas.io.parsers import StringIO
Data
a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""
b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""
look at data
df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()
timepoint measure
0 2014-01-01 78
1 2014-01-02 29
2 2014-01-03 5
3 2014-01-04 73
4 2014-01-05 40
df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()
timepoint measure
0 2014-01-01 78
1 2014-01-08 29
2 2014-01-15 5
3 2014-01-22 73
4 2014-01-29 40
Func to find the closest date to a given date
def find_closest_date(timepoint, time_series, add_time_delta_column=True):
# takes a pd.Timestamp() instance and a pd.Series with dates in it
# calcs the delta between `timepoint` and each date in `time_series`
# returns the closest date and optionally the number of days in its time delta
deltas = np.abs(time_series - timepoint)
idx_closest_date = np.argmin(deltas)
res = {"closest_date": time_series.ix[idx_closest_date]}
idx = ['closest_date']
if add_time_delta_column:
res["closest_delta"] = deltas[idx_closest_date]
idx.append('closest_delta')
return pd.Series(res, index=idx)
df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
find_closest_date, args=[df2.timepoint])
df1.head(10)
timepoint measure closest days_bt_x_and_y
0 2014-01-01 78 2014-01-01 0 days
1 2014-01-02 29 2014-01-01 1 days
2 2014-01-03 5 2014-01-01 2 days
3 2014-01-04 73 2014-01-01 3 days
4 2014-01-05 40 2014-01-08 3 days
5 2014-01-06 45 2014-01-08 2 days
6 2014-01-07 48 2014-01-08 1 days
7 2014-01-08 2 2014-01-08 0 days
8 2014-01-09 96 2014-01-08 1 days
9 2014-01-10 82 2014-01-08 2 days
Merge the two DataFrames on the new closest
date column
df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])
colorder = [
'timepoint_x',
'closest',
'timepoint_y',
'days_bt_x_and_y',
'measure_x',
'measure_y'
]
df3 = df3.ix[:, colorder]
df3
timepoint_x closest timepoint_y days_bt_x_and_y measure_x measure_y
0 2014-01-01 2014-01-01 2014-01-01 0 days 78 78
1 2014-01-02 2014-01-01 2014-01-01 1 days 29 78
2 2014-01-03 2014-01-01 2014-01-01 2 days 5 78
3 2014-01-04 2014-01-01 2014-01-01 3 days 73 78
4 2014-01-05 2014-01-08 2014-01-08 3 days 40 29
5 2014-01-06 2014-01-08 2014-01-08 2 days 45 29
6 2014-01-07 2014-01-08 2014-01-08 1 days 48 29
7 2014-01-08 2014-01-08 2014-01-08 0 days 2 29
8 2014-01-09 2014-01-08 2014-01-08 1 days 96 29
9 2014-01-10 2014-01-08 2014-01-08 2 days 82 29
10 2014-01-11 2014-01-08 2014-01-08 3 days 61 29
11 2014-01-12 2014-01-15 2014-01-15 3 days 68 5
12 2014-01-13 2014-01-15 2014-01-15 2 days 8 5
13 2014-01-14 2014-01-15 2014-01-15 1 days 94 5
14 2014-01-15 2014-01-15 2014-01-15 0 days 16 5
15 2014-01-16 2014-01-15 2014-01-15 1 days 31 5
16 2014-01-17 2014-01-15 2014-01-15 2 days 10 5
17 2014-01-18 2014-01-15 2014-01-15 3 days 34 5
18 2014-01-19 2014-01-22 2014-01-22 3 days 27 73
19 2014-01-20 2014-01-22 2014-01-22 2 days 58 73
20 2014-01-21 2014-01-22 2014-01-22 1 days 90 73
21 2014-01-22 2014-01-22 2014-01-22 0 days 41 73
22 2014-01-23 2014-01-22 2014-01-22 1 days 97 73
23 2014-01-24 2014-01-22 2014-01-22 2 days 7 73
24 2014-01-25 2014-01-22 2014-01-22 3 days 86 73
25 2014-01-26 2014-01-29 2014-01-29 3 days 62 40
26 2014-01-27 2014-01-29 2014-01-29 2 days 91 40
27 2014-01-28 2014-01-29 2014-01-29 1 days 0 40
28 2014-01-29 2014-01-29 2014-01-29 0 days 73 40
29 2014-01-30 2014-01-29 2014-01-29 1 days 22 40
30 2014-01-31 2014-01-29 2014-01-29 2 days 43 40
31 2014-02-01 2014-01-29 2014-01-29 3 days 87 40
32 2014-02-02 2014-02-05 2014-02-05 3 days 56 45
33 2014-02-03 2014-02-05 2014-02-05 2 days 45 45
34 2014-02-04 2014-02-05 2014-02-05 1 days 25 45
35 2014-02-05 2014-02-05 2014-02-05 0 days 92 45
36 2014-02-06 2014-02-05 2014-02-05 1 days 83 45
37 2014-02-07 2014-02-05 2014-02-05 2 days 13 45
38 2014-02-08 2014-02-05 2014-02-05 3 days 50 45
39 2014-02-09 2014-02-12 2014-02-12 3 days 48 48
40 2014-02-10 2014-02-12 2014-02-12 2 days 78 48
How to merge two data frames based on nearest date
I don't think there's a quick, one-line way to do this kind of thing but I belive the best approach is to do it this way:
add a column to
df1
with the closest date from the appropriate group indf2
call a standard merge on these
As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbor
code for this sort of thing.
I've put together one approach to that solution that should scale relatively well.
First we can generate some simple data:
import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')
rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i1],
'val1':rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
'Date': dates[i2],
'val2':rand.rand(5)})
Let's check these out:
>>> df1
Code Date val1
0 0 2015-01-16 0.975852
1 0 2015-01-31 0.516300
2 1 2015-04-06 0.322956
3 1 2015-05-09 0.795186
4 1 2015-06-08 0.270832
>>> df2
Code Date val2
0 1 2015-02-03 0.184334
1 1 2015-04-13 0.080873
2 0 2015-05-02 0.428314
3 1 2015-06-26 0.688500
4 0 2015-06-30 0.058194
Now let's write an apply
function that adds a column of nearest dates to df1
using scikit-learn:
from sklearn.neighbors import NearestNeighbors
def find_nearest(group, match, groupname):
match = match[match[groupname] == group.name]
nbrs = NearestNeighbors(1).fit(match['Date'].values[:, None])
dist, ind = nbrs.kneighbors(group['Date'].values[:, None])
group['Date1'] = group['Date']
group['Date'] = match['Date'].values[ind.ravel()]
return group
df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')
>>> df1_mod
Code Date val1 Date1
0 0 2015-05-02 0.975852 2015-01-16
1 0 2015-05-02 0.516300 2015-01-31
2 1 2015-04-13 0.322956 2015-04-06
3 1 2015-04-13 0.795186 2015-05-09
4 1 2015-06-26 0.270832 2015-06-08
Finally, we can merge these together with a straightforward call to pd.merge
:
>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
Code Date val1 Date1 val2
0 0 2015-05-02 0.975852 2015-01-16 0.428314
1 0 2015-05-02 0.516300 2015-01-31 0.428314
2 1 2015-04-13 0.322956 2015-04-06 0.080873
3 1 2015-04-13 0.795186 2015-05-09 0.080873
4 1 2015-06-26 0.270832 2015-06-08 0.688500
Notice that rows 0 and 1 both matched the same val2
; this is expected given the way you described your desired solution.
Python - Join two dataframes based on closest date match and additional column
My output is a bit different in merge_asof
method:
df1['alert_dt'] = pd.to_datetime(df1['alert_dt'], dayfirst=True)
df2['inv_dt'] = pd.to_datetime(df2['inv_dt'], dayfirst=True)
df = pd.merge_asof(df2.sort_values('inv_dt'),
df1.sort_values('alert_dt'),
left_on='inv_dt',
right_on='alert_dt',
by='pty')
print (df)
inv_dt pty alert_dt
0 2020-06-07 A 2020-06-01
1 2020-06-14 A 2020-06-12
2 2020-06-27 A 2020-06-27
3 2020-07-12 B 2020-06-12
4 2020-08-15 B 2020-07-15
Joining two data frames in R, by nearest date and one other variable
You could add region to on
condition:
price[survey,.(region,price,x.price_date,survey_date),on=.(region,price_date=survey_date),roll="nearest"][]
region price x.price_date survey_date
<char> <num> <Date> <Date>
1: A 0.88010416 1999-10-26 1999-09-05
2: B 0.31026217 1999-12-25 1999-12-31
3: A -1.68732684 1999-04-27 1999-06-02
4: C 0.00500695 1999-08-09 1999-05-11
5: C 0.00500695 1999-08-09 1999-03-24
6: B -0.03763026 1999-08-12 1999-09-02
7: C -0.64701901 1999-12-24 2000-01-01
8: B -0.03763026 1999-08-12 1999-08-06
9: C 0.00500695 1999-08-09 1999-08-03
10: A 0.88010416 1999-10-26 1999-11-22
11: C 1.37001035 1999-09-15 1999-10-03
12: B 0.01831663 1999-07-01 1999-06-18
13: A -0.62743621 1999-03-20 1999-03-12
14: B 0.72397606 1999-02-18 1999-03-02
15: C -0.64701901 1999-12-24 1999-12-18
Note use of x.
to display LHS date.
How to join two dataframes by nearest time-date?
data.table
should work for this (can you explain the error you're coming up against?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table
happy). Also make sure you're setting the key column before using roll
.
(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):
new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )
date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u
old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )
date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b
setkey( new, date )
setkey( old, date )
combined <- new[ old, roll = "nearest" ]
combined
date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c
I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:
combined <- old[ new, roll = "nearest" ]
combined
date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v
Related Topics
What Does "S3 Methods" Mean in R
How to Fit a Smooth Curve to My Data in R
Factors in R: More Than an Annoyance
Select First Element of Nested List
How to Reorder Data.Table Columns (Without Copying)
Cowplot Made Ggplot2 Theme Disappear/How to See Current Ggplot2 Theme, and Restore the Default
How to Use Functions in One R Package Masked by Another Package
How to Select Last N Observation from Each Group in Dplyr Dataframe
Ggplot2 Heatmaps: Using Different Gradients for Categories
Global Variables in Packages in R
Change Background and Text of Strips Associated to Multiple Panels in R/Lattice
Calculating Mean for Every N Values from a Vector
Extract Prediction Band from Lme Fit
Code Chunk Font Size in Rmarkdown with Knitr and Latex
Same Function Over Multiple Data Frames in R
Extract Elements Common in All Column Groups
Merge Many Data Frames from CSV Files, When Id Column Is Implied