How to Join Two Dataframes by Nearest Time-Date

Merge two dataframes by nearest date in R

I suggest two approaches. The first builds a distance matrix and performs a left-join-style merge of df1 to df2. Namely, the distance matrix is given by:

dateDist <- outer(pull(df1, date), pull(df2, date), "-") %>%
  abs()

Next, for each row of df1, the index of the closest row of df2 is given by:

  closest.df1 <- apply(dateDist, 1, which.min)

Finally, the merge is performed manually:

cbind(rename_with(df1, ~ paste0("df1.", .x)),
      rename_with(df2[closest.df1, ], ~ paste0("df2.", .x)))

##>+ df1.date df1.value df2.date df2.value
##>1 2021-11-23 20:56:06 500 2021-11-23 20:55:47 Ship Emma
##>1.1 2021-11-23 20:56:07 900 2021-11-23 20:55:47 Ship Emma
##>1.2 2021-11-23 20:56:08 1000 2021-11-23 20:55:47 Ship Emma
##>1.3 2021-11-23 20:56:09 200 2021-11-23 20:55:47 Ship Emma
##>1.4 2021-11-23 20:56:10 300 2021-11-23 20:55:47 Ship Emma
##>1.5 2021-11-23 20:56:11 10 2021-11-23 20:55:47 Ship Emma
##>5 2021-11-23 22:13:56 1000 2021-11-23 22:16:01 Ship Amy
##>5.1 2021-11-23 22:13:57 450 2021-11-23 22:16:01 Ship Amy
##>5.2 2021-11-23 22:13:58 950 2021-11-23 22:16:01 Ship Amy
##>5.3 2021-11-23 22:13:59 600 2021-11-23 22:16:01 Ship Amy
##>12 2021-11-24 03:23:21 100 2021-11-24 03:23:37 Ship Sally
##>12.1 2021-11-24 03:23:22 750 2021-11-24 03:23:37 Ship Sally
##>12.2 2021-11-24 03:23:23 150 2021-11-24 03:23:37 Ship Sally
##>12.3 2021-11-24 03:23:24 200 2021-11-24 03:23:37 Ship Sally
##>12.4 2021-11-24 03:23:25 300 2021-11-24 03:23:37 Ship Sally
##>12.5 2021-11-24 03:24:34 400 2021-11-24 03:23:37 Ship Sally
##>12.6 2021-11-24 03:24:35 900 2021-11-24 03:23:37 Ship Sally
##>12.7 2021-11-24 03:24:36 1020 2021-11-24 03:23:37 Ship Sally
##>12.8 2021-11-24 03:24:37 800 2021-11-24 03:23:37 Ship Sally

The second approach first computes the Cartesian product of all rows of df1 and df2, then keeps only the rows with the minimum distance. The trick here is to use inner_join(..., by = character()) to get all combinations of the two dataframes:

mutate(df1, id = row_number()) |>
  inner_join(mutate(df2, id = row_number()), by = character()) |>
  mutate(dist = abs(date.x - date.y)) |>
  group_by(id.x) |>
  filter(dist == min(dist)) |>
  select(-id.x, -id.y, -dist)

##>+ # A tibble: 19 × 7
##># Groups: id.x [19]
##> date.x value.x id.x date.y value.y id.y dist
##> <dttm> <dbl> <int> <dttm> <chr> <int> <drtn>
##> 1 2021-11-23 20:56:06 500 1 2021-11-23 20:55:47 Ship Emma 1 19 s…
##> 2 2021-11-23 20:56:07 900 2 2021-11-23 20:55:47 Ship Emma 1 20 s…
##> 3 2021-11-23 20:56:08 1000 3 2021-11-23 20:55:47 Ship Emma 1 21 s…
##> 4 2021-11-23 20:56:09 200 4 2021-11-23 20:55:47 Ship Emma 1 22 s…
##> 5 2021-11-23 20:56:10 300 5 2021-11-23 20:55:47 Ship Emma 1 23 s…
##> 6 2021-11-23 20:56:11 10 6 2021-11-23 20:55:47 Ship Emma 1 24 s…
##> 7 2021-11-23 22:13:56 1000 7 2021-11-23 22:16:01 Ship Amy 5 125 s…
##> 8 2021-11-23 22:13:57 450 8 2021-11-23 22:16:01 Ship Amy 5 124 s…
##> 9 2021-11-23 22:13:58 950 9 2021-11-23 22:16:01 Ship Amy 5 123 s…
##>10 2021-11-23 22:13:59 600 10 2021-11-23 22:16:01 Ship Amy 5 122 s…
##>11 2021-11-24 03:23:21 100 11 2021-11-24 03:23:37 Ship Sally 12 16 s…
##>12 2021-11-24 03:23:22 750 12 2021-11-24 03:23:37 Ship Sally 12 15 s…
##>13 2021-11-24 03:23:23 150 13 2021-11-24 03:23:37 Ship Sally 12 14 s…
##>14 2021-11-24 03:23:24 200 14 2021-11-24 03:23:37 Ship Sally 12 13 s…
##>15 2021-11-24 03:23:25 300 15 2021-11-24 03:23:37 Ship Sally 12 12 s…
##>16 2021-11-24 03:24:34 400 16 2021-11-24 03:23:37 Ship Sally 12 57 s…
##>17 2021-11-24 03:24:35 900 17 2021-11-24 03:23:37 Ship Sally 12 58 s…
##>18 2021-11-24 03:24:36 1020 18 2021-11-24 03:23:37 Ship Sally 12 59 s…
##>19 2021-11-24 03:24:37 800 19 2021-11-24 03:23:37 Ship Sally 12 60 s…

Join two dataframes by the closest datetime

Pandas has a handy merge_asof() function for these types of problems (https://pandas.pydata.org/docs/reference/api/pandas.merge_asof.html).

It requires a single key to merge on, so you can create a single date-time column in each dataframe and perform the merge:

df_A['date_time'] = pd.to_datetime(df_A.date_A + " " + df_A.time_A)
df_B['date_time'] = pd.to_datetime(df_B.date_B + " " + df_B.time_B)

# Sort the two dataframes by the new key, as required by merge_asof function
df_A.sort_values(by="date_time", inplace=True, ignore_index=True)
df_B.sort_values(by="date_time", inplace=True, ignore_index=True)

result_df = pd.merge_asof(df_A, df_B, on="date_time", direction="nearest")

Note that the direction argument is set to "nearest", as you requested. Other values are available, such as "backward" and "forward".
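For comparison, the other options behave differently: "backward" keeps the last row of df_B at or before each df_A timestamp, while "forward" keeps the first row at or after it. Below is a minimal sketch reusing the df_A and df_B prepared above; the tolerance argument is optional and simply caps how far apart a match may be:

import pandas as pd

# last df_B row whose date_time is <= each df_A date_time
backward_df = pd.merge_asof(df_A, df_B, on="date_time", direction="backward")

# first df_B row whose date_time is >= each df_A date_time
forward_df = pd.merge_asof(df_A, df_B, on="date_time", direction="forward")

# nearest match, but only if it lies within 10 minutes of the df_A timestamp
tolerant_df = pd.merge_asof(df_A, df_B, on="date_time",
                            direction="nearest",
                            tolerance=pd.Timedelta("10min"))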

Python: Join two dataframes based on hour and nearest minute of date index

Try pd.merge_asof() (assuming the index is of datetime type and sorted):

print(
    pd.merge_asof(
        df1,
        df2,
        left_index=True,
        right_index=True,
        direction="nearest",
    )
)

Prints:

                     price_x  price_y
date
2022-01-01 00:37:57       10     3000
2022-01-01 03:49:12       13     3027
2022-01-01 09:30:11       12     3021
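If the indexes are not already sorted DatetimeIndex objects, they can be prepared first. A small sketch, assuming the df1 and df2 from this example:

import pandas as pd

# merge_asof needs the join keys to be sorted; here the keys are the indexes
df1.index = pd.to_datetime(df1.index)
df2.index = pd.to_datetime(df2.index)
df1 = df1.sort_index()
df2 = df2.sort_index()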

Joining two data frames on the closest date in R

You were almost there.

In the DT[i, on] syntax, i should be survey so that the join keeps all of its rows:

setDT(survey)
setDT(price)
survey_price <- price[survey, on = .(date = actual.date), roll = "nearest"]
survey_price

         date price.var1 price.var2         ID
       <IDat>      <num>      <num>      <int>
1: 2012-09-26   4.100958   4.147176   20120377
2: 2020-11-23   2.747339   2.739948 2020455822
3: 2012-10-26   4.100958   4.147176   20126758
4: 2012-10-25   4.100958   4.147176   20124241
5: 2020-11-28   2.747339   2.739948 2020426572

Pandas Merge on Name and Closest Date

I'd also love to see the final solution you came up with to know how it shook out in the end.

One way to find the closest date is to compute the time delta between each date in the first DataFrame and every date in the second DataFrame, then use np.argmin to retrieve the date with the smallest delta.
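In vectorized form, that idea boils down to roughly the following sketch (hypothetical frames named left and right, each with a datetime column called date); the full worked example follows.

import numpy as np
import pandas as pd

# hypothetical inputs: each frame has a datetime column named "date"
left = pd.DataFrame({"date": pd.to_datetime(["2014-01-02", "2014-01-06"])})
right = pd.DataFrame({"date": pd.to_datetime(["2014-01-01", "2014-01-08"])})

# absolute time delta from every left date to every right date, then argmin per row
deltas = np.abs(left["date"].values[:, None] - right["date"].values[None, :])
closest_idx = deltas.argmin(axis=1)                 # index of the nearest right row
left["closest"] = right["date"].values[closest_idx]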

For example:

Setup

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
from io import StringIO

Data

a = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-02 00:00:00,29
2014-01-03 00:00:00,5
2014-01-04 00:00:00,73
2014-01-05 00:00:00,40
2014-01-06 00:00:00,45
2014-01-07 00:00:00,48
2014-01-08 00:00:00,2
2014-01-09 00:00:00,96
2014-01-10 00:00:00,82
2014-01-11 00:00:00,61
2014-01-12 00:00:00,68
2014-01-13 00:00:00,8
2014-01-14 00:00:00,94
2014-01-15 00:00:00,16
2014-01-16 00:00:00,31
2014-01-17 00:00:00,10
2014-01-18 00:00:00,34
2014-01-19 00:00:00,27
2014-01-20 00:00:00,58
2014-01-21 00:00:00,90
2014-01-22 00:00:00,41
2014-01-23 00:00:00,97
2014-01-24 00:00:00,7
2014-01-25 00:00:00,86
2014-01-26 00:00:00,62
2014-01-27 00:00:00,91
2014-01-28 00:00:00,0
2014-01-29 00:00:00,73
2014-01-30 00:00:00,22
2014-01-31 00:00:00,43
2014-02-01 00:00:00,87
2014-02-02 00:00:00,56
2014-02-03 00:00:00,45
2014-02-04 00:00:00,25
2014-02-05 00:00:00,92
2014-02-06 00:00:00,83
2014-02-07 00:00:00,13
2014-02-08 00:00:00,50
2014-02-09 00:00:00,48
2014-02-10 00:00:00,78"""

b = """timepoint,measure
2014-01-01 00:00:00,78
2014-01-08 00:00:00,29
2014-01-15 00:00:00,5
2014-01-22 00:00:00,73
2014-01-29 00:00:00,40
2014-02-05 00:00:00,45
2014-02-12 00:00:00,48
2014-02-19 00:00:00,2
2014-02-26 00:00:00,96
2014-03-05 00:00:00,82
2014-03-12 00:00:00,61
2014-03-19 00:00:00,68
2014-03-26 00:00:00,8
2014-04-02 00:00:00,94
"""

Look at the data

df1 = pd.read_csv(StringIO(a), parse_dates=['timepoint'])
df1.head()

timepoint measure
0 2014-01-01 78
1 2014-01-02 29
2 2014-01-03 5
3 2014-01-04 73
4 2014-01-05 40

df2 = pd.read_csv(StringIO(b), parse_dates=['timepoint'])
df2.head()

timepoint measure
0 2014-01-01 78
1 2014-01-08 29
2 2014-01-15 5
3 2014-01-22 73
4 2014-01-29 40

Function to find the closest date to a given date

def find_closest_date(timepoint, time_series, add_time_delta_column=True):
    # takes a pd.Timestamp() instance and a pd.Series of dates,
    # computes the delta between `timepoint` and each date in `time_series`,
    # and returns the closest date plus (optionally) its time delta
    deltas = np.abs(time_series - timepoint)
    idx_closest_date = np.argmin(deltas)
    res = {"closest_date": time_series.iloc[idx_closest_date]}
    idx = ['closest_date']
    if add_time_delta_column:
        res["closest_delta"] = deltas[idx_closest_date]
        idx.append('closest_delta')
    return pd.Series(res, index=idx)

df1[['closest', 'days_bt_x_and_y']] = df1.timepoint.apply(
    find_closest_date, args=[df2.timepoint])
df1.head(10)

timepoint measure closest days_bt_x_and_y
0 2014-01-01 78 2014-01-01 0 days
1 2014-01-02 29 2014-01-01 1 days
2 2014-01-03 5 2014-01-01 2 days
3 2014-01-04 73 2014-01-01 3 days
4 2014-01-05 40 2014-01-08 3 days
5 2014-01-06 45 2014-01-08 2 days
6 2014-01-07 48 2014-01-08 1 days
7 2014-01-08 2 2014-01-08 0 days
8 2014-01-09 96 2014-01-08 1 days
9 2014-01-10 82 2014-01-08 2 days

Merge the two DataFrames on the new closest date column

df3 = pd.merge(df1, df2, left_on=['closest'], right_on=['timepoint'])

colorder = [
    'timepoint_x',
    'closest',
    'timepoint_y',
    'days_bt_x_and_y',
    'measure_x',
    'measure_y',
]

df3 = df3.loc[:, colorder]
df3

timepoint_x closest timepoint_y days_bt_x_and_y measure_x measure_y
0 2014-01-01 2014-01-01 2014-01-01 0 days 78 78
1 2014-01-02 2014-01-01 2014-01-01 1 days 29 78
2 2014-01-03 2014-01-01 2014-01-01 2 days 5 78
3 2014-01-04 2014-01-01 2014-01-01 3 days 73 78
4 2014-01-05 2014-01-08 2014-01-08 3 days 40 29
5 2014-01-06 2014-01-08 2014-01-08 2 days 45 29
6 2014-01-07 2014-01-08 2014-01-08 1 days 48 29
7 2014-01-08 2014-01-08 2014-01-08 0 days 2 29
8 2014-01-09 2014-01-08 2014-01-08 1 days 96 29
9 2014-01-10 2014-01-08 2014-01-08 2 days 82 29
10 2014-01-11 2014-01-08 2014-01-08 3 days 61 29
11 2014-01-12 2014-01-15 2014-01-15 3 days 68 5
12 2014-01-13 2014-01-15 2014-01-15 2 days 8 5
13 2014-01-14 2014-01-15 2014-01-15 1 days 94 5
14 2014-01-15 2014-01-15 2014-01-15 0 days 16 5
15 2014-01-16 2014-01-15 2014-01-15 1 days 31 5
16 2014-01-17 2014-01-15 2014-01-15 2 days 10 5
17 2014-01-18 2014-01-15 2014-01-15 3 days 34 5
18 2014-01-19 2014-01-22 2014-01-22 3 days 27 73
19 2014-01-20 2014-01-22 2014-01-22 2 days 58 73
20 2014-01-21 2014-01-22 2014-01-22 1 days 90 73
21 2014-01-22 2014-01-22 2014-01-22 0 days 41 73
22 2014-01-23 2014-01-22 2014-01-22 1 days 97 73
23 2014-01-24 2014-01-22 2014-01-22 2 days 7 73
24 2014-01-25 2014-01-22 2014-01-22 3 days 86 73
25 2014-01-26 2014-01-29 2014-01-29 3 days 62 40
26 2014-01-27 2014-01-29 2014-01-29 2 days 91 40
27 2014-01-28 2014-01-29 2014-01-29 1 days 0 40
28 2014-01-29 2014-01-29 2014-01-29 0 days 73 40
29 2014-01-30 2014-01-29 2014-01-29 1 days 22 40
30 2014-01-31 2014-01-29 2014-01-29 2 days 43 40
31 2014-02-01 2014-01-29 2014-01-29 3 days 87 40
32 2014-02-02 2014-02-05 2014-02-05 3 days 56 45
33 2014-02-03 2014-02-05 2014-02-05 2 days 45 45
34 2014-02-04 2014-02-05 2014-02-05 1 days 25 45
35 2014-02-05 2014-02-05 2014-02-05 0 days 92 45
36 2014-02-06 2014-02-05 2014-02-05 1 days 83 45
37 2014-02-07 2014-02-05 2014-02-05 2 days 13 45
38 2014-02-08 2014-02-05 2014-02-05 3 days 50 45
39 2014-02-09 2014-02-12 2014-02-12 3 days 48 48
40 2014-02-10 2014-02-12 2014-02-12 2 days 78 48

How to merge two data frames based on nearest date

I don't think there's a quick, one-line way to do this kind of thing, but I believe the best approach is to do it this way:

  1. add a column to df1 with the closest date from the appropriate group in df2

  2. call a standard merge on these

As the size of your data grows, this "closest date" operation can become rather expensive unless you do something sophisticated. I like to use scikit-learn's NearestNeighbors for this sort of thing.

I've put together one approach along those lines that should scale relatively well.
First, we can generate some simple data:

import pandas as pd
import numpy as np
dates = pd.date_range('2015', periods=200, freq='D')

rand = np.random.RandomState(42)
i1 = np.sort(rand.permutation(np.arange(len(dates)))[:5])
i2 = np.sort(rand.permutation(np.arange(len(dates)))[:5])

df1 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i1],
                    'val1': rand.rand(5)})
df2 = pd.DataFrame({'Code': rand.randint(0, 2, 5),
                    'Date': dates[i2],
                    'val2': rand.rand(5)})

Let's check these out:

>>> df1
Code Date val1
0 0 2015-01-16 0.975852
1 0 2015-01-31 0.516300
2 1 2015-04-06 0.322956
3 1 2015-05-09 0.795186
4 1 2015-06-08 0.270832

>>> df2
Code Date val2
0 1 2015-02-03 0.184334
1 1 2015-04-13 0.080873
2 0 2015-05-02 0.428314
3 1 2015-06-26 0.688500
4 0 2015-06-30 0.058194

Now let's write an apply function that adds a column of nearest dates to df1 using scikit-learn:

from sklearn.neighbors import NearestNeighbors

def find_nearest(group, match, groupname):
    # restrict `match` to the rows sharing this group's key
    match = match[match[groupname] == group.name]
    # note: newer scikit-learn versions may require converting the datetime
    # values to a numeric representation (e.g. .astype('int64')) before fitting
    nbrs = NearestNeighbors(n_neighbors=1).fit(match['Date'].values[:, None])
    dist, ind = nbrs.kneighbors(group['Date'].values[:, None])

    group['Date1'] = group['Date']
    group['Date'] = match['Date'].values[ind.ravel()]
    return group

df1_mod = df1.groupby('Code').apply(find_nearest, df2, 'Code')
>>> df1_mod
Code Date val1 Date1
0 0 2015-05-02 0.975852 2015-01-16
1 0 2015-05-02 0.516300 2015-01-31
2 1 2015-04-13 0.322956 2015-04-06
3 1 2015-04-13 0.795186 2015-05-09
4 1 2015-06-26 0.270832 2015-06-08

Finally, we can merge these together with a straightforward call to pd.merge:

>>> pd.merge(df1_mod, df2, on=['Code', 'Date'])
Code Date val1 Date1 val2
0 0 2015-05-02 0.975852 2015-01-16 0.428314
1 0 2015-05-02 0.516300 2015-01-31 0.428314
2 1 2015-04-13 0.322956 2015-04-06 0.080873
3 1 2015-04-13 0.795186 2015-05-09 0.080873
4 1 2015-06-26 0.270832 2015-06-08 0.688500

Notice that rows 0 and 1 both matched the same val2; this is expected given the way you described your desired solution.
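As an aside, the same group-wise nearest match can also be expressed with pandas' merge_asof (shown in other answers on this page). A sketch, assuming the df1 and df2 generated above; note that it keeps df1's original Date rather than adding a Date1 column:

import pandas as pd

# merge_asof requires both frames to be sorted by the join key
nearest = pd.merge_asof(df1.sort_values('Date'),
                        df2.sort_values('Date'),
                        on='Date', by='Code', direction='nearest')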

Python - Join two dataframes based on closest date match and additional column

My output is a bit different with the merge_asof method (direction defaults to "backward", so each inv_dt is matched to the most recent alert_dt at or before it, within the same pty):

df1['alert_dt'] = pd.to_datetime(df1['alert_dt'], dayfirst=True)
df2['inv_dt'] = pd.to_datetime(df2['inv_dt'], dayfirst=True)

df = pd.merge_asof(df2.sort_values('inv_dt'),
                   df1.sort_values('alert_dt'),
                   left_on='inv_dt',
                   right_on='alert_dt',
                   by='pty')
print(df)
inv_dt pty alert_dt
0 2020-06-07 A 2020-06-01
1 2020-06-14 A 2020-06-12
2 2020-06-27 A 2020-06-27
3 2020-07-12 B 2020-06-12
4 2020-08-15 B 2020-07-15
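If a truly nearest match is wanted instead of the default backward search, direction="nearest" can be passed. A sketch, assuming the same df1 and df2 as above:

df_nearest = pd.merge_asof(df2.sort_values('inv_dt'),
                           df1.sort_values('alert_dt'),
                           left_on='inv_dt',
                           right_on='alert_dt',
                           by='pty',
                           direction='nearest')
print(df_nearest)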

Joining two data frames in R, by nearest date and one other variable

You could add region to the on condition:

price[survey, .(region, price, x.price_date, survey_date),
      on = .(region, price_date = survey_date), roll = "nearest"][]

region price x.price_date survey_date
<char> <num> <Date> <Date>
1: A 0.88010416 1999-10-26 1999-09-05
2: B 0.31026217 1999-12-25 1999-12-31
3: A -1.68732684 1999-04-27 1999-06-02
4: C 0.00500695 1999-08-09 1999-05-11
5: C 0.00500695 1999-08-09 1999-03-24
6: B -0.03763026 1999-08-12 1999-09-02
7: C -0.64701901 1999-12-24 2000-01-01
8: B -0.03763026 1999-08-12 1999-08-06
9: C 0.00500695 1999-08-09 1999-08-03
10: A 0.88010416 1999-10-26 1999-11-22
11: C 1.37001035 1999-09-15 1999-10-03
12: B 0.01831663 1999-07-01 1999-06-18
13: A -0.62743621 1999-03-20 1999-03-12
14: B 0.72397606 1999-02-18 1999-03-02
15: C -0.64701901 1999-12-24 1999-12-18

Note the use of the x. prefix to display the date column from the LHS (price) table.

How to join two dataframes by nearest time-date?

data.table should work for this (can you explain the error you're running into?), although it does tend to convert POSIXlt to POSIXct on its own (perhaps do that conversion on your datetime column manually to keep data.table happy). Also make sure you set the key column before using roll.

(I've created my own example tables here to make my life that little bit easier. If you want to use dput on yours, I'm happy to update this example with your data):

new <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00" ) ), data.new = c( "t","u","v" ) )
head( new, 2 )

date data.new
1: 2016-03-02 12:20:00 t
2: 2016-03-07 12:20:00 u

old <- data.table( date = as.POSIXct( c( "2016-03-02 12:20:00", "2016-03-07 12:20:00", "2016-04-02 12:20:00", "2015-03-02 12:20:00" ) ), data.old = c( "a","b","c","d" ) )
head( old, 2 )


date data.old
1: 2016-03-02 12:20:00 a
2: 2016-03-07 12:20:00 b

setkey( new, date )
setkey( old, date )

combined <- new[ old, roll = "nearest" ]
combined

date data.new data.old
1: 2015-03-02 12:20:00 t d
2: 2016-03-02 12:20:00 t a
3: 2016-03-07 12:20:00 u b
4: 2016-04-02 12:20:00 v c

I've intentionally made the two tables different row lengths, in order to show how the rolling join deals with multiple matches. You can switch the way it joins with:

combined <- old[ new, roll = "nearest" ]
combined

date data.old data.new
1: 2016-03-02 12:20:00 a t
2: 2016-03-07 12:20:00 b u
3: 2016-04-02 12:20:00 c v

