Merge Pandas Dataframes Where One Value Is Between Two Others

Merge pandas dataframes where one value is between two others

As you say, this is pretty easy in SQL, so why not do it in SQL?

from datetime import datetime

import pandas as pd
import sqlite3

# We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})

# Make the db in memory
conn = sqlite3.connect(':memory:')

# Write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)

qry = '''
    select
        start_date PresTermStart,
        end_date PresTermEnd,
        wars.date WarStart,
        presidents.name Pres
    from
        terms
        join wars on date between start_date and end_date
        join presidents on terms.president_id = presidents.president_id
    '''
df = pd.read_sql_query(qry, conn)

df:

        PresTermStart          PresTermEnd             WarStart  Pres
0 2001-01-31 00:00:00  2005-01-31 00:00:00  2001-09-14 00:00:00  Bush
1 2001-01-31 00:00:00  2005-01-31 00:00:00  2003-03-03 00:00:00  Bush
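
SQLite has no native datetime type, so the date columns come back as strings. If you want datetime64 columns, read_sql_query also takes a parse_dates argument; an optional tweak:

df = pd.read_sql_query(qry, conn, parse_dates=['PresTermStart', 'PresTermEnd', 'WarStart'])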

Merging two dataframes based on a date between two other dates without a common column

Create data and format to datetimes:

df_A = pd.DataFrame({'start_date':['2017-03-27','2017-01-10'],'end_date':['2017-04-20','2017-02-01']})
df_B = pd.DataFrame({'event_date':['2017-01-20','2017-01-27'],'price':[100,200]})

df_A['end_date'] = pd.to_datetime(df_A.end_date)
df_A['start_date'] = pd.to_datetime(df_A.start_date)
df_B['event_date'] = pd.to_datetime(df_B.event_date)

Create keys to do a cross join:

In pandas 1.2.0+ you can pass how='cross' instead of assigning pseudo keys:

df_merge = df_A.merge(df_B, how='cross')

Otherwise, with pandas < 1.2.0, assign a pseudo key and merge on 'key':

df_A = df_A.assign(key=1)
df_B = df_B.assign(key=1)
df_merge = pd.merge(df_A, df_B, on='key').drop('key',axis=1)

Filter out records that do not meet criteria of event dates between start and end dates:

df_merge = df_merge.query('event_date >= start_date and event_date <= end_date')

Join back to the original date range table and drop the key column (it only exists when the pseudo-key path was used, hence errors='ignore'):

df_out = df_A.merge(df_merge, on=['start_date','end_date'], how='left').fillna('').drop(columns='key', errors='ignore')

print(df_out)

Output:

             end_date          start_date           event_date price
0 2017-04-20 00:00:00 2017-03-27 00:00:00
1 2017-02-01 00:00:00 2017-01-10 00:00:00  2017-01-20 00:00:00   100
2 2017-02-01 00:00:00 2017-01-10 00:00:00  2017-01-27 00:00:00   200
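
The same pipeline can also be written as one chained expression; a sketch using the how='cross' path from above (query accepts the chained comparison form):

# Cross join, keep event dates inside each range, then left-join back to df_A
df_merge = (df_A.merge(df_B, how='cross')
                .query('start_date <= event_date <= end_date'))
df_out = df_A.merge(df_merge, on=['start_date', 'end_date'], how='left').fillna('')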

Best way to join / merge by range in pandas

Setup

Consider the dataframes A and B

A = pd.DataFrame(dict(
    A_id=range(10),
    A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
    B_id=range(5),
    B_low=[0, 30, 30, 46, 84],
    B_high=[10, 40, 50, 54, 84]
))

A

   A_id  A_value
0     0        5
1     1       15
2     2       25
3     3       35
4     4       45
5     5       55
6     6       65
7     7       75
8     8       85
9     9       95

B

   B_high  B_id  B_low
0      10     0      0
1      40     1     30
2      50     2     30
3      54     3     46
4      84     4     84

numpy

The "easiest" way is to use numpy broadcasting.

We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.

import numpy as np

a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values

i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))

pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

   A_id  A_value  B_high  B_id  B_low
0     0        5      10     0      0
1     3       35      40     1     30
2     3       35      50     2     30
3     4       45      50     2     30

To address the comments and give something akin to a left join, I appended the part of A that doesn't match.

# DataFrame.append was removed in pandas 2.0, so concatenate the unmatched rows of A instead
matched = pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)

pd.concat([
    matched,
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)

    A_id  A_value  B_id  B_low  B_high
0      0        5   0.0    0.0    10.0
1      3       35   1.0   30.0    40.0
2      3       35   2.0   30.0    50.0
3      4       45   2.0   30.0    50.0
4      1       15   NaN    NaN     NaN
5      2       25   NaN    NaN     NaN
6      5       55   NaN    NaN     NaN
7      6       65   NaN    NaN     NaN
8      7       75   NaN    NaN     NaN
9      8       85   NaN    NaN     NaN
10     9       95   NaN    NaN     NaN

Value between two values of another df in pandas

If your table isn't too big (this merge creates a cartesian product), you can merge and then filter:

# Merge on Key1
dfm = df1.merge(df2, on='Key1')

# Filter on value in range of initial and final
df1['Key2'] = dfm.loc[(dfm['Value'] >= dfm['Value Initial']) & (dfm['Value'] <= dfm['Value Final']), 'Key2']

df1

Output:

   Value  Key1 Key2
0     10    55    Y
1     20    55    Y
2     30    35    Z
3     40    35    Z
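
For context, the snippet above assumes inputs shaped roughly like the following (the ranges in df2 are hypothetical, chosen only so they reproduce the output shown):

# Hypothetical inputs for illustration; the real ranges may differ
df1 = pd.DataFrame({'Value': [10, 20, 30, 40],
                    'Key1': [55, 55, 35, 35]})
df2 = pd.DataFrame({'Key1': [55, 35],
                    'Value Initial': [0, 25],
                    'Value Final': [45, 45],
                    'Key2': ['Y', 'Z']})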

How to join two dataframes when only some dates in one dataframe is present between two other dates in other dataframe?

If your start_date and end_date do not overlap, create an interval index and merge your two dataframes:

bins = pd.IntervalIndex.from_arrays(df_A['start_date'],
                                    df_A['end_date'],
                                    closed='both')

out = df_B.assign(interval=pd.cut(df_B['event_date'], bins)) \
          .merge(df_A.assign(interval=bins), on='interval', how='left')

print(out[['event_date', 'price', 'start_date']])

# Output:
           event_date  price start_date
0 2021-04-01 00:06:00    100 2021-04-01
1 2021-05-01 00:03:00    200 2021-05-01
2 2021-05-04 00:00:00    500        NaT
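
This answer uses its own df_A and df_B; a hypothetical pair consistent with the output above could be (the end_date values are assumptions for illustration):

# Hypothetical data; end_date values are assumed
df_A = pd.DataFrame({'start_date': pd.to_datetime(['2021-04-01', '2021-05-01']),
                     'end_date': pd.to_datetime(['2021-04-02', '2021-05-02'])})
df_B = pd.DataFrame({'event_date': pd.to_datetime(['2021-04-01 00:06:00',
                                                   '2021-05-01 00:03:00',
                                                   '2021-05-04 00:00:00']),
                     'price': [100, 200, 500]})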

Pandas DataFrame merge between two values instead of matching one

I ended up realizing I was overthinking this: I added a column called merge to both tables, which was just all 1's.

Then I can merge on that column and do regular boolean filters on the resulting merged table.

a["merge"] = 1
b["merge"] = 1
c = a.merge(b, on="merge")

Then filter on c.
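
For example, assuming hypothetical bound columns low and high from b and a value column from a, the filter step could look like:

# Hypothetical column names; keep only rows where value falls inside [low, high]
c = c[(c['value'] >= c['low']) & (c['value'] <= c['high'])]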

Pandas merge two dataframes based on one column from one table lies in between two columns from another table

Quick and dirty way:

countries = []
for i in range(len(df1)):
    ip = df1.loc[i, 'ip']
    country = df2.query("low_ip <= @ip <= high_ip")['country'].to_numpy()

    if len(country) > 0:
        countries.append(country[0])
    else:
        countries.append('NA')

df1['country'] = countries

print(df1)

    ip country
0  0.1      NA
1  2.5       B
2  3.5       A
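
For larger frames, a vectorized sketch with pd.merge_asof avoids the Python loop; it matches each ip to the nearest low_ip at or below it and then validates high_ip, assuming the ranges in df2 do not overlap:

# Sketch, assuming non-overlapping ranges; both frames must be sorted on the join keys
left = df1.sort_values('ip')
merged = pd.merge_asof(left, df2.sort_values('low_ip'),
                       left_on='ip', right_on='low_ip', direction='backward')
merged.index = left.index  # merge_asof resets the index; restore df1's labels
df1['country'] = merged['country'].where(merged['ip'] <= merged['high_ip'], 'NA')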


