Best way to join / merge by range in pandas
Setup
Consider the dataframes A and B:
import numpy as np
import pandas as pd

A = pd.DataFrame(dict(
A_id=range(10),
A_value=range(5, 105, 10)
))
B = pd.DataFrame(dict(
B_id=range(5),
B_low=[0, 30, 30, 46, 84],
B_high=[10, 40, 50, 54, 84]
))
A
A_id A_value
0 0 5
1 1 15
2 2 25
3 3 35
4 4 45
5 5 55
6 6 65
7 7 75
8 8 85
9 9 95
B
B_high B_id B_low
0 10 0 0
1 40 1 30
2 50 2 30
3 54 3 46
4 84 4 84
numpy
The "easiest" way is to use numpy broadcasting.
We look for every instance of A_value being greater than or equal to B_low while at the same time A_value is less than or equal to B_high.
a = A.A_value.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh))
pd.concat([
    A.loc[i, :].reset_index(drop=True),
    B.loc[j, :].reset_index(drop=True)
], axis=1)
A_id A_value B_high B_id B_low
0 0 5 10 0 0
1 3 35 40 1 30
2 3 35 50 2 30
3 4 45 50 2 30
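For reuse, the same broadcasting trick can be wrapped in a small helper. This is a minimal sketch; the name range_join and its parameter names are my own, not from the original answer.
def range_join(left, right, value_col, low_col, high_col):
    # Inner join: keep each pairing where left[value_col] falls
    # within [right[low_col], right[high_col]].
    v = left[value_col].values
    lo = right[low_col].values
    hi = right[high_col].values
    i, j = np.where((v[:, None] >= lo) & (v[:, None] <= hi))
    return pd.concat([
        left.iloc[i].reset_index(drop=True),
        right.iloc[j].reset_index(drop=True)
    ], axis=1)

range_join(A, B, 'A_value', 'B_low', 'B_high')  # reproduces the table above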
To address the comments and give something akin to a left join, I appended the part of A that doesn't match.
pd.concat([
    pd.concat([
        A.loc[i, :].reset_index(drop=True),
        B.loc[j, :].reset_index(drop=True)
    ], axis=1),
    # rows of A with no match (DataFrame.append was removed in pandas 2.0,
    # so the appending is done with pd.concat)
    A[~np.isin(np.arange(len(A)), np.unique(i))]
], ignore_index=True, sort=False)
A_id A_value B_id B_low B_high
0 0 5 0.0 0.0 10.0
1 3 35 1.0 30.0 40.0
2 3 35 2.0 30.0 50.0
3 4 45 2.0 30.0 50.0
4 1 15 NaN NaN NaN
5 2 25 NaN NaN NaN
6 5 55 NaN NaN NaN
7 6 65 NaN NaN NaN
8 7 75 NaN NaN NaN
9 8 85 NaN NaN NaN
10 9 95 NaN NaN NaN
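A caveat on scale: the broadcasted boolean mask has shape len(A) x len(B), so memory grows with the product of the two lengths. If that becomes a problem, the comparison can be chunked over A. A sketch, with a hypothetical chunk size:
chunk = 100_000  # hypothetical; tune to available memory
parts_i, parts_j = [], []
for start in range(0, len(a), chunk):
    block = a[start:start + chunk, None]
    ii, jj = np.where((block >= bl) & (block <= bh))
    parts_i.append(ii + start)  # shift back to positions in the full array
    parts_j.append(jj)
i = np.concatenate(parts_i)
j = np.concatenate(parts_j)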
Merge/Join in pandas based on date in range of min/max dates in another df
I found my way. I extracted the ID columns as well, and added an equality condition to the i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh)) line from the related topic. Now I have:
a = A.A_value.values
aId = A.A_id.values
bId = B.B_id.values
bh = B.B_high.values
bl = B.B_low.values
i, j = np.where((a[:, None] >= bl) & (a[:, None] <= bh) & (aId[:, None] == bId))
This is almost instantaneous for my 80k lines, whereas before it took 3 seconds.
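For completeness, the matched rows can then be stitched together exactly as in the answer above (a sketch, assuming A and B are the two frames being joined):
pd.concat([
    A.iloc[i].reset_index(drop=True),
    B.iloc[j].reset_index(drop=True)
], axis=1)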
how to join two dataframes for which column values are within a certain range for multiple columns using pandas dataframe?
Solution 1: Simple Solution for small dataset
For a small dataset, you can cross join df1 and df2 with .merge(), then filter by the conditions where the Price is within range and the year is within range using .query(), as follows:
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
If your Pandas version is older than 1.2.0 (released in December 2020) and does not support merge with how='cross', you can use:
(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
Result:
Price year score
0 10 2001 20
4 70 2002 50
8 50 2010 30
Solution 2: Numpy Solution for large dataset
For a large dataset where performance is a concern, you can use numpy broadcasting (instead of a cross join and filtering) to speed up the execution time.
We look for rows where Price in df2 is within the price range in df1, and year in df2 is within the year range in df1:
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
Result:
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
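One note: np.column_stack on .values coerces both frames to a single common dtype (often object when column types are mixed). A dtype-preserving variant of the final step, sketched here, indexes the frames positionally instead:
pd.concat([
    df1.iloc[j].reset_index(drop=True),
    df2.iloc[i].reset_index(drop=True)
], axis=1)[['Price', 'year', 'score']]
Both produce the same rows; only the dtypes differ.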
Performance Comparison
Part 1: Compare for original datasets of 3 rows each:
Solution 1:
%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution 2:
%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark summary: 5.91 ms vs 703 µs, that is 8.4x faster
Part 2: Compare for datasets with 3,000 and 30,000 rows:
Data Setup:
df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
Solution 1:
%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values
d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]
3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark summary: 27.5 s vs 3.83 s, that is 7.2x faster
Merge pandas dataframes where one value is between two others
As you say, this is pretty easy in SQL, so why not do it in SQL?
import pandas as pd
import sqlite3
from datetime import datetime

# We'll use firelynx's tables:
presidents = pd.DataFrame({"name": ["Bush", "Obama", "Trump"],
                           "president_id": [43, 44, 45]})
terms = pd.DataFrame({'start_date': pd.date_range('2001-01-20', periods=5, freq='48M'),
                      'end_date': pd.date_range('2005-01-21', periods=5, freq='48M'),
                      'president_id': [43, 43, 44, 44, 45]})
war_declarations = pd.DataFrame({"date": [datetime(2001, 9, 14), datetime(2003, 3, 3)],
                                 "name": ["War in Afghanistan", "Iraq War"]})
# Make the db in memory
conn = sqlite3.connect(':memory:')
# Write the tables
terms.to_sql('terms', conn, index=False)
presidents.to_sql('presidents', conn, index=False)
war_declarations.to_sql('wars', conn, index=False)
qry = '''
select
start_date PresTermStart,
end_date PresTermEnd,
wars.date WarStart,
presidents.name Pres
from
terms join wars on
date between start_date and end_date join presidents on
terms.president_id = presidents.president_id
'''
df = pd.read_sql_query(qry, conn)
df:
PresTermStart PresTermEnd WarStart Pres
0 2001-01-31 00:00:00 2005-01-31 00:00:00 2001-09-14 00:00:00 Bush
1 2001-01-31 00:00:00 2005-01-31 00:00:00 2003-03-03 00:00:00 Bush
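If you would rather stay in pandas, the same between-join can be written with the cross-merge-and-query pattern shown earlier on this page. A sketch (pandas >= 1.2 for how='cross'; the name columns pick up _x/_y suffixes because presidents and wars both have a name column):
(terms.merge(presidents, on='president_id')
      .merge(war_declarations, how='cross')
      .query('start_date <= date <= end_date')
      .rename(columns={'start_date': 'PresTermStart', 'end_date': 'PresTermEnd',
                       'date': 'WarStart', 'name_x': 'Pres'})
      [['PresTermStart', 'PresTermEnd', 'WarStart', 'Pres']])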
Performing a merge in Pandas on a column containing a Python `range` or list-like
Since it looks like the ranges are pretty big and you are working with integer values, you can just compute the min and max of each range; because a range is contiguous, membership is equivalent to falling between its min and max:
columns = look_up.columns
look_up['minval'] = look_up['col3'].apply(min)
look_up['maxval'] = look_up['col3'].apply(max)
(sample.merge(look_up, on=['col1', 'col2'], how='left', suffixes=['', '_'])
    .query('minval <= col3 <= maxval')
    [columns]
)
Output:
col1 col2 col3 col4
1 1b 2b 42 h
2 1a 2b 3 c
5 1a 2a 21 b
6 1b 2a 7 e
pandas merge intervals by range
Here is an answer using pyranges and pandas. It is improved in that it does the merging really quickly, is easily parallelizable, and is very fast even in single-core mode.
Setup:
import pandas as pd
import pyranges as pr
import numpy as np
rows = int(1e6)
gr = pr.random(rows)
gr.probability = np.random.rand(rows)
gr.read = np.arange(rows)
print(gr)
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------|
# | chr1 | 149953099 | 149953199 | + | 0.7536048547309669 | 0 |
# | chr1 | 184344435 | 184344535 | + | 0.9358130407479777 | 1 |
# | chr1 | 238639916 | 238640016 | + | 0.024212603310159064 | 2 |
# | chr1 | 95180042 | 95180142 | + | 0.027139751993808026 | 3 |
# | ... | ... | ... | ... | ... | ... |
# | chrY | 34355323 | 34355423 | - | 0.8843190383030953 | 999996 |
# | chrY | 1818049 | 1818149 | - | 0.23138017743097572 | 999997 |
# | chrY | 10101456 | 10101556 | - | 0.3007915302642412 | 999998 |
# | chrY | 355910 | 356010 | - | 0.03694752911338561 | 999999 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+
# Stranded PyRanges object has 1,000,000 rows and 6 columns from 25 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
Execution:
def praderas(df):
    grpby = df.groupby("Cluster")
    prob = grpby.probability.sum()
    prob.name = "ProbSum"
    n = grpby.read.count()
    n.name = "Count"
    return df.merge(prob, on="Cluster").merge(n, on="Cluster")
%time result = gr.cluster().apply(praderas)
# 11.4s !
result[result.Count > 2]
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# | Chromosome | Start | End | Strand | probability | read | Cluster | ProbSum | Count |
# | (category) | (int32) | (int32) | (category) | (float64) | (int64) | (int32) | (float64) | (int64) |
# |--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------|
# | chr1 | 52952 | 53052 | + | 0.7411051557901921 | 59695 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 52959 | 53059 | + | 0.9979036360671423 | 356518 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 53029 | 53129 | + | 0.47409221639405397 | 104776 | 70 | 2.2131010082513884 | 3 |
# | chr1 | 64657 | 64757 | + | 0.32465233067499366 | 386140 | 88 | 1.3880589602361695 | 3 |
# | ... | ... | ... | ... | ... | ... | ... | ... | ... |
# | chrY | 59356855 | 59356955 | - | 0.3877207561218887 | 9966373 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356865 | 59356965 | - | 0.4007557656399032 | 9907364 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356932 | 59357032 | - | 0.33799123310907786 | 9978653 | 8502533 | 1.182153891322546 | 4 |
# | chrY | 59356980 | 59357080 | - | 0.055686136451676305 | 9994845 | 8502533 | 1.182153891322546 | 4 |
# +--------------+-----------+-----------+--------------+----------------------+-----------+-----------+--------------------+-----------+
# Stranded PyRanges object has 606,212 rows and 9 columns from 24 chromosomes.
# For printing, the PyRanges was sorted on Chromosome and Strand.
How to join two dataframes for which column values are within a certain range?
One simple solution is to create an interval index from start and end, setting closed='both', then use get_loc to look up the event for each timestamp (this assumes all the datetimes are of timestamp dtype):
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'],df_2['end'],closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x : df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
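If the intervals in df_2 never overlap, the row-by-row .apply can be replaced by a single vectorized lookup. A sketch, assuming every timestamp falls inside some interval (get_indexer returns -1 for misses, which would need handling):
idx = df_2.index.get_indexer(df_1['timestamp'])
df_1['event'] = df_2['event'].to_numpy()[idx]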