How to join two dataframes for which column values are within a certain range for multiple columns using pandas?
Solution 1: Simple solution for a small dataset
For a small dataset, you can cross join df1 and df2 with .merge(), then filter with .query() for the rows where Price and year each fall within their respective ranges:
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
If your pandas version is older than 1.2.0 (released December 2020) and does not support merge with how='cross', you can use:
(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
Result:
Price year score
0 10 2001 20
4 70 2002 50
8 50 2010 30
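The original df1 and df2 aren't shown; a minimal self-contained setup with assumed sample data that reproduces the result above:

```python
import pandas as pd

# Assumed sample data: df1 holds the ranges, df2 the values to match.
df1 = pd.DataFrame({'price_start': [5, 60, 40],
                    'price_end':   [20, 80, 60],
                    'year_start':  [2000, 2001, 2005],
                    'year_end':    [2005, 2003, 2012]})
df2 = pd.DataFrame({'Price': [10, 70, 50],
                    'year':  [2001, 2002, 2010],
                    'score': [20, 50, 30]})

out = (df1.merge(df2, how='cross')
       .query('(Price >= price_start) & (Price <= price_end) & '
              '(year >= year_start) & (year <= year_end)')
       [['Price', 'year', 'score']])
```

Note that the surviving index labels (0, 4, 8) come from the 3 x 3 cross join, which is why the result above is not renumbered.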
Solution 2: NumPy solution for a large dataset
For a large dataset where performance is a concern, you can use NumPy broadcasting (instead of a cross join plus filtering) to speed up execution.
We look for rows where Price in df2 falls within the price range in df1 and year in df2 falls within the year range in df1:
import numpy as np

d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
Result:
Price year score
0 10 2001 20
1 70 2002 50
2 50 2010 30
Performance Comparison
Part 1: Compare for original datasets of 3 rows each:
Solution 1:
%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Solution 2:
%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values
d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns)
)[['Price', 'year', 'score']]
703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Benchmark summary: 5.91 ms vs 703 µs, i.e. 8.4x faster
Part 2: Compare for datasets with 3,000 and 30,000 rows:
Data Setup:
df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)
Solution 1:
%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Solution 2:
%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values
d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE) & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))
pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns)
)[['Price', 'year', 'score']]
3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Benchmark summary: 27.5 s vs 3.83 s, i.e. 7.2x faster
How to join two dataframes for which column values are within a certain range?
One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up the event. (This assumes all the date-time values have Timestamp dtype.)
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
Output:
timestamp A B event
0 2016-05-14 10:54:33 0.020228 0.026572 E1
1 2016-05-14 10:54:34 0.057780 0.175499 E2
2 2016-05-14 10:54:35 0.098808 0.620986 E2
3 2016-05-14 10:54:36 0.158789 1.014819 E2
4 2016-05-14 10:54:39 0.038129 2.384590 E3
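A runnable sketch of this approach, with assumed sample data (the original frames aren't shown) whose intervals match the output above:

```python
import pandas as pd

# Assumed sample data: events with [start, end] windows, and timestamps to label.
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:37']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:36',
                             '2016-05-14 10:54:39']),
    'event': ['E1', 'E2', 'E3'],
})
df_1 = pd.DataFrame({'timestamp': pd.to_datetime(
    ['2016-05-14 10:54:33', '2016-05-14 10:54:34', '2016-05-14 10:54:35',
     '2016-05-14 10:54:36', '2016-05-14 10:54:39'])})

# Each timestamp is located in exactly one closed interval.
df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])
```

Note that get_loc returns a single integer position only when the intervals do not overlap; overlapping intervals would return multiple matches.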
How to join two dataframes for which two column values are within two certain ranges in Python?
Option 1
If you're using pandas 1.2.0 or later, you can create the cartesian product of both dataframes and then check the conditions. Also, since you don't need RT [min] and Molecular Weight from df1, I'll assume you have already removed them:
df3 = df1.merge(df2, how = 'cross', suffixes = [None,None])
#check if 'Molecular Weight' is in the interval:
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])
#check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])
df3 = df3[mask1 & mask2].reset_index(drop = True)
Output:
df3
Name df1 RT [min]+0.2 RT [min]-0.2 ... Name df2 Molecular Weight RT [min]
0 unknow compound 1 7.79 7.39 ... β-D-Glucopyranuronic acid 194.0422 7.483
1 unknow compound 2 7.71 7.31 ... β-D-Glucopyranuronic acid 194.0422 7.483
2 unknow compound 2 7.71 7.31 ... α,α-Trehalose 194.1000 7.350
3 unknow compound 3 7.61 7.21 ... β-D-Glucopyranuronic acid 194.0422 7.483
4 unknow compound 3 7.61 7.21 ... α,α-Trehalose 194.1000 7.350
Option 2
As your data is considerably large, you may want to use a generator so that you don't load the whole resulting dataframe into memory. Again, I'm assuming you removed RT [min] and Molecular Weight from df1.
import numpy as np
from itertools import product
def df_iter(df1, df2):
    for row1, row2 in product(df1.values, df2.values):
        # RT [min]-0.2 <= RT [min] <= RT [min]+0.2
        if row1[2] <= row2[2] <= row1[1]:
            # Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
            if row1[4] <= row2[1] <= row1[3]:
                yield np.concatenate((row1, row2))

df3_rows = df_iter(df1, df2)
Then you can manipulate the rows:
for row in df3_rows:
    print(row)
Output:
['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]
Or create a dataframe:
df3 = pd.DataFrame(data=list(df3_rows),
                   columns=np.concatenate((df1.columns, df2.columns)))
Which results in the same dataframe from Option 1.
NOTE 1: Be careful with the positional indices in the conditionals inside df_iter; they work for my column order.
NOTE 2: I'm pretty sure your data doesn't match the example df3.
Merging two DataFrame using a range of columns (Right on ID and left on multiple IDs)
Suppose we have the following two dataframes:
import pandas as pd
import numpy as np
df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"]
    }
)
df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"]
    }
)
The second dataframe will be:
id_name id_surname id_first_name month year column_B
0 1.0 NaN NaN Jan 2022 check
1 NaN 2.0 NaN Mar 2020 check_
2 NaN NaN 3.0 Apr 2021 check__
You can create a new id column for the second dataframe by keeping all non-NaN values from the three columns id_name, id_surname, and id_first_name: start from the id_name column, fill its NaNs with the non-NaN values of id_surname, then fill the remaining NaNs with the non-NaN values of id_first_name. The code to do that is:
df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])
which will create the id column for df2:
id_name id_surname id_first_name month year column_B id
0 1.0 NaN NaN Jan 2022 check 1.0
1 NaN 2.0 NaN Mar 2020 check_ 2.0
2 NaN NaN 3.0 Apr 2021 check__ 3.0
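An equivalent way to coalesce the three id columns, sketched here as an alternative to the fillna chain, is to back-fill across the columns and keep the first one:

```python
import pandas as pd
import numpy as np

df2 = pd.DataFrame({
    "id_name":       [1, np.nan, np.nan],
    "id_surname":    [np.nan, 2, np.nan],
    "id_first_name": [np.nan, np.nan, 3],
})

# bfill along axis=1 pulls each row's first non-NaN id into the first column.
df2["id"] = df2[["id_name", "id_surname", "id_first_name"]].bfill(axis=1).iloc[:, 0]
```

This scales more conveniently when there are many candidate id columns.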
Finally, you can merge by:
merged = pd.merge(
    df1,
    df2,
    left_on=["id", "month", "year"],
    right_on=["id", "month", "year"],
    how="left",
)
and the result will be:
id month year column_A id_name id_surname id_first_name column_B
0 1 Jan 2022 test 1.0 NaN NaN check
1 2 Mar 2020 test_ NaN 2.0 NaN check_
2 3 Apr 2021 test__ NaN NaN 3.0 check__
Join two dataframes by range and values
First find the smallest Value that is larger than Start, then make sure it is smaller than End:
import pandas as pd
df1 = pd.DataFrame({'Value':[11000,21040,12050], 'Responsible':['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start':[10001,20001], 'End':[20000, 30000]})
df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on='Start', right_on='Value', direction='forward')
df = df[df['Value']<df['End']].drop(columns = 'Value')
Start End Responsible
0 10001 20000 Jack
1 20001 30000 Dylan
Left join pandas if column value is within a certain range?
Since pandas 1.2.0, you can use a cross merge, which creates the cartesian product of the two DataFrames. So cross merge and keep only the rows where the states match. Then find the absolute difference between the zip codes and use it to identify, for each Zip_left, the row with the smallest distance. Finally, mask the rows where the difference is greater than 15 (even if it is the closest), filling them with NaN:
merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
merged = merged[merged['State_left']==merged['State_right']]
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']]
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff']>15)
out = merged.drop(columns=['State_right','Diff']).rename(columns={'State_left':'State'}).reset_index(drop=True)
Output:
Zip_left State Zip_right Average_Rent
0 10001 NY 10003.0 1200.0
1 10007 NY 10008.0 1460.0
2 10013 NY 10010.0 1900.0
3 90011 CA 90011.0 850.0
4 91331 CA NaN NaN
5 90650 CA 90645.0 2300.0
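The df_left/df_right frames aren't shown; a setup like the following (assumed sample data) reproduces the output above end to end:

```python
import pandas as pd

# Assumed sample data matching the output shown above.
df_left = pd.DataFrame({'Zip': [10001, 10007, 10013, 90011, 91331, 90650],
                        'State': ['NY', 'NY', 'NY', 'CA', 'CA', 'CA']})
df_right = pd.DataFrame({'Zip': [10003, 10008, 10010, 90011, 90645],
                         'State': ['NY', 'NY', 'NY', 'CA', 'CA'],
                         'Average_Rent': [1200, 1460, 1900, 850, 2300]})

merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
merged = merged[merged['State_left'] == merged['State_right']]
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()
# Keep, per Zip_left, only the row(s) at the minimum distance.
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']]
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff'] > 15)
out = (merged.drop(columns=['State_right', 'Diff'])
       .rename(columns={'State_left': 'State'})
       .reset_index(drop=True))
```

Here 91331 has no right-side zip within 15, so its Zip_right and Average_Rent are masked to NaN.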
Join two data frames by considering whether values of paired columns are in the range of the paired columns in the other dataframe
I would propose two ways of doing this, depending on your preference. The first uses SQL instead of R for the task; it's a bit more straightforward for the type of join you're describing.
library(sqldf)
library(dplyr)
df1<-data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3))
df2<-data.frame("m1"=c("100010","100010","100010"),"m2"=c("100020","100020","100020"),"week"=c(1,2,3),"freq"=c(3,1,2))
df3<- data.frame("m1"=c("100010","100010","100010","100020","100020","100020"),"m2"=c("100020","100020","100020","100010","100010","100010"),"week"=c(1,2,3,1,1,3),"freq"=c(3,1,2,3,3,2))
df_sql <-
  sqldf::sqldf("SELECT a.*, b.freq
                FROM df1 a
                LEFT JOIN df2 b
                ON (a.week = b.week AND a.m1 = b.m1 AND a.m2 = b.m2) OR
                   (a.week = b.week AND a.m1 = b.m2 AND a.m2 = b.m1)")
identical(df_sql, df3)
#> [1] TRUE
I am sure there are more elegant ways to do this, but the second strategy is simply to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.
df <-
  df2 %>%
  rename(m2 = m1, m1 = m2) %>%
  bind_rows(df2, .) %>%
  left_join(df1, ., by = c("week", "m1", "m2"))
identical(df, df3)
#> [1] TRUE
I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.
Created on 2022-02-17 by the reprex package (v2.0.1)
Join two dataframes where the column values (a set) are a subset of the other
Create your dataframes
import pandas as pd
df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [set(['foo', 'baz']), set(['bar', 'baz'])]})
df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']),
                              set(['bar', 'baz', 'foo']), set(['one', 'two', 'bar'])],
                    'other': [1, 2, 3, 2]})
then make a cartesian product
merged_df = df1.merge(df2, on='key')
merged_df
key id items_x items_y other
0 1 0 {baz, foo} {foo, baz, bar} 1
1 1 0 {baz, foo} {foo, baz, bar} 2
2 1 0 {baz, foo} {foo, baz, bar} 3
3 1 0 {baz, foo} {one, bar, two} 2
4 1 1 {baz, bar} {foo, baz, bar} 1
5 1 1 {baz, bar} {foo, baz, bar} 2
6 1 1 {baz, bar} {foo, baz, bar} 3
7 1 1 {baz, bar} {one, bar, two} 2
define your custom function and see if it works in one case
def check_if_all_in_list(list1, list2):
    return all(elem in list2 for elem in list1)

check_if_all_in_list(merged_df['items_x'][0], merged_df['items_y'][0])
True
Create your match
merged_df['check'] = merged_df.apply(lambda row: check_if_all_in_list(row['items_x'], row['items_y']), axis=1)
merged_df
key id items_x items_y other check
0 1 0 {baz, foo} {foo, baz, bar} 1 True
1 1 0 {baz, foo} {foo, baz, bar} 2 True
2 1 0 {baz, foo} {foo, baz, bar} 3 True
3 1 0 {baz, foo} {one, bar, two} 2 False
4 1 1 {baz, bar} {foo, baz, bar} 1 True
5 1 1 {baz, bar} {foo, baz, bar} 2 True
6 1 1 {baz, bar} {foo, baz, bar} 3 True
7 1 1 {baz, bar} {one, bar, two} 2 False
now filter out what you don't want
mask = (merged_df['check']==True)
merged_df[mask]
key id items_x items_y other check
0 1 0 {baz, foo} {foo, baz, bar} 1 True
1 1 0 {baz, foo} {foo, baz, bar} 2 True
2 1 0 {baz, foo} {foo, baz, bar} 3 True
4 1 1 {baz, bar} {foo, baz, bar} 1 True
5 1 1 {baz, bar} {foo, baz, bar} 2 True
6 1 1 {baz, bar} {foo, baz, bar} 3 True
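Since the items columns hold Python sets, the custom helper can be replaced by the built-in set.issubset, which performs the same containment check. A self-contained sketch of that variant:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [{'foo', 'baz'}, {'bar', 'baz'}]})
df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [{'bar', 'baz', 'foo'}] * 3 + [{'one', 'two', 'bar'}],
                    'other': [1, 2, 3, 2]})

merged_df = df1.merge(df2, on='key')
# set.issubset replaces check_if_all_in_list with the same semantics.
merged_df['check'] = merged_df.apply(
    lambda row: row['items_x'].issubset(row['items_y']), axis=1)
result = merged_df[merged_df['check']]
```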
Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]
We can merge, then perform a query to drop rows not within the range:
(df1.merge(df2, on=['Date', 'BillNo.'])
    .query('abs(Amount_x - Amount_y) <= 5')
    .drop('Amount_x', axis=1))
Date BillNo. Amount_y
0 10/08/2020 ABBCSQ1ZA 876
1 10/16/2020 AA171E1Z0 5491
This works well as long as there is only one row that corresponds to a specific (Date, BillNo) combination in each frame.
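A self-contained sketch with assumed sample data (the original frames and bill numbers beyond those shown are assumptions) that reproduces the output:

```python
import pandas as pd

# Assumed sample data: amounts differ slightly between the two frames;
# the third row's difference exceeds the +/-5 tolerance and is dropped.
df1 = pd.DataFrame({'Date': ['10/08/2020', '10/16/2020', '10/20/2020'],
                    'BillNo.': ['ABBCSQ1ZA', 'AA171E1Z0', 'XX000XX00'],
                    'Amount': [875, 5490, 100]})
df2 = pd.DataFrame({'Date': ['10/08/2020', '10/16/2020', '10/20/2020'],
                    'BillNo.': ['ABBCSQ1ZA', 'AA171E1Z0', 'XX000XX00'],
                    'Amount': [876, 5491, 200]})

out = (df1.merge(df2, on=['Date', 'BillNo.'])
       .query('abs(Amount_x - Amount_y) <= 5')
       .drop('Amount_x', axis=1))
```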