How to Join Two DataFrames for Which Column Values Are Within a Certain Range

How do you join two dataframes where the values of multiple columns must each fall within a certain range, using pandas?
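The original question does not include the input frames, so the solutions below assume df1 holds the range bounds plus a score and df2 holds the values to look up. A minimal hypothetical pair, with values chosen to reproduce the results shown:

import pandas as pd

# Hypothetical inputs (not from the original question): df1 carries the
# ranges and a score, df2 carries the Price/year values to match.
df1 = pd.DataFrame({'price_start': [0, 60, 40],
                    'price_end':   [20, 80, 60],
                    'year_start':  [2000, 2001, 2008],
                    'year_end':    [2002, 2003, 2012],
                    'score':       [20, 50, 30]})
df2 = pd.DataFrame({'Price': [10, 70, 50],
                    'year':  [2001, 2002, 2010]})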

Solution 1: Simple solution for small datasets

For a small dataset, you can cross join df1 and df2 with .merge(), then keep only the rows where Price and year fall within their respective ranges using .query(), as follows:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
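As a side note, .query() supports chained comparisons, so the same filter can be written more compactly:

(df1.merge(df2, how='cross')
    .query('price_start <= Price <= price_end and year_start <= year <= year_end')
    [['Price', 'year', 'score']]
)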

If your pandas version is older than 1.2.0 (released in December 2020) and does not support merge() with how='cross', you can emulate the cross join with a constant key:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30

Solution 2: NumPy solution for large datasets

For a large dataset where performance is a concern, you can use NumPy broadcasting (instead of a cross join followed by filtering) to speed up execution. Note that broadcasting still builds a len(df2) × len(df1) boolean mask, so memory use also grows with the product of the row counts.

We look for rows where Price in df2 falls within the price range in df1 and year in df2 falls within the year range in df1:

import numpy as np
import pandas as pd

# Pull out the raw NumPy arrays.
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# Broadcast df2's values against df1's bounds: the mask has shape
# (len(df2), len(df1)); i indexes matching rows of df2, j of df1.
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns),
)[['Price', 'year', 'score']]

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

Performance Comparison

Part 1: Comparison on the original datasets of 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns),
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs. 703 µs, i.e. 8.4x faster

Part 2: Comparison on datasets of 3,000 and 30,000 rows (the cross join materializes 90 million rows):

Data Setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns),
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs. 3.83 s, i.e. 7.2x faster

How to join two dataframes for which column values are within a certain range?

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up each event. (This assumes all the datetime columns are of Timestamp dtype.)
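Since the original frames are not included in the question, here is a hypothetical input pair consistent with the output shown below (the interval bounds are invented):

import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:34',
                                 '2016-05-14 10:54:35', '2016-05-14 10:54:36',
                                 '2016-05-14 10:54:39']),
    'A': [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    'B': [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:37']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:36',
                             '2016-05-14 10:54:39']),
    'event': ['E1', 'E2', 'E3'],
})

With that in place, the lookup is: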

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])

Output:

            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
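One caveat: .get_loc raises a KeyError for any timestamp that falls outside every interval. If the intervals do not overlap, a vectorized alternative (a sketch, not part of the original answer) is .get_indexer, which returns -1 for unmatched points:

# Position of the containing interval for each timestamp, -1 if none.
pos = df_2.index.get_indexer(df_1['timestamp'])
matched = pos >= 0
df_1.loc[matched, 'event'] = df_2['event'].to_numpy()[pos[matched]]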

How to join two dataframes when the values of two columns must each be within a certain range (Python)?

Option 1

If you're using pandas 1.2.0 or later, you can create the cartesian product of both dataframes and then check the conditions. Also, since you don't need RT [min] and Molecular Weight from df1, I'll assume you have already removed them:

df3 = df1.merge(df2, how='cross', suffixes=[None, None])

# check if 'Molecular Weight' is in the interval
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])

# check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])

df3 = df3[mask1 & mask2].reset_index(drop=True)

Output:

df3
            Name df1  RT [min]+0.2  RT [min]-0.2  ...                   Name df2  Molecular Weight  RT [min]
0  unknow compound 1          7.79          7.39  ...  β-D-Glucopyranuronic acid          194.0422     7.483
1  unknow compound 2          7.71          7.31  ...  β-D-Glucopyranuronic acid          194.0422     7.483
2  unknow compound 2          7.71          7.31  ...              α,α-Trehalose          194.1000     7.350
3  unknow compound 3          7.61          7.21  ...  β-D-Glucopyranuronic acid          194.0422     7.483
4  unknow compound 3          7.61          7.21  ...              α,α-Trehalose          194.1000     7.350

Option 2

As your data is considerably large, you may want to use a generator so that you don't load the whole resulting dataframe into memory at once. Again, I'm assuming you removed RT [min] and Molecular Weight from df1.

import numpy as np
from itertools import product

def df_iter(df1, df2):
    for row1, row2 in product(df1.values, df2.values):

        # RT [min]-0.2 <= RT [min] <= RT [min]+0.2
        if row1[2] <= row2[2] <= row1[1]:

            # Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
            if row1[4] <= row2[1] <= row1[3]:
                yield np.concatenate((row1, row2))

df3_rows = df_iter(df1, df2)

Then you can manipulate the rows:

for row in df3_rows:
    print(row)

Output:

['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]

Or create a dataframe:

df3 = pd.DataFrame(data=list(df3_rows),
                   columns=np.concatenate((df1.columns, df2.columns)))

Which results in the same dataframe from Option 1.

NOTE 1: Be careful with the positional indices in the conditionals inside df_iter; those work for my column order.

NOTE 2: I'm pretty sure your data doesn't match the example df3.

Merging two DataFrames using a range of columns (right on ID and left on multiple IDs)

Suppose we have the following two dataframes:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)

The second dataframe will be:

   id_name  id_surname  id_first_name month  year column_B
0      1.0         NaN            NaN   Jan  2022    check
1      NaN         2.0            NaN   Mar  2020   check_
2      NaN         NaN            3.0   Apr  2021  check__

You can create a new id column for the second dataframe by keeping the first non-NaN value from the three columns id_name, id_surname and id_first_name: start from the id_name column, fill its NaNs with the non-NaN values of id_surname, then fill the remaining NaNs with the non-NaN values of id_first_name. The code to do that is:

df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])

which will create the column id in df2:

   id_name  id_surname  id_first_name month  year column_B   id
0      1.0         NaN            NaN   Jan  2022    check  1.0
1      NaN         2.0            NaN   Mar  2020   check_  2.0
2      NaN         NaN            3.0   Apr  2021  check__  3.0
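As a side note, an equivalent one-liner (a sketch, assuming the three id columns are named as above) back-fills along the row axis and keeps the first column, i.e. the first non-NaN value per row:

# Take the first non-NaN across the three id columns, row by row.
id_cols = ["id_name", "id_surname", "id_first_name"]
df2["id"] = df2[id_cols].bfill(axis=1).iloc[:, 0]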

Finally, you can merge by:

merged = pd.merge(
    df1,
    df2,
    left_on=["id", "month", "year"],
    right_on=["id", "month", "year"],
    how="left",
)

and the result will be:

   id month  year column_A  id_name  id_surname  id_first_name column_B
0   1   Jan  2022     test      1.0         NaN            NaN    check
1   2   Mar  2020    test_      NaN         2.0            NaN   check_
2   3   Apr  2021   test__      NaN         NaN            3.0  check__

Join two dataframes by range and values

First find the smallest Value that is larger than Start, then make sure it is smaller than End:

import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

# direction='forward' picks the first Value at or after Start.
df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on='Start', right_on='Value', direction='forward')
df = df[df['Value'] < df['End']].drop(columns='Value')

   Start    End Responsible
0  10001  20000        Jack
1  20001  30000       Dylan

Left join pandas if column value is within a certain range?

Since pandas 1.2.0, you can cross merge, which creates the cartesian product of the two DataFrames. So cross merge and keep only the rows where the states match. Then compute the absolute difference between the zip codes and use it to identify, for each Zip_left, the row where the distance is smallest. Finally, mask the rows where the difference is greater than 15 (even if it is the closest match), filling them with NaN:

merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
merged = merged[merged['State_left'] == merged['State_right']]  # states must match
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()  # zip distance
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']]  # closest per Zip_left
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff'] > 15)  # farther than 15 -> NaN
out = (merged.drop(columns=['State_right', 'Diff'])
             .rename(columns={'State_left': 'State'})
             .reset_index(drop=True))

Output:

   Zip_left State  Zip_right  Average_Rent
0     10001    NY    10003.0        1200.0
1     10007    NY    10008.0        1460.0
2     10013    NY    10010.0        1900.0
3     90011    CA    90011.0         850.0
4     91331    CA        NaN           NaN
5     90650    CA    90645.0        2300.0
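For larger frames the cross merge gets expensive. A hedged alternative (a sketch, assuming both frames hold integer Zip and State columns) is pd.merge_asof with direction='nearest' and tolerance=15, which picks the closest zip within ±15 per state and leaves NaN where nothing qualifies, matching the left-join behavior above. merge_asof keeps only the left Zip column, so copy the right-hand one first if you need it:

# merge_asof requires both frames to be sorted by the merge key.
df_right = df_right.assign(Zip_right=df_right['Zip']).sort_values('Zip')
out = pd.merge_asof(df_left.sort_values('Zip'), df_right,
                    on='Zip', by='State',
                    direction='nearest', tolerance=15)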

Join two data frames by considering whether values of paired columns are within range of the paired columns in the other dataframe

I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.

library(sqldf)
library(dplyr)

df1 <- data.frame("m1" = c("100010","100010","100010","100020","100020","100020"),
                  "m2" = c("100020","100020","100020","100010","100010","100010"),
                  "week" = c(1,2,3,1,1,3))
df2 <- data.frame("m1" = c("100010","100010","100010"),
                  "m2" = c("100020","100020","100020"),
                  "week" = c(1,2,3),
                  "freq" = c(3,1,2))
df3 <- data.frame("m1" = c("100010","100010","100010","100020","100020","100020"),
                  "m2" = c("100020","100020","100020","100010","100010","100010"),
                  "week" = c(1,2,3,1,1,3),
                  "freq" = c(3,1,2,3,3,2))

df_sql <-
  sqldf::sqldf("SELECT a.*, b.freq
                FROM df1 a
                LEFT JOIN df2 b
                ON (a.week = b.week AND a.m1 = b.m1 AND a.m2 = b.m2) OR
                   (a.week = b.week AND a.m1 = b.m2 AND a.m2 = b.m1)")

identical(df_sql, df3)
#> [1] TRUE

I am sure there are more elegant ways to do this, but the second strategy is just to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.

df <-
  df2 %>%
  rename(m2 = m1, m1 = m2) %>%
  bind_rows(df2, .) %>%
  left_join(df1, ., by = c("week", "m1", "m2"))

identical(df, df3)
#> [1] TRUE

I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.

Created on 2022-02-17 by the reprex package (v2.0.1)

Join two dataframes where the column values (a set) are a subset of the other

Create your dataframes

import pandas as pd

df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [set(['foo', 'baz']), set(['bar', 'baz'])]})

df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']),
                              set(['bar', 'baz', 'foo']), set(['one', 'two', 'bar'])],
                    'other': [1, 2, 3, 2]})

then take the cartesian product (every row shares key=1, so merging on key pairs each row of df1 with each row of df2)

merged_df = df1.merge(df2, on='key')
merged_df

   key  id     items_x          items_y  other
0    1   0  {baz, foo}  {foo, baz, bar}      1
1    1   0  {baz, foo}  {foo, baz, bar}      2
2    1   0  {baz, foo}  {foo, baz, bar}      3
3    1   0  {baz, foo}  {one, bar, two}      2
4    1   1  {baz, bar}  {foo, baz, bar}      1
5    1   1  {baz, bar}  {foo, baz, bar}      2
6    1   1  {baz, bar}  {foo, baz, bar}      3
7    1   1  {baz, bar}  {one, bar, two}      2

define your custom function and see if it works in one case

def check_if_all_in_list(list1, list2):
    return all(elem in list2 for elem in list1)

check_if_all_in_list(merged_df['items_x'][0], merged_df['items_y'][0])
True
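As an aside, since the items columns hold Python sets, the subset operator <= performs the same test without a helper (a shortcut not in the original answer):

# set <= set is the subset test: True when items_x is a subset of items_y.
check = merged_df['items_x'][0] <= merged_df['items_y'][0]  # True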

Create your match

merged_df['check'] = merged_df.apply(
    lambda row: check_if_all_in_list(row['items_x'], row['items_y']), axis=1)
merged_df

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
3    1   0  {baz, foo}  {one, bar, two}      2  False
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
7    1   1  {baz, bar}  {one, bar, two}      2  False

now filter out what you don't want

mask = merged_df['check']  # the boolean column is already a valid mask
merged_df[mask]

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True

Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]

We can merge, then perform a query to drop rows not within the range:

(df1.merge(df2, on=['Date', 'BillNo.'])
    .query('abs(Amount_x - Amount_y) <= 5')
    .drop('Amount_x', axis=1))

         Date    BillNo.  Amount_y
0  10/08/2020  ABBCSQ1ZA       876
1  10/16/2020  AA171E1Z0      5491

This works well as long as each frame has only one row for a given (Date, BillNo.) combination.
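If duplicates are possible, a hedged extension (a sketch introducing a hypothetical amount_diff helper column) keeps only the closest match per (Date, BillNo.) pair:

# Keep the single closest Amount match per (Date, BillNo.) pair.
(df1.merge(df2, on=['Date', 'BillNo.'])
    .assign(amount_diff=lambda d: d['Amount_x'].sub(d['Amount_y']).abs())
    .query('amount_diff <= 5')
    .sort_values('amount_diff')
    .groupby(['Date', 'BillNo.'], as_index=False).first()
    .drop(columns=['Amount_x', 'amount_diff']))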


