How to Join Two DataFrames for Which Column Values Are Within a Certain Range

How do you join two dataframes where the values of multiple columns must each fall within a certain range, using pandas?
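The original question does not include the input frames, so the solutions below assume df1 holds the range bounds plus a score and df2 holds the values to look up. A minimal hypothetical pair, with values chosen to reproduce the results shown:

import pandas as pd

# Hypothetical inputs (not from the original question): df1 carries the
# ranges and a score, df2 carries the Price/year values to match.
df1 = pd.DataFrame({'price_start': [0, 60, 40],
                    'price_end':   [20, 80, 60],
                    'year_start':  [2000, 2001, 2008],
                    'year_end':    [2002, 2003, 2012],
                    'score':       [20, 50, 30]})
df2 = pd.DataFrame({'Price': [10, 70, 50],
                    'year':  [2001, 2002, 2010]})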

Solution 1: Simple solution for small datasets

For a small dataset, you can cross join df1 and df2 with .merge(), then keep only the rows where Price and year fall within their respective ranges using .query(), as follows:

(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)
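As a side note, .query() supports chained comparisons, so the same filter can be written more compactly:

(df1.merge(df2, how='cross')
    .query('price_start <= Price <= price_end and year_start <= year <= year_end')
    [['Price', 'year', 'score']]
)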

If your pandas version is older than 1.2.0 (released in December 2020) and does not support merge() with how='cross', you can emulate the cross join with a constant key:

(df1.assign(key=1).merge(df2.assign(key=1), on='key').drop('key', axis=1)
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

Result:

   Price  year  score
0     10  2001     20
4     70  2002     50
8     50  2010     30

Solution 2: NumPy solution for large datasets

For a large dataset where performance is a concern, you can use NumPy broadcasting (instead of a cross join followed by filtering) to speed up execution. Note that broadcasting still builds a len(df2) × len(df1) boolean mask, so memory use also grows with the product of the row counts.

We look for rows where Price in df2 falls within the price range in df1 and year in df2 falls within the year range in df1:

import numpy as np
import pandas as pd

# Pull out the raw NumPy arrays.
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

# Broadcast df2's values against df1's bounds: the mask has shape
# (len(df2), len(df1)); i indexes matching rows of df2, j of df1.
i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns),
)[['Price', 'year', 'score']]

Result:

   Price  year  score
0     10  2001     20
1     70  2002     50
2     50  2010     30

Performance Comparison

Part 1: Comparison on the original datasets of 3 rows each:

Solution 1:

%%timeit
(df1.merge(df2, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

5.91 ms ± 87.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Solution 2:

%%timeit
d2_P = df2.Price.values
d2_Y = df2.year.values

d1_PS = df1.price_start.values
d1_PE = df1.price_end.values
d1_YS = df1.year_start.values
d1_YE = df1.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1.values[j], df2.values[i]]),
    columns=df1.columns.append(df2.columns),
)[['Price', 'year', 'score']]

703 µs ± 9.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Benchmark summary: 5.91 ms vs. 703 µs, i.e. 8.4x faster

Part 2: Comparison on datasets of 3,000 and 30,000 rows (the cross join materializes 90 million rows):

Data Setup:

df1a = pd.concat([df1] * 1000, ignore_index=True)
df2a = pd.concat([df2] * 10000, ignore_index=True)

Solution 1:

%%timeit
(df1a.merge(df2a, how='cross')
    .query('(Price >= price_start) & (Price <= price_end) & (year >= year_start) & (year <= year_end)')
    [['Price', 'year', 'score']]
)

27.5 s ± 3.24 s per loop (mean ± std. dev. of 7 runs, 1 loop each)

Solution 2:

%%timeit
d2_P = df2a.Price.values
d2_Y = df2a.year.values

d1_PS = df1a.price_start.values
d1_PE = df1a.price_end.values
d1_YS = df1a.year_start.values
d1_YE = df1a.year_end.values

i, j = np.where((d2_P[:, None] >= d1_PS) & (d2_P[:, None] <= d1_PE)
                & (d2_Y[:, None] >= d1_YS) & (d2_Y[:, None] <= d1_YE))

pd.DataFrame(
    np.column_stack([df1a.values[j], df2a.values[i]]),
    columns=df1a.columns.append(df2a.columns),
)[['Price', 'year', 'score']]

3.83 s ± 136 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Benchmark summary: 27.5 s vs. 3.83 s, i.e. 7.2x faster

How to join two dataframes for which column values are within a certain range?

One simple solution is to create an IntervalIndex from start and end with closed='both', then use get_loc to look up each event. (This assumes all the datetime columns are of Timestamp dtype.)
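Since the original frames are not included in the question, here is a hypothetical input pair consistent with the output shown below (the interval bounds are invented):

import pandas as pd

df_1 = pd.DataFrame({
    'timestamp': pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:34',
                                 '2016-05-14 10:54:35', '2016-05-14 10:54:36',
                                 '2016-05-14 10:54:39']),
    'A': [0.020228, 0.057780, 0.098808, 0.158789, 0.038129],
    'B': [0.026572, 0.175499, 0.620986, 1.014819, 2.384590],
})
df_2 = pd.DataFrame({
    'start': pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:34',
                             '2016-05-14 10:54:37']),
    'end':   pd.to_datetime(['2016-05-14 10:54:33', '2016-05-14 10:54:36',
                             '2016-05-14 10:54:39']),
    'event': ['E1', 'E2', 'E3'],
})

With that in place, the lookup is: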

df_2.index = pd.IntervalIndex.from_arrays(df_2['start'], df_2['end'], closed='both')
df_1['event'] = df_1['timestamp'].apply(lambda x: df_2.iloc[df_2.index.get_loc(x)]['event'])

Output:

            timestamp         A         B event
0 2016-05-14 10:54:33  0.020228  0.026572    E1
1 2016-05-14 10:54:34  0.057780  0.175499    E2
2 2016-05-14 10:54:35  0.098808  0.620986    E2
3 2016-05-14 10:54:36  0.158789  1.014819    E2
4 2016-05-14 10:54:39  0.038129  2.384590    E3
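One caveat: .get_loc raises a KeyError for any timestamp that falls outside every interval. If the intervals do not overlap, a vectorized alternative (a sketch, not part of the original answer) is .get_indexer, which returns -1 for unmatched points:

# Position of the containing interval for each timestamp, -1 if none.
pos = df_2.index.get_indexer(df_1['timestamp'])
matched = pos >= 0
df_1.loc[matched, 'event'] = df_2['event'].to_numpy()[pos[matched]]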

How to join two dataframes when the values of two columns must each be within a certain range (Python)?

Option 1

If you're using pandas 1.2.0 or later, you can create the cartesian product of both dataframes and then check the conditions. Also, since you don't need RT [min] and Molecular Weight from df1, I'll assume you have already removed them:

df3 = df1.merge(df2, how='cross', suffixes=[None, None])

# check if 'Molecular Weight' is in the interval
mask1 = df3['Molecular Weight'].ge(df3['Molecular Weight - 0.2']) & df3['Molecular Weight'].le(df3['Molecular Weight + 0.2'])

# check if 'RT [min]' is in the interval
mask2 = df3['RT [min]'].ge(df3['RT [min]-0.2']) & df3['RT [min]'].le(df3['RT [min]+0.2'])

df3 = df3[mask1 & mask2].reset_index(drop=True)

Output:

df3
            Name df1  RT [min]+0.2  RT [min]-0.2  ...                   Name df2  Molecular Weight  RT [min]
0  unknow compound 1          7.79          7.39  ...  β-D-Glucopyranuronic acid          194.0422     7.483
1  unknow compound 2          7.71          7.31  ...  β-D-Glucopyranuronic acid          194.0422     7.483
2  unknow compound 2          7.71          7.31  ...              α,α-Trehalose          194.1000     7.350
3  unknow compound 3          7.61          7.21  ...  β-D-Glucopyranuronic acid          194.0422     7.483
4  unknow compound 3          7.61          7.21  ...              α,α-Trehalose          194.1000     7.350

Option 2

As your data is considerably large, you may want to use a generator so that you don't load the whole resulting dataframe into memory at once. Again, I'm assuming you removed RT [min] and Molecular Weight from df1.

import numpy as np
from itertools import product

def df_iter(df1, df2):
    for row1, row2 in product(df1.values, df2.values):

        # RT [min]-0.2 <= RT [min] <= RT [min]+0.2
        if row1[2] <= row2[2] <= row1[1]:

            # Molecular Weight - 0.2 <= Molecular Weight <= Molecular Weight + 0.2
            if row1[4] <= row2[1] <= row1[3]:
                yield np.concatenate((row1, row2))

df3_rows = df_iter(df1, df2)

Then you can manipulate the rows:

for row in df3_rows:
    print(row)

Output:

['unknow compound 1' 7.79 7.39 194.24212 193.84212 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 2' 7.71 7.31 194.35 193.95 'α,α-Trehalose' 194.1 7.35]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'β-D-Glucopyranuronic acid' 194.0422 7.483]
['unknow compound 3' 7.61 7.21 194.24209 193.84209 'α,α-Trehalose' 194.1 7.35]

Or create a dataframe:

df3 = pd.DataFrame(data=list(df3_rows),
                   columns=np.concatenate((df1.columns, df2.columns)))

Which results in the same dataframe from Option 1.

NOTE 1: Be careful with the positional indices in the conditionals inside df_iter; those work for my column order.

NOTE 2: I'm pretty sure your data doesn't match the example df3.

Merging two DataFrames using a range of columns (right on ID and left on multiple IDs)

Suppose we have the following two dataframes:

import pandas as pd
import numpy as np

df1 = pd.DataFrame(
    {
        "id": [1, 2, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_A": ["test", "test_", "test__"],
    }
)

df2 = pd.DataFrame(
    {
        "id_name": [1, np.nan, np.nan],
        "id_surname": [np.nan, 2, np.nan],
        "id_first_name": [np.nan, np.nan, 3],
        "month": ["Jan", "Mar", "Apr"],
        "year": ["2022", "2020", "2021"],
        "column_B": ["check", "check_", "check__"],
    }
)

The second dataframe will be:

   id_name  id_surname  id_first_name month  year column_B
0      1.0         NaN            NaN   Jan  2022    check
1      NaN         2.0            NaN   Mar  2020   check_
2      NaN         NaN            3.0   Apr  2021  check__

You can create a new id column for the second dataframe by keeping the first non-NaN value from the three columns id_name, id_surname and id_first_name: start from the id_name column, fill its NaNs with the non-NaN values of id_surname, then fill the remaining NaNs with the non-NaN values of id_first_name. The code to do that is:

df2["id"] = df2["id_name"].fillna(df2["id_surname"]).fillna(df2["id_first_name"])

which will create the column id in df2:

   id_name  id_surname  id_first_name month  year column_B   id
0      1.0         NaN            NaN   Jan  2022    check  1.0
1      NaN         2.0            NaN   Mar  2020   check_  2.0
2      NaN         NaN            3.0   Apr  2021  check__  3.0
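As a side note, an equivalent one-liner (a sketch, assuming the three id columns are named as above) back-fills along the row axis and keeps the first column, i.e. the first non-NaN value per row:

# Take the first non-NaN across the three id columns, row by row.
id_cols = ["id_name", "id_surname", "id_first_name"]
df2["id"] = df2[id_cols].bfill(axis=1).iloc[:, 0]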

Finally, you can merge by:

merged = pd.merge(
    df1,
    df2,
    left_on=["id", "month", "year"],
    right_on=["id", "month", "year"],
    how="left",
)

and the result will be:

   id month  year column_A  id_name  id_surname  id_first_name column_B
0   1   Jan  2022     test      1.0         NaN            NaN    check
1   2   Mar  2020    test_      NaN         2.0            NaN   check_
2   3   Apr  2021   test__      NaN         NaN            3.0  check__

Join two dataframes by range and values

First find the smallest Value that is larger than Start, then make sure it is smaller than End:

import pandas as pd

df1 = pd.DataFrame({'Value': [11000, 21040, 12050], 'Responsible': ['Jack', 'Dylan', 'Jack']})
df2 = pd.DataFrame({'Start': [10001, 20001], 'End': [20000, 30000]})

# direction='forward' picks the first Value at or after Start.
df = pd.merge_asof(df2.sort_values('Start'), df1.sort_values('Value'),
                   left_on='Start', right_on='Value', direction='forward')
df = df[df['Value'] < df['End']].drop(columns='Value')

   Start    End Responsible
0  10001  20000        Jack
1  20001  30000       Dylan

Left join pandas if column value is within a certain range?

Since pandas 1.2.0, you can cross merge, which creates the cartesian product of the two DataFrames. So cross merge and keep only the rows where the states match. Then compute the absolute difference between the zip codes and use it to identify, for each Zip_left, the row where the distance is smallest. Finally, mask the rows where the difference is greater than 15 (even if it is the closest match), filling them with NaN:

merged = df_left.merge(df_right, how='cross', suffixes=('_left', '_right'))
merged = merged[merged['State_left'] == merged['State_right']]  # states must match
merged['Diff'] = merged['Zip_left'].sub(merged['Zip_right']).abs()  # zip distance
merged = merged[merged.groupby('Zip_left')['Diff'].transform('min') == merged['Diff']]  # closest per Zip_left
cols = merged.columns[~merged.columns.str.endswith('left')]
merged[cols] = merged[cols].mask(merged['Diff'] > 15)  # farther than 15 -> NaN
out = (merged.drop(columns=['State_right', 'Diff'])
             .rename(columns={'State_left': 'State'})
             .reset_index(drop=True))

Output:

   Zip_left State  Zip_right  Average_Rent
0     10001    NY    10003.0        1200.0
1     10007    NY    10008.0        1460.0
2     10013    NY    10010.0        1900.0
3     90011    CA    90011.0         850.0
4     91331    CA        NaN           NaN
5     90650    CA    90645.0        2300.0
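For larger frames the cross merge gets expensive. A hedged alternative (a sketch, assuming both frames hold integer Zip and State columns) is pd.merge_asof with direction='nearest' and tolerance=15, which picks the closest zip within ±15 per state and leaves NaN where nothing qualifies, matching the left-join behavior above. merge_asof keeps only the left Zip column, so copy the right-hand one first if you need it:

# merge_asof requires both frames to be sorted by the merge key.
df_right = df_right.assign(Zip_right=df_right['Zip']).sort_values('Zip')
out = pd.merge_asof(df_left.sort_values('Zip'), df_right,
                    on='Zip', by='State',
                    direction='nearest', tolerance=15)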

Join two data frames by considering whether values of paired columns are within range of the paired columns in the other dataframe

I suppose I would propose two ways of doing this depending on your preference. The first would be using SQL instead of R for the task. It’s a bit more straightforward for the type of join you’re describing.

library(sqldf)
library(dplyr)

df1 <- data.frame("m1" = c("100010","100010","100010","100020","100020","100020"),
                  "m2" = c("100020","100020","100020","100010","100010","100010"),
                  "week" = c(1,2,3,1,1,3))
df2 <- data.frame("m1" = c("100010","100010","100010"),
                  "m2" = c("100020","100020","100020"),
                  "week" = c(1,2,3),
                  "freq" = c(3,1,2))
df3 <- data.frame("m1" = c("100010","100010","100010","100020","100020","100020"),
                  "m2" = c("100020","100020","100020","100010","100010","100010"),
                  "week" = c(1,2,3,1,1,3),
                  "freq" = c(3,1,2,3,3,2))

df_sql <-
  sqldf::sqldf("SELECT a.*, b.freq
                FROM df1 a
                LEFT JOIN df2 b
                ON (a.week = b.week AND a.m1 = b.m1 AND a.m2 = b.m2) OR
                   (a.week = b.week AND a.m1 = b.m2 AND a.m2 = b.m1)")

identical(df_sql, df3)
#> [1] TRUE

I am sure there are more elegant ways to do this, but the second strategy is just to duplicate df2, rename the columns with m1 and m2 reversed, and then do the join.

df <-
  df2 %>%
  rename(m2 = m1, m1 = m2) %>%
  bind_rows(df2, .) %>%
  left_join(df1, ., by = c("week", "m1", "m2"))

identical(df, df3)
#> [1] TRUE

I imagine there are other ways that don’t involve a join, but that’s how I would do it using joins.

Created on 2022-02-17 by the reprex package (v2.0.1)

Join two dataframes where the column values (a set) are a subset of the other

Create your dataframes

import pandas as pd

df1 = pd.DataFrame({'key': [1, 1],
                    'id': [0, 1],
                    'items': [set(['foo', 'baz']), set(['bar', 'baz'])]})

df2 = pd.DataFrame({'key': [1, 1, 1, 1],
                    'items': [set(['bar', 'baz', 'foo']), set(['bar', 'baz', 'foo']),
                              set(['bar', 'baz', 'foo']), set(['one', 'two', 'bar'])],
                    'other': [1, 2, 3, 2]})

then take the cartesian product (every row shares key=1, so merging on key pairs each row of df1 with each row of df2)

merged_df = df1.merge(df2, on='key')
merged_df

   key  id     items_x          items_y  other
0    1   0  {baz, foo}  {foo, baz, bar}      1
1    1   0  {baz, foo}  {foo, baz, bar}      2
2    1   0  {baz, foo}  {foo, baz, bar}      3
3    1   0  {baz, foo}  {one, bar, two}      2
4    1   1  {baz, bar}  {foo, baz, bar}      1
5    1   1  {baz, bar}  {foo, baz, bar}      2
6    1   1  {baz, bar}  {foo, baz, bar}      3
7    1   1  {baz, bar}  {one, bar, two}      2

define your custom function and see if it works in one case

def check_if_all_in_list(list1, list2):
    return all(elem in list2 for elem in list1)

check_if_all_in_list(merged_df['items_x'][0], merged_df['items_y'][0])
True
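As an aside, since the items columns hold Python sets, the subset operator <= performs the same test without a helper (a shortcut not in the original answer):

# set <= set is the subset test: True when items_x is a subset of items_y.
check = merged_df['items_x'][0] <= merged_df['items_y'][0]  # True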

Create your match

merged_df['check'] = merged_df.apply(
    lambda row: check_if_all_in_list(row['items_x'], row['items_y']), axis=1)
merged_df

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
3    1   0  {baz, foo}  {one, bar, two}      2  False
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True
7    1   1  {baz, bar}  {one, bar, two}      2  False

now filter out what you don't want

mask = merged_df['check']  # the boolean column is already a valid mask
merged_df[mask]

   key  id     items_x          items_y  other  check
0    1   0  {baz, foo}  {foo, baz, bar}      1   True
1    1   0  {baz, foo}  {foo, baz, bar}      2   True
2    1   0  {baz, foo}  {foo, baz, bar}      3   True
4    1   1  {baz, bar}  {foo, baz, bar}      1   True
5    1   1  {baz, bar}  {foo, baz, bar}      2   True
6    1   1  {baz, bar}  {foo, baz, bar}      3   True

Join two DataFrames on common columns only if the difference in a separate column is within range [-n, +n]

We can merge, then perform a query to drop rows not within the range:

(df1.merge(df2, on=['Date', 'BillNo.'])
    .query('abs(Amount_x - Amount_y) <= 5')
    .drop('Amount_x', axis=1))

         Date    BillNo.  Amount_y
0  10/08/2020  ABBCSQ1ZA       876
1  10/16/2020  AA171E1Z0      5491

This works well as long as each frame has only one row for a given (Date, BillNo.) combination.
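If duplicates are possible, a hedged extension (a sketch introducing a hypothetical amount_diff helper column) keeps only the closest match per (Date, BillNo.) pair:

# Keep the single closest Amount match per (Date, BillNo.) pair.
(df1.merge(df2, on=['Date', 'BillNo.'])
    .assign(amount_diff=lambda d: d['Amount_x'].sub(d['Amount_y']).abs())
    .query('amount_diff <= 5')
    .sort_values('amount_diff')
    .groupby(['Date', 'BillNo.'], as_index=False).first()
    .drop(columns=['Amount_x', 'amount_diff']))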


