Join Dataframes Based on Partial String-Match Between Columns

Join dataframes based on partial string-match between columns

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings you need to first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  VOTES
4  Max Steel (2016)    560
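If you want an actual join rather than a filter, the same concatenated key can be passed to merge. This is only a minimal sketch: the sample rows below (other than Max Steel) are made up for illustration, and the column names are assumed to match the ones used above.

```python
import pandas as pd

# Hypothetical sample data; only "Max Steel (2016)" / 560 comes from the answer above
df1 = pd.DataFrame({"movie": ["Max Steel", "Inception"], "year": [2016, 2010]})
df2 = pd.DataFrame({"FILM": ["Max Steel (2016)", "Arrival (2016)"], "VOTES": [560, 700]})

# Build the key in df2's format, e.g. "Max Steel (2016)"
df1["FILM"] = df1["movie"] + " (" + df1["year"].astype(str) + ")"

# Left join keeps every row of df1; rows without a match get NaN in VOTES
merged = df1.merge(df2, on="FILM", how="left")
print(merged)
```

Because the join is exact on the rebuilt key, this only works when the two formats can be made identical; otherwise a fuzzy approach is needed.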

Merge two pandas DataFrame based on partial match

Update: the fuzzywuzzy project has been renamed to thefuzz.

You can use thefuzz package and the function extractOne:

# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy

from thefuzz import process

best_city = lambda x: process.extractOne(x, df2["City"])[2] # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values

Output:

>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1

Note: when the choices are a pandas Series, extractOne returns a tuple of 3 values for the best match: the City name from df2 [0], the accuracy score [1] and the index [2] (<- the one I use).
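If you cannot install thefuzz, the same "index of the best match" idea can be sketched with the standard library's difflib. Note that this swaps in a different scorer (SequenceMatcher.ratio, on a 0 to 1 scale), so results will not always agree with thefuzz. The data and the helper name best_index are hypothetical.

```python
import difflib
import pandas as pd

# Hypothetical sample data, not the frames from the question
df1 = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
df2 = pd.DataFrame({"City": ["San Francisco", "Oakland", "Denver"],
                    "Geo": ["geo1", "geo2", "geo3"]})

def best_index(name, choices):
    """Return the index label in `choices` of the most similar string."""
    scores = choices.map(lambda c: difflib.SequenceMatcher(None, name, c).ratio())
    return scores.idxmax()

# Look up Geo by the best-matching df2 label; to_numpy() avoids index alignment
best = df1["City"].map(lambda x: best_index(x, df2["City"]))
df1["Geo"] = df2.loc[best, "Geo"].to_numpy()
print(df1)
```

Like extractOne without a cutoff, this always returns some match, even a poor one, so consider checking the score before trusting it.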

merge 2 dataframes based on partial string-match between columns

From a previous post.

Input data:

>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

A bit of fuzzy logic:

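import pandas as pd
from fuzzywuzzy import process  # or: from thefuzz import process

# extractOne returns (best match, score, index label in df2) for each df1 movie
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                   .tolist(), columns=["movie", "ratio", "best_id"])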
>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     95        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3

The index of dfm matches the index of df1, while the best_id column holds the matching index label in df2. Now you can update your first dataframe:

THRESHOLD = 90  # adjust this number

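# Index of ids = rows of df1 to update; values of ids = matching labels in df2
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# to_numpy() assigns by position, so values land on df1's rows, not on df2's labels
df1.loc[ids.index, "correct_id"] = df2.loc[ids, "correct_id"].to_numpy()
df1["correct_id"] = df1["correct_id"].astype("Int64")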
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
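The threshold idea can also be turned into a regular merge by first writing the matched title into a key column. The sketch below uses difflib.get_close_matches from the standard library instead of fuzzywuzzy, so the scores are on a 0 to 1 scale and the CUTOFF value is only an assumption to tune. The helper name closest and the reduced sample data are hypothetical.

```python
import difflib
import pandas as pd

# Reduced, hypothetical version of the input data
df1 = pd.DataFrame({"movie": ["birdman", "deadpool"]})
df2 = pd.DataFrame({"movie": ["birdmans", "The King"], "correct_id": [4, 3]})

CUTOFF = 0.8  # assumed threshold; difflib ratios range from 0 to 1

def closest(title):
    """Return the best df2 title at or above CUTOFF, or None if there is none."""
    hits = difflib.get_close_matches(title, df2["movie"].tolist(), n=1, cutoff=CUTOFF)
    return hits[0] if hits else None

# Key column holding the matched df2 title (None when nothing is close enough)
df1["match"] = df1["movie"].map(closest)

# Ordinary left join on the matched title
merged = df1.merge(df2.rename(columns={"movie": "match"}), on="match", how="left")
print(merged)
```

Rows below the cutoff keep a missing correct_id, which mirrors the <NA> for deadpool above.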

Merge Dataframes Based on Partial Substrings Match

This should work too: split the concatenated IDs into two columns, keep whichever one exists in df1, then merge on it.

# Split concat_address_id on the pattern 'address' and rebuild each ID
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)

# Create empty address_id column to merge with df1
df2['address_id'] = ''

# Filter out address id missing from df1
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']

# Set value in address_id column
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']

   concat_address_id  last_login  country_of_login  address_id_1  address_id_2  address_id
0   address1address5  15/10/2020                CN      address1      address5    address1
1   address6address2  18/02/2020                NL      address6      address2    address2
2   address3address5  13/05/2019                BR      address3      address5    address3
3   address6address4  18/06/2020                NL      address6      address4    address4
4   address5address8  13/05/2019                RU      address5      address8    address5

# Merge df1 and df2
df_final = pd.merge(df1, df2[['address_id', 'last_login', 'country_of_login']],
                    on='address_id', how='left')

    process   sku  address_id   customer  country  last_login  country_of_login
0  process1  sku1    address1  customer5       BR  15/10/2020                CN
1  process1  sku2    address2  customer5       BR  18/02/2020                NL
2  process1  sku3    address3  customer5       BR  13/05/2019                BR
3  process1  sku4    address4  customer5       BR  18/06/2020                NL
4  process1  sku5    address5  customer5       BR  13/05/2019                RU
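If a row can hold more than two concatenated IDs, the two helper columns can be replaced by str.findall plus explode. This is a rough sketch on a reduced, hypothetical copy of the data, and it assumes every ID looks like "address" followed by digits.

```python
import pandas as pd

# Reduced, hypothetical copies of df1 and df2
df1 = pd.DataFrame({"sku": ["sku1", "sku2", "sku3"],
                    "address_id": ["address1", "address2", "address3"]})
df2 = pd.DataFrame({"concat_address_id": ["address1address5", "address6address2",
                                          "address3address5"],
                    "country_of_login": ["CN", "NL", "BR"]})

# One element per embedded ID; df2's row label is repeated in the index
ids = df2["concat_address_id"].str.findall(r"address\d+").explode()

# Keep only IDs that exist in df1, then take the first such ID per df2 row
ids = ids[ids.isin(df1["address_id"])].groupby(level=0).first()

# Assignment aligns on df2's index; rows without a known ID get NaN
df2["address_id"] = ids

df_final = df1.merge(df2[["address_id", "country_of_login"]],
                     on="address_id", how="left")
print(df_final)
```

As in the answer above, the first ID found in df1 wins when a row contains several known IDs.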

Pandas: join on partial string match, like Excel VLOOKUP

This is one way, using pd.Series.apply, which is just a thinly veiled loop. What you are looking for is a "partial string merge", and I'm not sure it exists in a vectorised form.

df4 = df1.copy()

def get_amount(x):
    # First Amount whose Ref contains the invoice number (IndexError if none matches)
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)

print(df4)

  Currency  Invoice  Amount
0      EUR    20561     150
1      EUR    20562     175
2      EUR    20563     160
3      USD    20564     180
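One alternative that avoids the IndexError on unmatched invoices is a cross join followed by a substring filter. It is still a Python-level loop over all pairs, so it is only a sketch for small frames. It assumes pandas 1.2 or later (for how="cross") and that Invoice is stored as a string; the sample data is made up.

```python
import pandas as pd

# Hypothetical sample data; Invoice is assumed to be a string column
df1 = pd.DataFrame({"Currency": ["EUR", "EUR", "USD"],
                    "Invoice": ["20561", "20562", "20564"]})
df2 = pd.DataFrame({"Ref": ["Payment 20561 ACME", "INV-20562", "unrelated"],
                    "Amount": [150, 175, 999]})

# Every (invoice, ref) combination: len(df1) * len(df2) rows
pairs = df1.merge(df2, how="cross")

# Keep pairs where the invoice number appears inside the reference text
hit = [inv in ref for inv, ref in zip(pairs["Invoice"], pairs["Ref"])]
matched = pairs.loc[hit, ["Invoice", "Amount"]].drop_duplicates(subset="Invoice")

# Left join back so invoices without a match keep NaN instead of raising
result = df1.merge(matched, on="Invoice", how="left")
print(result)
```

drop_duplicates keeps the first matching reference per invoice, which corresponds to the .iloc[0] in the apply version.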

