Join Dataframes Based on Partial String-Match Between Columns

Join dataframes based on partial string-match between columns

Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, first concatenate movie and year from df1:

s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'

res = df2[df2['FILM'].isin(s)]

print(res)

               FILM  VOTES
4  Max Steel (2016)    560
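If you want an actual join rather than a filter, the same constructed key can be fed to a merge. A minimal sketch, assuming sample data: only the "Max Steel (2016)" row with 560 votes comes from the output above, the other rows are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; only "Max Steel (2016)" / 560 appears in the answer above
df1 = pd.DataFrame({"movie": ["Max Steel", "Inception"], "year": [2016, 2010]})
df2 = pd.DataFrame({"FILM": ["Max Steel (2016)", "Birdman (2014)"], "VOTES": [560, 100]})

# Build the key in the same "title (year)" format that df2 uses
keyed = df1.assign(FILM=df1["movie"] + " (" + df1["year"].astype(str) + ")")

# Left join keeps every row of df1; rows without a match get NaN in VOTES
merged = keyed.merge(df2, on="FILM", how="left")
print(merged)
```

With how="left" the unmatched rows stay visible, which makes it easy to see which titles failed to line up.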

Merge two pandas DataFrame based on partial match

Update: the fuzzywuzzy project has been renamed to thefuzz.

You can use thefuzz package and the function extractOne:

# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy

from thefuzz import process

best_city = lambda x: process.extractOne(x, df2["City"])[2]  # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values

Output:

>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1

Note: when the choices are a pandas Series, extractOne returns a tuple of 3 values for the best match: the City name from df2 [0], the accuracy score [1] and the index [2] (<- the one I use).
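To see the (match, score, index) idea without installing anything, here is a rough sketch that swaps in difflib.SequenceMatcher from the standard library. This is a stand-in, not thefuzz's scorer, so the scores will differ from what extractOne reports. The helper name extract_one and the contents of df2 are assumptions made up for the example.

```python
from difflib import SequenceMatcher

import pandas as pd


def extract_one(query, choices):
    """Hypothetical helper: return (best_match, score, index_label) for a Series of choices."""
    best = None
    for label, choice in choices.items():
        # Similarity in 0..100, computed with difflib instead of thefuzz
        score = round(100 * SequenceMatcher(None, query.lower(), choice.lower()).ratio())
        if best is None or score > best[1]:
            best = (choice, score, label)
    return best


df1 = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
# Hypothetical lookup table; the original df2 is not shown in the answer
df2 = pd.DataFrame({"City": ["San Francisco-Oakland, CA", "Fresno, CA"],
                    "Geo": ["geo1", "geo2"]})

# Element [2] of the tuple is the df2 index label of the best match
best_idx = df1["City"].map(lambda city: extract_one(city, df2["City"])[2])

# Assign by position so df2's index labels do not get aligned against df1's
df1["Geo"] = df2.loc[best_idx.to_numpy(), "Geo"].to_numpy()
print(df1)
```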

merge 2 dataframes based on partial string-match between columns

From a previous post.

Input data:

>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1

A bit of fuzzy logic:

import pandas as pd
from fuzzywuzzy import process

dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
                               .tolist(), columns=["movie", "ratio", "best_id"])

>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     90        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3

The index of dfm is the index of df1, whereas the column best_id holds the index of df2. Now you can update your first dataframe:

THRESHOLD = 90  # adjust this number

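ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
# ids.index holds df1 labels and ids.values holds df2 labels, so assign by position
df1.loc[ids.index, "correct_id"] = df2.loc[ids, "correct_id"].to_numpy()
df1["correct_id"] = df1["correct_id"].astype("Int64")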

>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1

Merge Dataframes Based on Partial Substrings Match

This should work too

# Split concat_address_id on the 'address' prefix to recover the two ids
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)

# Create empty address_id column to merge with df1
df2['address_id'] = ''

# Filter out address id missing from df1
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']

# Set value in address_id column 
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']

   concat_address_id  last_login  country_of_login  address_id_1  address_id_2  address_id
0  address1address5   15/10/2020  CN                address1      address5      address1
1  address6address2   18/02/2020  NL                address6      address2      address2
2  address3address5   13/05/2019  BR                address3      address5      address3
3  address6address4   18/06/2020  NL                address6      address4      address4
4  address5address8   13/05/2019  RU                address5      address8      address5

# Merge df1 and df2
df_final = pd.merge(df1,df2[['address_id', 'last_login', 'country_of_login']],
                    on='address_id',how='left')

   process   sku   address_id  customer   country  last_login  country_of_login
0  process1  sku1  address1    customer5  BR       15/10/2020  CN
1  process1  sku2  address2    customer5  BR       18/02/2020  NL
2  process1  sku3  address3    customer5  BR       13/05/2019  BR
3  process1  sku4  address4    customer5  BR       18/06/2020  NL
4  process1  sku5  address5    customer5  BR       13/05/2019  RU
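If the concatenated column can hold more than two ids, a variant with str.findall and explode avoids the hard-coded address_id_1 / address_id_2 columns. A sketch under the assumption that every id looks like "address" followed by digits; the frames are rebuilt from the tables above, with df1 trimmed to two columns.

```python
import pandas as pd

df1 = pd.DataFrame({
    "sku": ["sku1", "sku2", "sku3", "sku4", "sku5"],
    "address_id": ["address1", "address2", "address3", "address4", "address5"],
})
df2 = pd.DataFrame({
    "concat_address_id": ["address1address5", "address6address2", "address3address5",
                          "address6address4", "address5address8"],
    "last_login": ["15/10/2020", "18/02/2020", "13/05/2019", "18/06/2020", "13/05/2019"],
    "country_of_login": ["CN", "NL", "BR", "NL", "RU"],
})

# One row per (df2 row, address id); assumes ids match "address<digits>"
exploded = df2.assign(
    address_id=df2["concat_address_id"].str.findall(r"address\d+")
).explode("address_id")

# Drop ids unknown to df1, then keep the first remaining id of each original df2 row
exploded = exploded[exploded["address_id"].isin(df1["address_id"])]
exploded = exploded[~exploded.index.duplicated(keep="first")]

df_final = df1.merge(
    exploded[["address_id", "last_login", "country_of_login"]],
    on="address_id", how="left",
)
print(df_final)
```

Keeping only the first known id per row reproduces the "prefer address_id_1, fall back to address_id_2" rule from the code above.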

Pandas: join on partial string match, like Excel VLOOKUP

This is one way, using pd.Series.apply, which is just a thinly veiled loop. What you are looking for is a "partial string merge"; I'm not sure it exists in a vectorised form.

df4 = df1.copy()

def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]

df4['Amount'] = df4['Invoice'].apply(get_amount)

print(df4)

  Currency Invoice Amount
0      EUR   20561    150
1      EUR   20562    175
2      EUR   20563    160
3      USD   20564    180
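One way to get closer to a vectorised version is to pull the invoice number out of Ref with str.extract and then do an ordinary merge. This is only a sketch: the Ref strings and the unmatched invoice 20565 are invented, since the original df2 is not shown, and it assumes Invoice is stored as strings.

```python
import re

import pandas as pd

df1 = pd.DataFrame({
    "Currency": ["EUR", "EUR", "EUR", "USD", "USD"],
    "Invoice": ["20561", "20562", "20563", "20564", "20565"],  # 20565 is a made-up miss
})
# Hypothetical lookup table; only the amounts come from the output above
df2 = pd.DataFrame({
    "Ref": ["Payment 20561 A", "INV-20562", "20563/2018", "ref 20564 x"],
    "Amount": [150, 175, 160, 180],
})

# One alternation of all invoice numbers, escaped so they are matched literally
pattern = "(" + "|".join(map(re.escape, df1["Invoice"])) + ")"

# Extract the first invoice number found in each Ref to use as the join key
keyed = df2.assign(Invoice=df2["Ref"].str.extract(pattern, expand=False))

df4 = df1.merge(keyed[["Invoice", "Amount"]], on="Invoice", how="left")
print(df4)
```

Unlike the .iloc[0] lookup, an invoice with no matching Ref ends up as NaN instead of raising IndexError. If one invoice number can be a prefix of another, the alternation would need to be sorted longest first.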
