Join dataframes based on partial string-match between columns
Given input dataframes df1 and df2, you can use Boolean indexing via pd.Series.isin. To align the format of the movie strings, you first need to concatenate movie and year from df1:
s = df1['movie'] + ' (' + df1['year'].astype(str) + ')'
res = df2[df2['FILM'].isin(s)]
print(res)
               FILM  VOTES
4  Max Steel (2016)    560
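The isin filter above only selects rows of df2. If the goal is an actual join, the same constructed key can be passed to merge. The following is a minimal sketch under assumptions: the sample rows are hypothetical, and only the column names (movie, year, FILM, VOTES) come from the answer.

```python
import pandas as pd

# Hypothetical sample data; only the column names follow the answer above.
df1 = pd.DataFrame({"movie": ["Max Steel", "Arrival"], "year": [2016, 2016]})
df2 = pd.DataFrame({"FILM": ["Max Steel (2016)", "Moonlight (2016)"],
                    "VOTES": [560, 900]})

# Build the same "title (year)" key that df2["FILM"] uses.
df1["FILM"] = df1["movie"] + " (" + df1["year"].astype(str) + ")"

# An inner merge keeps only the rows whose key appears in both frames.
joined = df1.merge(df2, on="FILM", how="inner")
print(joined)
```

Using how="left" instead would keep every row of df1 and leave VOTES empty where no film matches.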
Merge two pandas DataFrame based on partial match
Update: the fuzzywuzzy project has been renamed to thefuzz.
You can use the thefuzz package and its function extractOne:
# Python env: pip install thefuzz
# Anaconda env: pip install thefuzz
# -> thefuzz is not yet available on Anaconda (2021-09-18)
# -> you can use the old package: conda install -c conda-forge fuzzywuzzy
from thefuzz import process
best_city = lambda x: process.extractOne(x, df2["City"])[2] # See note below
df1['Geo'] = df2.loc[df1["City"].map(best_city).values, 'Geo'].values
Output:
>>> df1
                City  Val   Geo
0  San Francisco, CA    1  geo1
1        Oakland, CA    2  geo1
Note: when the choices are a pandas Series, extractOne returns a tuple of 3 values for the best match: the City name from df2 [0], the similarity score [1] and the index label in df2 [2] (the one used here).
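If installing a fuzzy-matching package is not an option, a similar lookup can be sketched with the standard library. Note that this swaps in difflib.get_close_matches for extractOne, so scores and edge cases will differ from thefuzz. The sample rows, the helper name best_city and the cutoff of 0.5 are assumptions for illustration.

```python
import difflib

import pandas as pd

# Hypothetical sample data, shaped like the frames in the answer above.
df1 = pd.DataFrame({"City": ["San Francisco, CA", "Oakland, CA"], "Val": [1, 2]})
df2 = pd.DataFrame({"City": ["San Francisco-Oakland, CA", "Boston, MA"],
                    "Geo": ["geo1", "geo2"]})


def best_city(name, cutoff=0.5):
    """Return the closest df2 city name, or None if nothing clears the cutoff."""
    hits = difflib.get_close_matches(name, df2["City"].tolist(), n=1, cutoff=cutoff)
    return hits[0] if hits else None


# Map each df1 city to its closest df2 city, then look up that city's Geo.
geo_by_city = df2.set_index("City")["Geo"]
df1["Geo"] = df1["City"].map(best_city).map(geo_by_city)
print(df1)
```

Cities with no candidate above the cutoff end up with a missing Geo rather than a forced match.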
Merge 2 dataframes based on partial string-match between columns
From a previous post.
Input data:
>>> df1
                movie  correct_id
0             birdman         NaN
1   avengers: endgame         NaN
2            deadpool         NaN
3  once upon deadpool         NaN

>>> df2
                   movie  correct_id
0               birdmans           4
1  The avengers: endgame           2
2               The King           3
3   once upon a deadpool           1
A bit of fuzzy logic:
import pandas as pd
from fuzzywuzzy import process  # or: from thefuzz import process
dfm = pd.DataFrame(df1["movie"].apply(lambda x: process.extractOne(x, df2["movie"]))
.tolist(), columns=["movie", "ratio", "best_id"])
>>> dfm
                   movie  ratio  best_id
0               birdmans     93        0
1  The avengers: endgame     90        1
2   once upon a deadpool     90        3
3   once upon a deadpool     95        3
The index of dfm is the index of df1, whereas the column best_id holds the index of df2. Now you can update your first dataframe:
THRESHOLD = 90 # adjust this number
ids = dfm.loc[dfm["ratio"] > THRESHOLD, "best_id"]
df1["correct_id"] = df2.loc[ids, "correct_id"].astype("Int64")
>>> df1
                movie  correct_id
0             birdman           4
1   avengers: endgame           2
2            deadpool        <NA>
3  once upon deadpool           1
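One thing to keep in mind: the assignment above aligns on index labels, so it gives the right rows only when a matched row carries the same label in df1 and df2. The sketch below assigns by df1's own index instead. Everything in it is illustrative: the rows are hypothetical (df2 is deliberately in a different order), and dfm is typed in by hand with assumed scores rather than computed by extractOne.

```python
import pandas as pd

# Hypothetical data: df2 is in a different order than df1.
df1 = pd.DataFrame({"movie": ["birdman", "deadpool", "once upon deadpool"]})
df2 = pd.DataFrame({"movie": ["once upon a deadpool", "The King", "birdmans"],
                    "correct_id": [1, 3, 4]})

# Hand-written stand-in for the extractOne result (scores are assumed values).
# dfm's index is df1's index; best_id holds index labels of df2.
dfm = pd.DataFrame({"movie": ["birdmans", "once upon a deadpool", "once upon a deadpool"],
                    "ratio": [93, 90, 95],
                    "best_id": [2, 0, 0]})

THRESHOLD = 90  # adjust this number
matched = dfm[dfm["ratio"] > THRESHOLD]

# Look up the ids in df2, then re-label them with df1's index before assigning.
ids = pd.Series(df2.loc[matched["best_id"], "correct_id"].to_numpy(),
                index=matched.index)
df1["correct_id"] = ids.reindex(df1.index).astype("Int64")
print(df1)
```

Rows below the threshold stay as <NA>, and each matched row receives the id of the df2 row it was matched to, regardless of row order.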
Merge Dataframes Based on Partial Substrings Match
This should work too
# Split the concat_address_id column on the literal text 'address' (a plain string split, not a regular expression)
df2['address_id_1'] = 'address' + df2['concat_address_id'].str.split('address').str.get(1)
df2['address_id_2'] = 'address' + df2['concat_address_id'].str.split('address').str.get(2)
# Create empty address_id column to merge with df1
df2['address_id'] = ''
# Filter out address id missing from df1
df2.loc[~df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_2']
# Set value in address_id column
df2.loc[df2['address_id_1'].isin(list(df1['address_id'])),'address_id'] = df2['address_id_1']
  concat_address_id  last_login country_of_login address_id_1 address_id_2 address_id
0  address1address5  15/10/2020               CN     address1     address5   address1
1  address6address2  18/02/2020               NL     address6     address2   address2
2  address3address5  13/05/2019               BR     address3     address5   address3
3  address6address4  18/06/2020               NL     address6     address4   address4
4  address5address8  13/05/2019               RU     address5     address8   address5
# Merge df1 and df2
df_final = pd.merge(df1,df2[['address_id', 'last_login', 'country_of_login']],
on='address_id',how='left')
    process   sku address_id   customer country  last_login country_of_login
0  process1  sku1   address1  customer5      BR  15/10/2020               CN
1  process1  sku2   address2  customer5      BR  18/02/2020               NL
2  process1  sku3   address3  customer5      BR  13/05/2019               BR
3  process1  sku4   address4  customer5      BR  18/06/2020               NL
4  process1  sku5   address5  customer5      BR  13/05/2019               RU
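The approach above handles exactly two ids per concatenated value. As a possible alternative, the ids can be extracted into a list and exploded to one row per id before merging, which also copes with a varying number of ids. This is only a sketch: the sample rows are hypothetical, and it assumes every id has the form address followed by digits.

```python
import pandas as pd

# Hypothetical sample data; column names follow the answer above.
df1 = pd.DataFrame({"sku": ["sku1", "sku2"],
                    "address_id": ["address1", "address2"]})
df2 = pd.DataFrame({"concat_address_id": ["address1address5", "address6address2"],
                    "country_of_login": ["CN", "NL"]})

# Assumption: each id looks like "address<digits>".
# findall yields a list of ids per row; explode turns it into one row per id.
exploded = (
    df2.assign(address_id=df2["concat_address_id"].str.findall(r"address\d+"))
       .explode("address_id")
)

# A left merge keeps every df1 row and attaches the matching login data.
df_final = df1.merge(exploded[["address_id", "country_of_login"]],
                     on="address_id", how="left")
print(df_final)
```

If the same id appears in several concatenated values, the merge returns one output row per occurrence, so deduplicate first if that is not wanted.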
Pandas: join on partial string match, like Excel VLOOKUP
This is one way, using pd.Series.apply, which is just a thinly veiled loop. What you are looking for is a "partial string merge"; I'm not sure it exists in a vectorised form.
df4 = df1.copy()
def get_amount(x):
    return df2.loc[df2['Ref'].str.contains(x), 'Amount'].iloc[0]
df4['Amount'] = df4['Invoice'].apply(get_amount)
print(df4)
  Currency  Invoice  Amount
0      EUR    20561     150
1      EUR    20562     175
2      EUR    20563     160
3      USD    20564     180
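As written, get_amount raises an IndexError when no Ref contains the invoice, and str.contains treats the invoice as a regular expression. A slightly more defensive variant might look like the sketch below. The sample rows are hypothetical, and it assumes Invoice is stored as strings.

```python
import pandas as pd

# Hypothetical sample data; Invoice is assumed to hold strings.
df1 = pd.DataFrame({"Currency": ["EUR", "USD"], "Invoice": ["20561", "99999"]})
df2 = pd.DataFrame({"Ref": ["Payment for 20561", "Payment for 20562"],
                    "Amount": [150, 175]})


def get_amount(invoice):
    """Return the first Amount whose Ref contains the invoice, or NaN if none does."""
    # regex=False makes this a literal substring test.
    hits = df2.loc[df2["Ref"].str.contains(invoice, regex=False), "Amount"]
    return hits.iloc[0] if not hits.empty else float("nan")


df4 = df1.assign(Amount=df1["Invoice"].map(get_amount))
print(df4)
```

This is still a row-by-row loop, so it scales with the product of the two frame sizes, just like the original.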