How to Do Fuzzy Match Merge With Python Pandas

Is it possible to do a fuzzy match merge with Python pandas?

Similar to @locojay's suggestion, you can apply difflib's get_close_matches to df2's index and then join:

In [23]: import difflib 

In [24]: difflib.get_close_matches
Out[24]: <function difflib.get_close_matches>

In [25]: df2.index = df2.index.map(lambda x: difflib.get_close_matches(x, df1.index)[0])

In [26]: df2
Out[26]:
      letter
one        a
two        b
three      c
four       d
five       e

In [31]: df1.join(df2)
Out[31]:
       number letter
one         1      a
two         2      b
three       3      c
four        4      d
five        5      e


If these were columns, you could apply get_close_matches to the column in the same vein and then merge:

import pandas as pd

df1 = pd.DataFrame([[1,'one'],[2,'two'],[3,'three'],[4,'four'],[5,'five']], columns=['number', 'name'])
df2 = pd.DataFrame([['a','one'],['b','too'],['c','three'],['d','fours'],['e','five']], columns=['letter', 'name'])

df2['name'] = df2['name'].apply(lambda x: difflib.get_close_matches(x, df1['name'])[0])
df1.merge(df2)
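
Note that get_close_matches returns an empty list when nothing clears its similarity cutoff, so the [0] above raises an IndexError for unmatched values. A minimal guard, as a sketch (falling back to the original value is just one possible choice):

def closest(x, candidates):
    # Fall back to the original value when difflib finds no close match
    matches = difflib.get_close_matches(x, candidates)
    return matches[0] if matches else x

df2['name'] = df2['name'].apply(closest, candidates=df1['name'].tolist())
df1.merge(df2)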

How to do a fuzzy match merge based on several columns

Solution one:

If your data is as clean as you claim (there are no typos in the names in the example), then you can do this:

# Cleaning the capitalization error
df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

df_total = df1.append(df2,ignore_index=True)

df_total = df_total.groupby(["store code","name"]).first()
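
Note that DataFrame.append was removed in pandas 2.0; on a recent pandas, the same concatenation can be written with pd.concat (this also applies to the identical lines in solution two below):

# pandas >= 2.0: DataFrame.append no longer exists, use pd.concat instead
df_total = pd.concat([df1, df2], ignore_index=True)
df_total = df_total.groupby(["store code", "name"]).first()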

Solution two (if you have typos in the string values):

But if there are typos in the names and you want to merge them using fuzzy matching, then you need to follow these steps:

  1. We need these libraries to help us:

import pandas as pd
import networkx as nx
from fuzzywuzzy import fuzz
import itertools

Let's normalize the case so we are on the safe side:

df1["name"] = df1["name"].str.lower()
df2["name"] = df2["name"].str.lower()

Then let's start matching!

We need to make all combinations of the names in the two dataframes and build a dataframe out of them, so we can use apply, which is much faster than a for loop:

combs = list(itertools.product(df1["name"], df2["name"]))
combs = pd.DataFrame(combs)

Then we score each combination. WRatio will do just fine, but you can use your own custom matching functions:

combs['score'] = combs.apply(lambda x: fuzz.WRatio(x[0],x[1]), axis=1)
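
As a side note, fuzzywuzzy is no longer maintained and its successor rapidfuzz ships the same scorers with the same signatures; if you prefer it, the scoring step could be written as the following sketch (not part of the original answer):

# Same scoring step with rapidfuzz's API-compatible WRatio scorer
from rapidfuzz import fuzz as rfuzz

combs['score'] = combs.apply(lambda x: rfuzz.WRatio(x[0], x[1]), axis=1)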

Now, let's make a graph out of it. I used a minimum score of 90 as the criterion; you can use whichever suits you best:

threshold = 90
G_name = nx.from_pandas_edgelist(combs[combs['score']>=threshold],0,1, create_using=nx.Graph)

If names fit the matching criterion, they become connected in our graph, so each interconnected cluster represents the same name. With this information we can create a dictionary that replaces every variation of a name in our data with a single canonical one.

This code is a bit complex. In short, it creates a dataframe in which each row is one name and the columns hold its variations. It then melts the dataframe and builds a dictionary that has each name variation as key and the canonical representation of the name as value. This dictionary lets us replace all deviating names in your dataframes with the canonical one so the groupby can function correctly:

connected_names = pd.DataFrame()
for cluster in nx.connected_components(G_name):
    if len(cluster) != 1:
        connected_names = connected_names.append([list(cluster)])

connected_names = connected_names\
    .reset_index(drop=True)\
    .melt(id_vars=0)\
    .drop('variable', axis=1)\
    .dropna()\
    .reset_index(drop=True)\
    .set_index('value')

names_dict = connected_names.to_dict()[0]

Now we have the dictionary. All that remains is to replace the names and use the groupby method:

df1["name"] = df1["name"].replace(names_dict)
df2["name"] = df2["name"].replace(names_dict)

df_total = df1.append(df2,ignore_index=True)

df_total = df_total.groupby(["store code","name"]).first()

Perform Fuzzy Matching in 2 pandas dataframe

You can use the text matching capabilities of the fuzzywuzzy library mixed with pandas functions in python.

First, import the following libraries:

import pandas as pd
import numpy as np
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

Then, combine fuzzywuzzy's matching with pandas:

#get list of unique teams existing in df1
lst_teams = list(np.unique(np.array(df1['Team'])))
#define arbitrary threshold
thres = 70
#for each team match similar texts
for team in lst_teams:
    #iteration on dataframe filtered by team
    for index, row in df1.loc[df1['Team']==team].iterrows():
        #get list of players in this team
        lst_player_per_team = list(np.array(df2.loc[df2['Team']==team]['Player']))
        #use of fuzzywuzzy to make text matching
        output_ratio = process.extract(row['Player'], lst_player_per_team, scorer=fuzz.token_sort_ratio)
        #check if there are players from df2 in this team
        if output_ratio != []:
            #apply arbitrary threshold to keep only the most similar text
            if output_ratio[0][1] > thres:
                df1.loc[index, 'Age'] = df2.loc[(df2['Team']==team)&(df2['Player']==output_ratio[0][0])]['Age'].values[0]
df1 = df1.fillna('XX')

With this code and a threshold of 70, you get the following result:

print(df1)
           Player Team Age
0       John Sepi    A  22
1        Zan Fred    C  XX
2     Mark Daniel    E  21
3        Adam Pop    C  XX
4       Paul Sepi    B  XX
5  John Hernandez    D  26
6    Price Josiah    B  18
7  John Hernandez    A  19
8        Adam Pop    D  25

You can adjust the threshold to tune the accuracy of the text matching between the two dataframes.

Please note that you should be careful when using .iterrows(), as iterating over a dataframe row by row is generally discouraged.
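
If you want to avoid iterrows entirely, one option is to move the lookup into a function applied row-wise. This is a rough sketch assuming the same Player/Team/Age columns as above, not a tested drop-in:

# Row-wise lookup with apply instead of iterrows (illustrative sketch)
def lookup_age(row, other, thres=70):
    candidates = other.loc[other['Team'] == row['Team']]
    if candidates.empty:
        return 'XX'
    best = process.extractOne(row['Player'], list(candidates['Player']),
                              scorer=fuzz.token_sort_ratio, score_cutoff=thres)
    if best is None:
        return 'XX'
    return candidates.loc[candidates['Player'] == best[0], 'Age'].iloc[0]

df1['Age'] = df1.apply(lookup_age, axis=1, other=df2)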

You can check the fuzzywuzzy documentation here: https://pypi.org/project/fuzzywuzzy/

Pandas fast fuzzy match

import fuzzymatcher
import pandas as pd

df_left = pd.DataFrame({'id2': ['1', '2'], 'name': ['paris city', 'london town']})

df_right = pd.DataFrame({'id2': ['3', '4'], 'name': ['parid cit', 'londoon town']})

fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on = "name", right_on = "name")

The project's repository is here: https://github.com/RobinL/fuzzymatcher
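
The result also includes match quality columns (named best_match_score in the versions I have seen), so a rough way to drop weak matches, assuming that column name and an arbitrary cutoff, is:

# Keep only reasonably confident matches; 0.1 is an arbitrary cutoff
result = fuzzymatcher.fuzzy_left_join(df_left, df_right, left_on="name", right_on="name")
result = result[result['best_match_score'] > 0.1]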

how to 'fuzzy' match strings when merge two dataframe in pandas

I am using fuzzywuzzy here:

from fuzzywuzzy import fuzz
from fuzzywuzzy import process



df2['key'] = df2.Name.apply(lambda x: process.extract(x, df1.Name, limit=1)[0][0])

df2.merge(df1,left_on='key',right_on='Name')
Out[1238]:
       Name_x gender         key  Age      Name_y
0  adam Smith      M  Adam Smith   43  Adam Smith
1   Annie Kim      F    Anne Kim   21    Anne Kim
2  John Weber      M  John Weber   55  John Weber
3    Ian Ford      M    Ian Ford   24    Ian Ford
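
If some names in df2 have no good counterpart in df1, forcing a key can produce bad joins; a hypothetical variation with an arbitrary score cutoff of 80 leaves such rows unmatched:

# Variation with a score cutoff; unmatched names get no key and are dropped
def best_key(x, choices, cutoff=80):
    res = process.extractOne(x, choices, score_cutoff=cutoff)
    return res[0] if res else None

df2['key'] = df2.Name.apply(best_key, choices=list(df1.Name))
df2.dropna(subset=['key']).merge(df1, left_on='key', right_on='Name')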

Fuzzy match columns and merge/join dataframes

For those who need this, here's a solution I came up with.

merge = pd.merge(df, df2, left_on=['matches'],right_on=['Key'],how='outer').fillna(0)

From there you can drop unnecessary or duplicate columns and get a clean result like so:

clean = merge.drop(['matches', 'Key_y'], axis=1)
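
The snippet above assumes df already carries a matches column with the fuzzily matched keys. A hypothetical sketch of building it with fuzzywuzzy (the column names df['name'] and df2['Key'] and the cutoff of 80 are assumptions, not part of the original answer):

from fuzzywuzzy import fuzz, process

# Hypothetical: pick the closest 'Key' from df2 for every row of df
def closest_key(value, keys, cutoff=80):
    res = process.extractOne(value, keys, scorer=fuzz.token_sort_ratio, score_cutoff=cutoff)
    return res[0] if res else None

df['matches'] = df['name'].apply(closest_key, keys=list(df2['Key']))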


