Find the similarity metric between two strings
There is a built-in, difflib.SequenceMatcher:
from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
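Note that SequenceMatcher is case-sensitive, so "Apple" vs "APPLE" would score well below 1.0. A minimal sketch of a case-insensitive variant (the normalization step is my own addition, not part of difflib):

```python
from difflib import SequenceMatcher

def similar_ci(a, b):
    # Normalize case before comparing so "Apple" and "APPLE" match fully
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

print(similar_ci("Apple", "APPLE"))  # 1.0
```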
Speed up matching strings in Python
So I was able to speed up the matching step by using the postal code column as a discriminant, going from 1h40 to 7min of computation.
Below are small samples of the DataFrames:
df1 (127000,3)
Code Name PostalCode
150 Maarc 47111
250 Kirc 41111
170 Moic 42111
140 Nirc 44111
550 Lacter 47111
df2 (38000,3)
Code NAME POSTAL_CODE
150 Marc 47111
250 Kikc 41111
170 Mosc 49111
140 NiKc 44111
550 Lacter 47111
And below is the code that matches on the Name column and retrieves the name with the best score:
%%time
import difflib
import numpy as np
from functools import partial

def difflib_match(df1, df2):
    # Initialize the result columns with NaN
    df2['best'] = np.nan
    df2['score'] = np.nan
    # Unique postal codes, used as the discriminant
    codes = df2['POSTAL_CODE'].unique()
    # Loop over each postal code and only compare rows that share it
    for m, code in enumerate(codes):
        # Print progress every 100 unique values processed
        if m % 100 == 0:
            print(m, 'of', len(codes))
        df1_sub = df1[df1['PostalCode'] == code]
        df2_sub = df2[df2['POSTAL_CODE'] == code]
        # Match each NAME against df1's Name values for this postal code
        f = partial(difflib.get_close_matches,
                    possibilities=df1_sub['Name'].tolist(), n=1)
        matches = df2_sub['NAME'].map(f).str[0].fillna('')
        # Retrieve the similarity score for each match
        scores = [difflib.SequenceMatcher(None, x, y).ratio()
                  for x, y in zip(matches, df2_sub['NAME'])]
        # Write the results back into df2
        for i, name in enumerate(df2_sub['NAME']):
            df2['best'].where(df2['NAME'] != name, matches.iloc[i], inplace=True)
            df2['score'].where(df2['NAME'] != name, scores[i], inplace=True)
    return df2

# Apply the function
df_diff = difflib_match(df1, df2)
# Display the result
print('Shape: ', df_diff.shape)
df_diff.head()
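The per-postal-code filtering can also be expressed with a pandas groupby, which avoids re-scanning df1 and rewriting whole columns on every iteration. A rough sketch of the idea (column names follow the sample frames above; treat this as an illustration under those assumptions, not a drop-in replacement):

```python
import difflib
import pandas as pd

def match_by_postal(df1, df2):
    # Pre-group df1's names by postal code for a cheap lookup per row
    names_by_code = df1.groupby('PostalCode')['Name'].apply(list).to_dict()

    def best_match(row):
        # Only consider candidate names sharing this row's postal code
        candidates = names_by_code.get(row['POSTAL_CODE'], [])
        hits = difflib.get_close_matches(row['NAME'], candidates, n=1)
        if not hits:
            return pd.Series({'best': '', 'score': 0.0})
        score = difflib.SequenceMatcher(None, hits[0], row['NAME']).ratio()
        return pd.Series({'best': hits[0], 'score': score})

    # Attach the best match and its score as new columns
    return df2.join(df2.apply(best_match, axis=1))
```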
When to use which fuzz function to compare 2 strings
Great question.
I'm an engineer at SeatGeek, so I think I can help here. We have a great blog post that explains the differences quite well, but I can summarize and offer some insight into how we use the different types.
Overview
Under the hood, each of the four methods calculates the edit distance between some ordering of the tokens in both input strings. This is done using the difflib.ratio function, which will:
Return a measure of the sequences' similarity (float in [0,1]).
Where T is the total number of elements in both sequences, and M is
the number of matches, this is 2.0*M / T. Note that this is 1 if the
sequences are identical, and 0 if they have nothing in common.
The four fuzzywuzzy methods call difflib.ratio on different combinations of the input strings.
fuzz.ratio
Simple. Just calls difflib.ratio on the two input strings (code).
fuzz.ratio("NEW YORK METS", "NEW YORK MEATS")
> 96
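Since fuzz.ratio is the difflib ratio scaled to 0-100, the number above can be reproduced with the standard library alone. A sketch (the rounding convention is my assumption about fuzzywuzzy's default behavior):

```python
from difflib import SequenceMatcher

def simple_ratio(s1, s2):
    # Same 2.0*M/T measure, scaled to an integer percentage
    return round(100 * SequenceMatcher(None, s1, s2).ratio())

print(simple_ratio("NEW YORK METS", "NEW YORK MEATS"))  # 96
```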
fuzz.partial_ratio
Attempts to account for partial string matches better. Calls ratio using the shortest string (length n) against all n-length substrings of the larger string and returns the highest score (code).
Notice here that "YANKEES" is the shortest string (length 7), and we run the ratio with "YANKEES" against all substrings of length 7 of "NEW YORK YANKEES" (which would include checking against "YANKEES", a 100% match):
fuzz.ratio("YANKEES", "NEW YORK YANKEES")
> 60
fuzz.partial_ratio("YANKEES", "NEW YORK YANKEES")
> 100
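The sliding-window idea can be sketched with difflib alone. This is a simplification: the real partial_ratio uses matching blocks to pick candidate windows, while this version brute-forces every same-length substring:

```python
from difflib import SequenceMatcher

def simple_partial_ratio(s1, s2):
    # Compare the shorter string against every same-length window of the longer one
    shorter, longer = sorted((s1, s2), key=len)
    n = len(shorter)
    best = max(SequenceMatcher(None, shorter, longer[i:i + n]).ratio()
               for i in range(len(longer) - n + 1))
    return round(100 * best)

print(simple_partial_ratio("YANKEES", "NEW YORK YANKEES"))  # 100
```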
fuzz.token_sort_ratio
Attempts to account for similar strings out of order. Calls ratio on both strings after sorting the tokens in each string (code). Notice here fuzz.ratio and fuzz.partial_ratio both fail, but once you sort the tokens it's a 100% match:
fuzz.ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.partial_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 45
fuzz.token_sort_ratio("New York Mets vs Atlanta Braves", "Atlanta Braves vs New York Mets")
> 100
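Token sorting itself is a one-liner on top of difflib. A sketch of the idea (the real implementation also lowercases and strips punctuation first, which I've left out):

```python
from difflib import SequenceMatcher

def simple_token_sort_ratio(s1, s2):
    # Sort whitespace-separated tokens so word order no longer matters
    t1 = " ".join(sorted(s1.split()))
    t2 = " ".join(sorted(s2.split()))
    return round(100 * SequenceMatcher(None, t1, t2).ratio())

print(simple_token_sort_ratio("New York Mets vs Atlanta Braves",
                              "Atlanta Braves vs New York Mets"))  # 100
```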
fuzz.token_set_ratio
Attempts to rule out differences in the strings. Calls ratio on three particular substring sets and returns the max (code):
- intersection-only and the intersection with remainder of string one
- intersection-only and the intersection with remainder of string two
- intersection with remainder of one and intersection with remainder of two
Notice that by splitting up the intersection and remainders of the two strings, we're accounting for both how similar and different the two strings are:
fuzz.ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 36
fuzz.partial_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 61
fuzz.token_sort_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 51
fuzz.token_set_ratio("mariners vs angels", "los angeles angels of anaheim at seattle mariners")
> 91
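The three comparisons can be sketched like this (again simplified: fuzzywuzzy additionally lowercases, strips punctuation, and trims whitespace before comparing):

```python
from difflib import SequenceMatcher

def simple_token_set_ratio(s1, s2):
    t1, t2 = set(s1.split()), set(s2.split())
    # Sorted intersection, then intersection plus each string's leftover tokens
    inter = " ".join(sorted(t1 & t2))
    combined1 = (inter + " " + " ".join(sorted(t1 - t2))).strip()
    combined2 = (inter + " " + " ".join(sorted(t2 - t1))).strip()
    # Take the best of the three pairwise comparisons
    pairs = [(inter, combined1), (inter, combined2), (combined1, combined2)]
    return round(100 * max(SequenceMatcher(None, a, b).ratio() for a, b in pairs))

print(simple_token_set_ratio("mariners vs angels",
                             "los angeles angels of anaheim at seattle mariners"))  # 91
```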
Application
This is where the magic happens. At SeatGeek, essentially we create a vector of scores with each ratio for each data point (venue, event name, etc.) and use that to inform programmatic decisions of similarity that are specific to our problem domain.
That being said, truth be told, it doesn't sound like FuzzyWuzzy is useful for your use case. It will be tremendously bad at determining whether two addresses are similar. Consider two possible addresses for SeatGeek HQ: "235 Park Ave Floor 12" and "235 Park Ave S. Floor 12":
fuzz.ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 93
fuzz.partial_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 85
fuzz.token_sort_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 95
fuzz.token_set_ratio("235 Park Ave Floor 12", "235 Park Ave S. Floor 12")
> 100
FuzzyWuzzy gives these strings a high match score, but one address is our actual office near Union Square and the other is on the other side of Grand Central.
For your problem you would be better to use the Google Geocoding API.