Find the similarity metric between two strings
There is a built-in option: SequenceMatcher from the standard-library difflib module.
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
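Note that "Apple" and "Mango" score 0.0 because the comparison is case-sensitive: "A" and "a" count as different characters. If that is not what you want, here is a minimal sketch that folds case first. The name similar_ci is made up for this example.

```python
from difflib import SequenceMatcher


def similar_ci(a, b):
    # Hypothetical helper: fold case before comparing so that
    # "A" and "a" are treated as the same character.
    return SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


print(similar_ci("Apple", "APPLE"))
print(similar_ci("Apple", "Mango"))
```

The second call is no longer 0.0, since the two words now share the letter "a".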
Compare Similarity of two strings
You can try fuzzywuzzy to compute a score for each pair of names; then you only need to choose a score cutoff.
from fuzzywuzzy import fuzz
# Score each row by comparing its two name columns
df['score'] = df[['Name Left', 'Name Right']].apply(lambda x: fuzz.partial_ratio(*x), axis=1)
df
Out[134]:
   Match ID     Name Left           Name Right  score
0         1    LemonFarms      Lemon Farms Inc     90
1         2     Peachtree      PeachTree Farms     89
2         3  Tomato Grove  Orange Cheetah Farm     13
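The cutoff step itself could look roughly like this. This is only a sketch: to keep it runnable with the standard library alone, it swaps in difflib's SequenceMatcher for fuzz.partial_ratio, so the scores differ from the table above. The pairs list, the score helper and the cutoff of 70 are all assumptions for illustration.

```python
from difflib import SequenceMatcher

# Sample pairs copied from the table above.
pairs = [
    ("LemonFarms", "Lemon Farms Inc"),
    ("Peachtree", "PeachTree Farms"),
    ("Tomato Grove", "Orange Cheetah Farm"),
]

CUTOFF = 70  # assumed threshold; tune it on your own data


def score(left, right):
    # Stand-in scorer on a 0-100 scale. This is NOT fuzz.partial_ratio,
    # just a case-insensitive SequenceMatcher ratio scaled to percent.
    ratio = SequenceMatcher(None, left.lower(), right.lower()).ratio()
    return round(ratio * 100)


# Keep only the pairs whose score reaches the cutoff.
matches = [(left, right) for left, right in pairs if score(left, right) >= CUTOFF]
print(matches)
```

With the DataFrame from the answer, the equivalent filter would be df[df['score'] >= cutoff].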
Abbreviation similarity between strings
You can use a recursive algorithm similar to sequence alignment. Don't penalize shifts along the word (skipped letters are expected in abbreviations), but do penalize a mismatch in the first characters.
This one should work, for example:
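def abbreviation(abr, word, penalty=1):
    # Abbreviation fully consumed: nothing left to score.
    if len(abr) == 0:
        return 0
    # Word exhausted: penalize every leftover abbreviation character.
    elif len(word) == 0:
        return penalty * len(abr) * -1
    elif abr[0] == word[0]:
        if len(abr) > 1:
            # Either continue with the next abbreviation character,
            # or skip it at the cost of one penalty.
            return 1 + max(abbreviation(abr[1:], word[1:], penalty),
                           abbreviation(abr[2:], word[1:], penalty) - penalty)
        else:
            return 1 + abbreviation(abr[1:], word[1:], penalty)
    else:
        # Shifting along the word is free.
        return abbreviation(abr, word[1:], penalty)

def compute_match(abbr, word, penalty=1):
    score = abbreviation(abbr.lower(), word.lower(), penalty)
    # Extra penalty when the first characters differ.
    if abbr[0].lower() != word[0].lower():
        score -= penalty
    # Normalize by the abbreviation length.
    return score / len(abbr)
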
print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))
The output is:
1.0
1.0
1.0
0.6666666666666666
0.5
This indicates that wtw and wtwo are perfectly valid abbreviations for willis tower watson, and that stove is a valid abbreviation of Stackoverflow but tov is not, because it has the wrong first character. wtwx is only a partially valid abbreviation for willis tower watson, because it ends with a character that does not occur in the full name.
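One caveat: the plain recursion slices strings and can solve the same sub-problem many times, which gets slow for long names. A possible variant, sketched here under the assumption that the scoring rules stay exactly as above, works on indexes and caches results with functools.lru_cache. The name compute_match_cached is made up for this example.

```python
from functools import lru_cache


def compute_match_cached(abbr, word, penalty=1):
    # Hypothetical memoized variant of compute_match; same scoring rules.
    a, w = abbr.lower(), word.lower()

    @lru_cache(maxsize=None)
    def score(i, j):
        # i is the position in the abbreviation, j the position in the word.
        if i >= len(a):
            return 0
        if j >= len(w):
            # Word exhausted: penalize every leftover abbreviation character.
            return -penalty * (len(a) - i)
        if a[i] == w[j]:
            if len(a) - i > 1:
                # Continue with the next character, or skip it at a cost.
                return 1 + max(score(i + 1, j + 1),
                               score(i + 2, j + 1) - penalty)
            return 1 + score(i + 1, j + 1)
        # Shifting along the word is free.
        return score(i, j + 1)

    total = score(0, 0)
    if a[0] != w[0]:
        total -= penalty
    return total / len(a)


print(compute_match_cached("wtw", "willis tower watson"))
print(compute_match_cached("tov", "Stackoverflow"))
print(compute_match_cached("wtwx", "willis tower watson"))
```

Like the original, this assumes a non-empty abbreviation and word.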
String similarity metrics in Python
There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- and many more...
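To give an idea of how small these are in Python, here is a rough sketch of the first two metrics on the list. These are my own plain implementations, not ports of the code on that page, and the function names are just chosen for this example.

```python
def hamming(a, b):
    # Number of positions at which two equal-length strings differ.
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    # Classic dynamic-programming edit distance, keeping one row at a time.
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,             # delete x from a
                curr[j - 1] + 1,         # insert y into a
                prev[j - 1] + (x != y),  # substitute x with y (free if equal)
            ))
        prev = curr
    return prev[-1]


print(hamming("karolin", "kathrin"))
print(levenshtein("kitten", "sitting"))
```

Both return a distance (0 means identical), so lower is more similar, unlike a ratio-style score.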
How to compare similarity between two strings (other than English language) in Python
You can use SequenceMatcher from the built-in difflib module.
Code example:
import difflib
print(difflib.SequenceMatcher(None, "ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ").ratio())
Output:
0.9230769230769231
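One thing to watch with non-English text: SequenceMatcher compares Unicode code points, so the same visible text stored in composed and decomposed form will not match exactly. A small sketch, assuming that normalizing both strings first is acceptable for your data. The helper name similar_normalized is made up here.

```python
import unicodedata
from difflib import SequenceMatcher


def similar_normalized(a, b, form="NFC"):
    # Hypothetical helper: bring both strings to the same Unicode
    # normalization form before comparing code points.
    a = unicodedata.normalize(form, a)
    b = unicodedata.normalize(form, b)
    return SequenceMatcher(None, a, b).ratio()


composed = "caf\u00e9"     # "é" as a single code point
decomposed = "cafe\u0301"  # "e" followed by a combining acute accent

raw = SequenceMatcher(None, composed, decomposed).ratio()
print(raw)                                        # below 1.0
print(similar_normalized(composed, decomposed))   # 1.0 after normalization
```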