Find the Similarity Metric Between Two Strings

Find the similarity metric between two strings

The standard library has this built in: SequenceMatcher from the difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
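To show where a value such as 0.8 comes from, here is a minimal sketch that recomputes the ratio by hand as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The helper name ratio_by_hand is made up for illustration; only SequenceMatcher and get_matching_blocks come from difflib.

```python
from difflib import SequenceMatcher

def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block describes a run of equal characters; the final block has size 0.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    # Two empty strings are treated as identical, as ratio() does.
    return 2.0 * matched / total if total else 1.0

print(ratio_by_hand("Apple", "Appel"))  # 4 matched characters: 2 * 4 / 10 = 0.8
```

Because the comparison is case-sensitive, "Apple" and "Mango" share no characters ("A" is not "a"), which is why their score is 0.0.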

Compare Similarity of two strings

You can try fuzzywuzzy: compute a score for each pair of names, then set a score cutoff to decide which pairs count as a match.

from fuzzywuzzy import fuzz
# df is an existing pandas DataFrame with 'Name Left' and 'Name Right' columns
df['score'] = df[['Name Left', 'Name Right']].apply(lambda x: fuzz.partial_ratio(*x), axis=1)
df
Out[134]:
   Match ID     Name Left           Name Right  score
0         1    LemonFarms      Lemon Farms Inc     90
1         2     Peachtree      PeachTree Farms     89
2         3  Tomato Grove  Orange Cheetah Farm     13
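The cutoff step itself could look roughly like the sketch below. It is a stand-in rather than the answer's method: it uses difflib.SequenceMatcher from the standard library instead of fuzzywuzzy so that it runs without extra packages, which means its scores differ from the partial_ratio values in the table. The cutoff of 60 and the names PAIRS, score and matches_above are assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Name pairs taken from the table above: (match id, left name, right name).
PAIRS = [
    (1, "LemonFarms", "Lemon Farms Inc"),
    (2, "Peachtree", "PeachTree Farms"),
    (3, "Tomato Grove", "Orange Cheetah Farm"),
]

def score(left, right):
    """Similarity on a 0-100 scale (stand-in for a fuzzywuzzy score)."""
    return 100 * SequenceMatcher(None, left, right).ratio()

def matches_above(pairs, cutoff):
    """Return the ids of the pairs whose score reaches the cutoff."""
    return [match_id for match_id, left, right in pairs
            if score(left, right) >= cutoff]

CUTOFF = 60  # hypothetical threshold; tune it on your own data
print(matches_above(PAIRS, CUTOFF))
```

With a DataFrame, the same idea is usually a boolean filter on the score column, with the threshold chosen by inspecting a sample of borderline pairs.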

Abbreviation similarity between strings

You can use a recursive algorithm similar to sequence alignment. Don't penalize shifts (skipped characters in the full word), since they are expected in abbreviations, but do penalize a mismatch in the first character.

This one should work, for example:

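def abbreviation(abr, word, penalty=1):
    # Score how well abr abbreviates word: +1 per matched character,
    # -penalty per abbreviation character that has to be dropped.
    if len(abr) == 0:
        return 0
    elif len(word) == 0:
        # Word exhausted: every remaining abbreviation character is unmatched.
        return penalty * len(abr) * -1
    elif abr[0] == word[0]:
        if len(abr) > 1:
            # Either continue with the next character, or drop it at a penalty.
            return 1 + max(abbreviation(abr[1:], word[1:], penalty),
                           abbreviation(abr[2:], word[1:], penalty) - penalty)
        else:
            return 1 + abbreviation(abr[1:], word[1:], penalty)
    else:
        # Skipping a character of the word is free.
        return abbreviation(abr, word[1:], penalty)

def compute_match(abbr, word, penalty=1):
    score = abbreviation(abbr.lower(),
                         word.lower(),
                         penalty)
    # Penalize an abbreviation that does not start with the word's first character.
    if abbr[0].lower() != word[0].lower():
        score -= penalty

    score = score / len(abbr)

    return score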

print(compute_match("wtw", "willis tower watson"))
print(compute_match("wtwo", "willis tower watson"))
print(compute_match("stove", "Stackoverflow"))
print(compute_match("tov", "Stackoverflow"))
print(compute_match("wtwx", "willis tower watson"))

The output is:

1.0
1.0
1.0
0.6666666666666666
0.5

This indicates that wtw and wtwo are perfectly valid abbreviations of willis tower watson, and that stove is a valid abbreviation of Stackoverflow while tov is not, because it has the wrong first character.
wtwx is only a partially valid abbreviation of willis tower watson because it ends with a character that does not occur in the full name.
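The recursion can revisit the same pair of suffixes many times, so for longer inputs a memoized variant may be worth considering. The following is only a sketch of that idea, not part of the original answer: the name abbreviation_score and the index-based inner function are assumptions, and it applies penalty at every recursive step. It is meant to follow the same scoring rules as the functions above.

```python
from functools import lru_cache

def abbreviation_score(abbr, word, penalty=1):
    """Memoized variant of the abbreviation scoring (hypothetical helper)."""
    abbr, word = abbr.lower(), word.lower()

    @lru_cache(maxsize=None)
    def score(i, j):
        # i is the position in abbr, j is the position in word.
        if i >= len(abbr):
            return 0
        if j >= len(word):
            # Word exhausted: each remaining abbreviation character is unmatched.
            return -penalty * (len(abbr) - i)
        if abbr[i] == word[j]:
            if len(abbr) - i > 1:
                # Continue with the next character, or drop it at a penalty.
                return 1 + max(score(i + 1, j + 1),
                               score(i + 2, j + 1) - penalty)
            return 1 + score(i + 1, j + 1)
        # Skipping a character of the word is free.
        return score(i, j + 1)

    total = score(0, 0)
    if abbr[0] != word[0]:
        total -= penalty
    return total / len(abbr)

print(abbreviation_score("wtw", "willis tower watson"))
print(abbreviation_score("tov", "Stackoverflow"))
```

Indexing into the strings instead of slicing keeps the cache keys small, and each (i, j) pair is computed only once.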

String similarity metrics in Python

There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers Algorithm
  • and many more...
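As an illustration of how compact the first two metrics on the list are in Python, here is a minimal sketch written from the textbook definitions, not adapted from the Sheffield implementations. The function names are my own.

```python
def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(x != y for x, y in zip(a, b))

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn a into b (row-by-row dynamic programming)."""
    # previous[j] is the distance between the processed prefix of a and b[:j].
    previous = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        current = [i]
        for j, y in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,              # delete x
                current[j - 1] + 1,           # insert y
                previous[j - 1] + (x != y),   # substitute, free if equal
            ))
        previous = current
    return previous[-1]

print(hamming("karolin", "kathrin"))     # 3
print(levenshtein("kitten", "sitting"))  # 3
```

Note that both are distances (0 means identical), whereas SequenceMatcher.ratio() is a similarity between 0 and 1.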

How to compare similarity between two strings (other than English language) in Python

You can use SequenceMatcher from the built-in difflib module. It works on any Python string, including non-English text.

Code example:

import difflib

print(difflib.SequenceMatcher(None, "ਬੁੱਧਵਾਰ", "ਬੁੱਧਵਾ").ratio())

Output:

0.9230769230769231
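One caveat that may matter for non-English text: SequenceMatcher compares strings code point by code point, so the same visible text written with different combining sequences can score below 1.0. A possible precaution, which is not part of the original answer, is to apply Unicode normalization first. The helper name similar_normalized and the choice of NFC are assumptions in this sketch.

```python
import unicodedata
from difflib import SequenceMatcher

def similar_normalized(a, b, form="NFC"):
    """Normalize both strings to the same Unicode form before comparing."""
    a = unicodedata.normalize(form, a)
    b = unicodedata.normalize(form, b)
    return SequenceMatcher(None, a, b).ratio()

composed = "caf\u00e9"      # 'é' stored as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings look the same on screen but differ as code point sequences.
print(SequenceMatcher(None, composed, decomposed).ratio())
print(similar_normalized(composed, decomposed))
```

Whether normalization is appropriate depends on the script and on whether such differences should count as real differences in your data.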

