String Similarity Metrics in Python

There's a great resource for string similarity metrics at the University of Sheffield. It lists many metrics (beyond just Levenshtein) and provides open-source implementations of them. Many of them look easy to port to Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

  • Hamming distance
  • Levenshtein distance
  • Needleman-Wunsch distance or Sellers Algorithm
  • and many more...
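To make the first two entries concrete, here is a minimal pure-Python sketch of Hamming and Levenshtein distance. It is not taken from the Sheffield implementations; the function names are my own, and the Levenshtein version is the textbook row-by-row dynamic programme, so treat it as an illustration rather than a tuned implementation.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    # previous[j] is the distance between the already processed prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # turning a[:i] into the empty string takes i deletions
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a matching character
            ))
        previous = current
    return previous[-1]


print(hamming_distance("karolin", "kathrin"))     # 3
print(levenshtein_distance("kitten", "sitting"))  # 3
```

Hamming distance only works on strings of the same length, which is why Levenshtein is usually the more practical choice for free text.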

Find the similarity metric between two strings

There is a built-in: SequenceMatcher from the standard library's difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

Most efficient string similarity metric function

edlib seems to be fast enough for my use case.

It's a C++ library with Python bindings that calculates the Levenshtein distance for texts under 100 kB in less than 10 ms each (on my machine). 10 kB texts take about 1 ms, which is 100x faster than difflib.SequenceMatcher.
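A minimal sketch of how the call might look. The wrapper name edlib_distance is hypothetical, and the guarded import is only there so the snippet still runs when edlib (a third-party package, installed separately) is missing; in that case it returns None instead of a distance.

```python
try:
    import edlib  # third-party package, not part of the standard library
except ImportError:
    edlib = None


def edlib_distance(a, b):
    """Levenshtein distance via edlib, or None if edlib is not installed."""
    if edlib is None:
        return None
    # mode="NW" requests a global alignment; task="distance" skips
    # computing the alignment path, which is all we need here
    return edlib.align(a, b, mode="NW", task="distance")["editDistance"]


print(edlib_distance("jellyfish", "smellyfish"))  # 2 when edlib is installed
```

Timings will vary by machine and input, so it is worth measuring on your own data before relying on the figures above.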

Similarity measure for Strings in Python

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print(similar('tackoverflow', 'stackoverflow'))
print(similar('h0t', 'hot'))

0.96
0.666666666667

You could easily extend the function, or wrap it in another function, to account for different degrees of similarity by passing a threshold as a third argument:

from difflib import SequenceMatcher

def similar(a, b, c):
    sim = SequenceMatcher(None, a, b).ratio()
    # Implicitly returns None when the ratio does not exceed the threshold c
    if sim > c:
        return sim

print(similar('tackoverflow', 'stackoverflow', 0.9))
print(similar('h0t', 'hot', 0.9))

0.96
None
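If the goal is to pick matches out of a list rather than score one pair, difflib also ships get_close_matches, which takes the threshold as its cutoff argument. A small sketch, with a made-up candidates list; note that cutoff is inclusive (ratio >= cutoff), whereas the function above uses a strict comparison.

```python
from difflib import get_close_matches

# Hypothetical list of known strings to match against
candidates = ["stackoverflow", "hot", "mango"]

# Only candidates whose similarity ratio is at least `cutoff` are returned,
# best match first; n caps how many matches come back
print(get_close_matches("tackoverflow", candidates, n=1, cutoff=0.9))  # ['stackoverflow']
print(get_close_matches("h0t", candidates, n=1, cutoff=0.9))           # []
print(get_close_matches("h0t", candidates, n=1, cutoff=0.6))           # ['hot']
```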

Efficient way of generating new columns having string similarity distances between two string columns

Starting with a dataframe that looks like:

first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal)
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel)
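The rest of this answer is missing here, so the following is only a sketch of one way to get a similarity column for each pair of columns, not the original solution. It uses plain dicts and difflib so it runs without pandas; the "_sim" suffix and the subset of columns are my own choices, and any other metric could be swapped in for similar.

```python
from difflib import SequenceMatcher


def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()


# Two rows from the table above, reduced to a few column pairs for brevity
rows = [
    {"first_name": "Rori", "city": "Dallas", "state": "Texas",
     "first_name_2": "Agustin", "city_2": "Lubbock", "state_2": "Texas"},
    {"first_name": "Dmitri", "city": "Charleston", "state": "West Virginia",
     "first_name_2": "Kearney", "city_2": "Peoria", "state_2": "Illinois"},
]

# Every column "x" is compared with its counterpart "x_2"
columns = ["first_name", "city", "state"]

for row in rows:
    for col in columns:
        # Hypothetical naming scheme: the score for "x" is stored under "x_sim"
        row[col + "_sim"] = similar(row[col], row[col + "_2"])

print(rows[0]["state_sim"])  # 1.0, because both states are "Texas"
```

With an actual pandas dataframe df, the same idea can be written per column as df[col + "_sim"] = [similar(a, b) for a, b in zip(df[col], df[col + "_2"])].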

String similarity in Python

What you're trying to do has already been implemented very well in the jellyfish package.

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2

What string similarity algorithms are there?

It seems you need some kind of fuzzy matching. Here is a Java implementation of a set of similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. A more detailed explanation of string metrics is at http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which one to choose depends on how fuzzy and how fast your implementation must be.


