String Similarity Metrics in Python

String similarity metrics in Python

There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.

http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html

Here's a bit of the list:

Hamming distance
Levenshtein distance
Needleman-Wunch distance or Sellers Algorithm
and many more...

Find the similarity metric between two strings

There is a built in.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

Most efficient string similarity metric function

edlib seems to be fast enough for my use case.

It's a C++ lib with Python bindings that calculates the Levehnstein distance for texts <100kb in less than 10ms each (on my machine). 10kb texts are done in ~1ms, which is 100x faster than difflib.SequenceMatcher.

Similarity measure for Strings in Python

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

print (similar('tackoverflow','stackoverflow'))
print (similar('h0t','hot'))

0.96
0.666666666667

You could easily append the function or wrap it in another function to account for different degrees of similarities, like so, passing a third argument:

from difflib import SequenceMatcher

def similar(a, b, c):
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > c: 
        return sim

print (similar('tackoverflow','stackoverflow', 0.9))
print (similar('h0t','hot', 0.9))

0.96
None

Efficient way of generating new columns having string similarity distances between two string columns

Starting with a dataframe that looks like:

first_name	address	city	state	zip	url	phone	categories	first_name_2	address_2	city_2	state_2	zip_2	url_2	phone_2	categories_2
Rori	680 Buell Crossing	Dallas	Texas	75277	url_shortened	214-533-2179	Granite Surfaces	Agustin	7 Schiller Crossing	Lubbock	Texas	79410	url_shortened	806-729-7419	Roofing (Metal)
Dmitri	05 Coolidge Way	Charleston	West Virginia	25356	url_shortened	304-906-6384	Structural and Misc Steel (Fabrication)	Kearney	0547 Clemons Plaza	Peoria	Illinois	61651	url_shortened	309-326-4252	Framing (Steel)

String similarity in Python

What you're trying to do has already been implemented very well in the jellyfish package.

>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2

What string similarity algorithms are there?

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

String Similarity Metrics in Python