String similarity metrics in Python
There's a great resource for string similarity metrics at the University of Sheffield. It has a list of various metrics (beyond just Levenshtein) and has open-source implementations of them. Looks like many of them should be easy to adapt into Python.
http://web.archive.org/web/20081224234350/http://www.dcs.shef.ac.uk/~sam/stringmetrics.html
Here's a bit of the list:
- Hamming distance
- Levenshtein distance
- Needleman-Wunsch distance or Sellers Algorithm
- and many more...
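To make the first two entries of the list concrete, here is a minimal pure-Python sketch. It is not taken from the Sheffield implementations; the function names are my own, and the Levenshtein version is the textbook dynamic-programming form that keeps only two rows in memory.

```python
def hamming_distance(a: str, b: str) -> int:
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance requires strings of equal length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn one string into the other."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    previous = list(range(len(b) + 1))  # distances from "" to prefixes of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep a match
            ))
        previous = current
    return previous[-1]
```

Hamming distance only applies to strings of the same length, which is why the sketch raises an error otherwise; Levenshtein has no such restriction.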
Find the similarity metric between two strings
There is a built-in: SequenceMatcher in the standard library's difflib module.
from difflib import SequenceMatcher
def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the strings are identical
    return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
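For anyone wondering where 0.8 comes from: ratio() is defined as 2*M/T, where M is the number of matched characters and T is the combined length of both strings. The sketch below recomputes that by hand from get_matching_blocks(); the helper name ratio_by_hand is mine, only for illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is (start_in_a, start_in_b, size); the final block is a
    # zero-size sentinel, so it adds nothing to the sum.
    matched = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0
```

For "Apple" and "Appel" the matcher finds "App" plus one more character, so M is 4 and T is 10. Note that the comparison is case-sensitive, so lowercase the inputs first if that matters.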
Most efficient string similarity metric function
edlib seems to be fast enough for my use case.
It's a C++ library with Python bindings that calculates the Levenshtein distance for texts under 100 KB in less than 10 ms each (on my machine). 10 KB texts take about 1 ms, which is 100x faster than difflib.SequenceMatcher.
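If you want to check the speed difference on your own data, a rough timing harness could look like the following. This is only a sketch: the helper names are made up, edlib is a third-party package that has to be installed separately, and I'm assuming its align() call with default arguments, which returns a dict containing "editDistance". Keep in mind the two libraries compute different things (an edit distance versus a similarity ratio), so this compares runtime only.

```python
import time
from difflib import SequenceMatcher

try:
    import edlib  # third-party; optional for this sketch
except ImportError:
    edlib = None


def time_call(func, *args, repeat=3):
    """Return the best wall-clock time in seconds over `repeat` calls."""
    best = float("inf")
    for _ in range(repeat):
        start = time.perf_counter()
        func(*args)
        best = min(best, time.perf_counter() - start)
    return best


def difflib_ratio(a, b):
    return SequenceMatcher(None, a, b).ratio()


def edlib_distance(a, b):
    if edlib is None:
        raise RuntimeError("edlib is not installed")
    # Default mode is global alignment, i.e. the plain edit distance.
    return edlib.align(a, b)["editDistance"]


def compare(a, b):
    """Time both approaches on the same pair; skip edlib when unavailable."""
    timings = {"difflib": time_call(difflib_ratio, a, b)}
    if edlib is not None:
        timings["edlib"] = time_call(edlib_distance, a, b)
    return timings
```

Calling compare(text_a, text_b) returns a dict of seconds per library. Results will vary a lot with text length and machine, so treat the numbers above as one data point rather than a guarantee.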
Similarity measure for Strings in Python
You could just use difflib. This function I got from an answer some time ago has served me well:
from difflib import SequenceMatcher
def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()
print(similar('tackoverflow', 'stackoverflow'))
print(similar('h0t', 'hot'))
0.96
0.6666666666666666
You could easily extend the function, or wrap it in another one, to return a score only above a minimum degree of similarity by passing a threshold as a third argument:
from difflib import SequenceMatcher
def similar(a, b, threshold):
    # Return the ratio only when it exceeds the threshold; otherwise None.
    sim = SequenceMatcher(None, a, b).ratio()
    if sim > threshold:
        return sim
    return None
print(similar('tackoverflow', 'stackoverflow', 0.9))
print(similar('h0t', 'hot', 0.9))
0.96
None
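If the goal is to pick the closest candidates from a list rather than score a single pair, difflib already has a cutoff built in via get_close_matches. A small sketch, where the wrapper name best_matches and its defaults are my own choice:

```python
from difflib import get_close_matches


def best_matches(query, candidates, threshold=0.6, limit=3):
    """Return up to `limit` candidates whose similarity to `query` is at
    least `threshold`, ordered from most to least similar."""
    return get_close_matches(query, candidates, n=limit, cutoff=threshold)
```

For example, best_matches("appel", ["ape", "apple", "peach", "puppy"]) keeps "apple" and "ape", while raising the threshold to 0.9 returns an empty list.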
Efficient way of generating new columns having string similarity distances between two string columns
Starting with a dataframe that looks like:
first_name | address | city | state | zip | url | phone | categories | first_name_2 | address_2 | city_2 | state_2 | zip_2 | url_2 | phone_2 | categories_2 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Rori | 680 Buell Crossing | Dallas | Texas | 75277 | url_shortened | 214-533-2179 | Granite Surfaces | Agustin | 7 Schiller Crossing | Lubbock | Texas | 79410 | url_shortened | 806-729-7419 | Roofing (Metal) |
Dmitri | 05 Coolidge Way | Charleston | West Virginia | 25356 | url_shortened | 304-906-6384 | Structural and Misc Steel (Fabrication) | Kearney | 0547 Clemons Plaza | Peoria | Illinois | 61651 | url_shortened | 309-326-4252 | Framing (Steel) |
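One way to derive similarity columns from a layout like this, sketched with the standard library only. The rows are represented as plain dicts to keep the example dependency-free, the "_sim" column suffix and the function names are hypothetical, and difflib's ratio stands in for whichever metric you actually need.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in [0, 1]; values are converted to str first."""
    return SequenceMatcher(None, str(a), str(b)).ratio()


def add_similarity_columns(rows, columns, suffix="_2"):
    """For each name in `columns`, compare row[name] with row[name + suffix]
    and store the score under name + "_sim". Input rows are not modified."""
    result = []
    for row in rows:
        scored = dict(row)  # shallow copy so the caller's data stays intact
        for col in columns:
            scored[col + "_sim"] = similarity(row[col], row[col + suffix])
        result.append(scored)
    return result
```

With pandas, the same similarity function could be applied row-wise (for instance through DataFrame.apply with axis=1), although for large frames a compiled library is likely to be much faster than difflib.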
String similarity in Python
What you're trying to do has already been implemented very well in the jellyfish package.
>>> import jellyfish
>>> jellyfish.levenshtein_distance('jellyfish', 'smellyfish')
2
What string similarity algorithms are there?
It seems you need some kind of fuzzy matching. Here is a Java implementation of a set of similarity metrics: http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is a more detailed explanation of string metrics: http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which one to pick depends on how fuzzy and how fast your implementation must be.