Good Python Modules for Fuzzy String Comparison

Good Python modules for fuzzy string comparison?

Levenshtein Python extension and C library.

https://github.com/ztane/python-Levenshtein/

The Levenshtein Python C extension module contains functions for fast
computation of
- Levenshtein (edit) distance, and edit operations
- string similarity
- approximate median strings, and generally string averaging
- string sequence and set similarity
It supports both normal and Unicode strings.

$ pip install python-levenshtein
...
$ python
>>> import Levenshtein
>>> help(Levenshtein.ratio)
ratio(...)
Compute similarity of two strings.

ratio(string1, string2)

The similarity is a number between 0 and 1; it is usually equal to or
somewhat higher than difflib.SequenceMatcher.ratio(), because it is
based on the real minimal edit distance.

Examples:
>>> ratio('Hello world!', 'Holly grail!')
0.58333333333333337
>>> ratio('Brian', 'Jesus')
0.0

>>> help(Levenshtein.distance)
distance(...)
Compute absolute Levenshtein distance of two strings.

distance(string1, string2)

Examples (it's hard to spell Levenshtein correctly):
>>> distance('Levenshtein', 'Lenvinsten')
4
>>> distance('Levenshtein', 'Levensthein')
2
>>> distance('Levenshtein', 'Levenshten')
1
>>> distance('Levenshtein', 'Levenshtein')
0
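
To see the difference against the standard library in practice, here is a small sketch that prints both scores for the same pair (the difflib value is not asserted here):

import difflib
import Levenshtein

a, b = 'Hello world!', 'Holly grail!'

# difflib.SequenceMatcher works on matching blocks, while Levenshtein.ratio is
# derived from the true minimal edit distance, so it is usually equal or higher.
print(difflib.SequenceMatcher(None, a, b).ratio())
print(Levenshtein.ratio(a, b))  # 0.5833..., as shown above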

Is there a standard way in Python to fuzzy match a string with arbitrary list of acceptable values?

Use difflib.get_close_matches.

get_close_matches(word, possibilities[, n][, cutoff])

Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and
possibilities is a list of sequences against which to match word
(typically a list of strings).
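
For example (the first call is the example from the difflib documentation):

>>> from difflib import get_close_matches
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'])
['apple', 'ape']
>>> get_close_matches('appel', ['ape', 'apple', 'peach', 'puppy'], n=1)
['apple']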

How to get all fuzzy matching substrings between two strings in python?

Here is code that calculates the similarity, by fuzzy ratio, between a sub-string of string1 and the full string2. The code can also handle a sub-string of string2 against the full string1, as well as a sub-string of string1 against a sub-string of string2.

This one uses nltk to generate ngrams.

Typical algorithm:

  1. Generate ngrams from the given first string (a minimal nltk sketch follows this list).

    Example:

    text2 = "The time of discomfort was 3 days ago."

    total_length = 8 (number of words in text2)

In the code the param takes the values 5, 6, 7 and 8.

param = 5

ngrams = ['The time of discomfort was', 'time of discomfort was 3',
          'of discomfort was 3 days', 'discomfort was 3 days ago.']


  2. Compare each ngram with the second string.

    Example:

    text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."

param = 5

  • compare 'The time of discomfort was' vs text1 and get the fuzzy score
  • compare 'time of discomfort was 3' vs text1 and get the fuzzy score
  • and so on until all elements in ngrams_5 are finished

    Save the sub-string if the fuzzy score is greater than or equal to the given threshold.

param = 6

  • compare 'The time of discomfort was 3' vs text1 and get the fuzzy score
  • and so on

until param = 8
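
For step 1, here is a minimal sketch of how nltk generates the 5-grams listed above (this mirrors what the full code below does for param = 5):

from nltk.util import ngrams

text2 = "The time of discomfort was 3 days ago."  # 8 words
ngrams_5 = [' '.join(g) for g in ngrams(text2.split(), 5)]
print(ngrams_5)
# ['The time of discomfort was', 'time of discomfort was 3',
#  'of discomfort was 3 days', 'discomfort was 3 days ago.']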

You can revise the code by changing n_start to 5 or so, so that the ngrams of string1 are compared with the ngrams of string2; in that case a sub-string of string1 is compared with a sub-string of string2.

# Generate ngrams for string2
n_start = 5  # instead of st2_length
for n in range(n_start, st2_length + 1):
    ...

For comparison I use:

fratio = fuzz.token_set_ratio(fs1, fs2)

You can also try the other fuzzywuzzy ratios, such as fuzz.ratio, fuzz.partial_ratio and fuzz.token_sort_ratio, and see which one ranks your data best.
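
For instance, a quick sketch that scores one of the generated ngrams against text1 with each scorer (outputs not shown; run it to compare):

from fuzzywuzzy import fuzz

fs1 = 'discomfort was 3 days ago.'
fs2 = ("Patient has checked in for abdominal pain which started 3 days ago. "
       "Patient was prescribed idx 20 mg every 4 hours.")

# Each scorer treats word order and partial overlap differently,
# so they can rank the same candidates quite differently.
print(fuzz.ratio(fs1, fs2))
print(fuzz.partial_ratio(fs1, fs2))
print(fuzz.token_sort_ratio(fs1, fs2))
print(fuzz.token_set_ratio(fs1, fs2))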

Your sample 'prescription of idx, 20mg to be given every four hours' has a fuzzy score of 52.

See sample console output.

7                    prescription of idx, 20mg to be given every four hours           52

Code

"""
fuzzy_match.py

https://stackoverflow.com/questions/72017146/how-to-get-all-fuzzy-matching-substrings-between-two-strings-in-python

Dependent modules:
pip install pandas
pip install nltk
pip install fuzzywuzzy
pip install python-Levenshtein

"""

from nltk.util import ngrams
import pandas as pd
from fuzzywuzzy import fuzz

# Sample strings.
text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text2 = "The time of discomfort was 3 days ago."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

def myprocess(st1: str, st2: str, threshold):
    """
    Generate sub-strings from st1 and compare them with st2.
    The sub-strings, full string and fuzzy ratio will be saved in a csv file.
    """
    data = []
    st1_length = len(st1.split())
    st2_length = len(st2.split())

    # Generate ngrams for string1
    m_start = 5
    for m in range(m_start, st1_length + 1):  # st1_length >= m_start

        # If m=3, fs1 = 'Patient has checked', 'has checked in', 'checked in for' ...
        # If m=5, fs1 = 'Patient has checked in for', 'has checked in for abdominal', ...
        for s1 in ngrams(st1.split(), m):
            fs1 = ' '.join(s1)

            # Generate ngrams for string2
            n_start = st2_length
            for n in range(n_start, st2_length + 1):
                for s2 in ngrams(st2.split(), n):
                    fs2 = ' '.join(s2)

                    fratio = fuzz.token_set_ratio(fs1, fs2)  # there are other ratios

                    # Save the sub-string if the ratio meets the threshold.
                    if fratio >= threshold:
                        data.append([fs1, fs2, fratio])

    return data


def get_match(sub, full, colname1, colname2, threshold=50):
    """
    sub: the string from which sub-strings are extracted.
    full: the string used as the base/reference.
    threshold: the minimum fuzzy ratio at which a sub-string is saved. Max fuzzy ratio is 100.
    """
    save = myprocess(sub, full, threshold)

    df = pd.DataFrame(save)
    if len(df):
        df.columns = [colname1, colname2, 'fuzzy_ratio']

        is_sort_by_fuzzy_ratio_first = True

        if is_sort_by_fuzzy_ratio_first:
            df = df.sort_values(by=['fuzzy_ratio', colname1], ascending=[False, False])
        else:
            df = df.sort_values(by=[colname1, 'fuzzy_ratio'], ascending=[False, False])

        df = df.reset_index(drop=True)

        df.to_csv(f'{colname1}_{colname2}.csv', index=False)

        # Print to console. Show only the sub-string and the fuzzy ratio. A high ratio implies high similarity.
        df1 = df[[colname1, 'fuzzy_ratio']]
        print(df1.to_string())
        print()

        print(f'sub: {sub}')
        print(f'base: {full}')
        print()


def main():
    get_match(text2, text1, 'string2', 'string1', threshold=50)  # output string2_string1.csv
    get_match(text3, text1, 'string3', 'string1', threshold=50)

    get_match(text2, text3, 'string2', 'string3', threshold=10)

    # Other param combos can be tried as well.


if __name__ == '__main__':
    main()

Console Output

                                  string2  fuzzy_ratio
0              discomfort was 3 days ago.           72
1           of discomfort was 3 days ago.           67
2      time of discomfort was 3 days ago.           60
3                of discomfort was 3 days           59
4  The time of discomfort was 3 days ago.           55
5           time of discomfort was 3 days           51

sub: The time of discomfort was 3 days ago.
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                                                    string3  fuzzy_ratio
 0                                                be given every four hours           61
 1                                   idx, 20mg to be given every four hours           58
 2       was given a prescription of idx, 20mg to be given every four hours           56
 3                                             to be given every four hours           56
 4  John was given a prescription of idx, 20mg to be given every four hours           56
 5                                of idx, 20mg to be given every four hours           55
 6             was given a prescription of idx, 20mg to be given every four           52
 7                   prescription of idx, 20mg to be given every four hours           52
 8           given a prescription of idx, 20mg to be given every four hours           52
 9                 a prescription of idx, 20mg to be given every four hours           52
10        John was given a prescription of idx, 20mg to be given every four           52
11                                              idx, 20mg to be given every           51
12                                        20mg to be given every four hours           50

sub: John was given a prescription of idx, 20mg to be given every four hours
base: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

                                  string2  fuzzy_ratio
0      time of discomfort was 3 days ago.           41
1           time of discomfort was 3 days           41
2                time of discomfort was 3           40
3                of discomfort was 3 days           40
4  The time of discomfort was 3 days ago.           40
5           of discomfort was 3 days ago.           39
6       The time of discomfort was 3 days           39
7              The time of discomfort was           38
8            The time of discomfort was 3           35
9              discomfort was 3 days ago.           34

sub: The time of discomfort was 3 days ago.
base: John was given a prescription of idx, 20mg to be given every four hours

Sample CSV output

string2_string1.csv


Using spaCy similarity

Here is the result of comparing the sub-strings of text3 against the full text of text1 using spaCy.

The result below is intended to be compared with the 2nd table above, to see which method produces a better similarity ranking.

I use the large model (en_core_web_lg) to get the result below.

Code

import spacy
import pandas as pd

nlp = spacy.load("en_core_web_lg")

text1 = "Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours."
text3 = "John was given a prescription of idx, 20mg to be given every four hours"

text3_sub = [
    'be given every four hours', 'idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four hours',
    'to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four hours',
    'of idx, 20mg to be given every four hours',
    'was given a prescription of idx, 20mg to be given every four',
    'prescription of idx, 20mg to be given every four hours',
    'given a prescription of idx, 20mg to be given every four hours',
    'a prescription of idx, 20mg to be given every four hours',
    'John was given a prescription of idx, 20mg to be given every four',
    'idx, 20mg to be given every',
    '20mg to be given every four hours'
]

data = []
for s in text3_sub:
    doc1 = nlp(s)
    doc2 = nlp(text1)
    sim = round(doc1.similarity(doc2), 3)
    data.append([s, text1, sim])

df = pd.DataFrame(data)
df.columns = ['from text3', 'text1', 'similarity']
df = df.sort_values(by=['similarity'], ascending=[False])
df = df.reset_index(drop=True)

df1 = df[['from text3', 'similarity']]
print(df1.to_string())

print()
print(f'text3: {text3}')
print(f'text1: {text1}')

Output

                                                                 from text3  similarity
 0       was given a prescription of idx, 20mg to be given every four hours       0.904
 1  John was given a prescription of idx, 20mg to be given every four hours       0.902
 2                 a prescription of idx, 20mg to be given every four hours       0.895
 3                   prescription of idx, 20mg to be given every four hours       0.893
 4           given a prescription of idx, 20mg to be given every four hours       0.892
 5                                of idx, 20mg to be given every four hours       0.889
 6                                   idx, 20mg to be given every four hours       0.883
 7             was given a prescription of idx, 20mg to be given every four       0.879
 8        John was given a prescription of idx, 20mg to be given every four       0.877
 9                                        20mg to be given every four hours       0.877
10                                              idx, 20mg to be given every       0.835
11                                             to be given every four hours       0.834
12                                                be given every four hours       0.832

text3: John was given a prescription of idx, 20mg to be given every four hours
text1: Patient has checked in for abdominal pain which started 3 days ago. Patient was prescribed idx 20 mg every 4 hours.

It looks like the spaCy method produces a nice similarity ranking.

Python string comparison similarity

You could use an implementation of the Levenshtein Distance algorithm for non-precise string matching, for instance this one from Wikibooks.
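
If you do not want an extra dependency, here is a minimal sketch of such an implementation (a standard two-row dynamic-programming version, not the Wikibooks code verbatim):

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance, O(len(a) * len(b))."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            substitute_cost = previous[j - 1] + (ca != cb)
            current.append(min(insert_cost, delete_cost, substitute_cost))
        previous = current
    return previous[-1]

print(levenshtein('kitten', 'sitting'))  # 3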

Another option would be to, for instance, fold everything to lower case, remove spaces, etc. prior to raw comparison -- this of course depends on your use case:

import string, unicodedata

allowed = string.ascii_letters + string.digits

def fold(s):
    # Strip accents, lower-case, and keep only ASCII letters and digits.
    s = unicodedata.normalize("NFKD", s.lower()).encode("ascii", "ignore").decode("ascii")
    return "".join(c for c in s if c in allowed)

for example in ['abc LLC', 'xyz, LLC', 'abc , LLC', 'xyz LLC']:
    print("%r -> %r" % (example, fold(example)))

would print

'abc LLC' -> 'abcllc'
'xyz, LLC' -> 'xyzllc'
'abc , LLC' -> 'abcllc'
'xyz LLC' -> 'xyzllc'

