Compare String Similarity

What are some algorithms for comparing how similar two strings are?

What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
  • Hamming Distance : The number of characters that are different in two equal length strings.
  • Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.

Find the similarity metric between two strings

There is a built in.

from difflib import SequenceMatcher

def similar(a, b):
return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

Compare Similarity of two strings

You can try fuzzywuzzy with score , then you just need to set up score limit for cut

from fuzzywuzzy import fuzz
df['score'] = df[['Name Left','Name Right']].apply(lambda x : fuzz.partial_ratio(*x),axis=1)
df
Out[134]:
Match ID Name Left Name Right score
0 1 LemonFarms Lemon Farms Inc 90
1 2 Peachtree PeachTree Farms 89
2 3 Tomato Grove Orange Cheetah Farm 13

What string similarity algorithms are there?

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

Compare string similarity

static class LevenshteinDistance
{
public static int Compute(string s, string t)
{
if (string.IsNullOrEmpty(s))
{
if (string.IsNullOrEmpty(t))
return 0;
return t.Length;
}

if (string.IsNullOrEmpty(t))
{
return s.Length;
}

int n = s.Length;
int m = t.Length;
int[,] d = new int[n + 1, m + 1];

// initialize the top and right of the table to 0, 1, 2, ...
for (int i = 0; i <= n; d[i, 0] = i++);
for (int j = 1; j <= m; d[0, j] = j++);

for (int i = 1; i <= n; i++)
{
for (int j = 1; j <= m; j++)
{
int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
int min1 = d[i - 1, j] + 1;
int min2 = d[i, j - 1] + 1;
int min3 = d[i - 1, j - 1] + cost;
d[i, j] = Math.Min(Math.Min(min1, min2), min3);
}
}
return d[n, m];
}
}

Similarity String Comparison in Java

Yes, there are many well documented algorithms like:

  • Cosine similarity
  • Jaccard similarity
  • Dice's coefficient
  • Matching similarity
  • Overlap similarity
  • etc etc

A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)

Also check these projects:

  • Simmetrics
  • jtmt

Is there any way to compare two string similarity quantitatively

Something that you want is similar to Levenshtein Distance. It gives you distance between two strings even if their lengths are not equal.

If two strings are exactly same then distance will be 0 and if they are similar then distance will be less.

Sample Code from Wikipedia:

// len_s and len_t are the number of characters in string s and t respectively
int LevenshteinDistance(string s, int len_s, string t, int len_t)
{ int cost;

/* base case: empty strings */
if (len_s == 0) return len_t;
if (len_t == 0) return len_s;

/* test if last characters of the strings match */
if (s[len_s-1] == t[len_t-1])
cost = 0;
else
cost = 1;

/* return minimum of delete char from s, delete char from t, and delete char from both */
return minimum(LevenshteinDistance(s, len_s - 1, t, len_t ) + 1,
LevenshteinDistance(s, len_s , t, len_t - 1) + 1,
LevenshteinDistance(s, len_s - 1, t, len_t - 1) + cost);
}

Similarity measure for Strings in Python

You could just use difflib. This function I got from an answer some time ago has served me well:

from difflib import SequenceMatcher

def similar(a, b):
return SequenceMatcher(None, a, b).ratio()

print (similar('tackoverflow','stackoverflow'))
print (similar('h0t','hot'))

0.96
0.666666666667

You could easily append the function or wrap it in another function to account for different degrees of similarities, like so, passing a third argument:

from difflib import SequenceMatcher

def similar(a, b, c):
sim = SequenceMatcher(None, a, b).ratio()
if sim > c:
return sim

print (similar('tackoverflow','stackoverflow', 0.9))
print (similar('h0t','hot', 0.9))

0.96
None


Related Topics



Leave a reply



Submit