How to Calculate Distance Similarity Measure of Given 2 Strings

How to calculate distance similarity measure of given 2 strings?

What you are looking for is called edit distance, or Levenshtein distance. The Wikipedia article explains how it is calculated and includes pseudocode at the bottom that is straightforward to port to C#.

Here's an implementation from the first site linked below:

private static int CalcLevenshteinDistance(string a, string b)
{
    if (String.IsNullOrEmpty(a) && String.IsNullOrEmpty(b)) {
        return 0;
    }
    if (String.IsNullOrEmpty(a)) {
        return b.Length;
    }
    if (String.IsNullOrEmpty(b)) {
        return a.Length;
    }
    int lengthA = a.Length;
    int lengthB = b.Length;
    var distances = new int[lengthA + 1, lengthB + 1];
    for (int i = 0; i <= lengthA; distances[i, 0] = i++);
    for (int j = 0; j <= lengthB; distances[0, j] = j++);

    for (int i = 1; i <= lengthA; i++)
        for (int j = 1; j <= lengthB; j++)
        {
            int cost = b[j - 1] == a[i - 1] ? 0 : 1;
            distances[i, j] = Math.Min
            (
                Math.Min(distances[i - 1, j] + 1, distances[i, j - 1] + 1),
                distances[i - 1, j - 1] + cost
            );
        }
    return distances[lengthA, lengthB];
}
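
The function above returns a raw edit count, which is hard to compare across strings of different lengths. Here is a minimal Python sketch that turns the distance into a score between 0 and 1, assuming the common convention of dividing by the longer string's length. That normalization and the names levenshtein and similarity are illustrative choices, not part of the C# answer. It also keeps only two rows of the table instead of the full matrix:

```python
def levenshtein(a, b):
    """Edit distance using two rows of the DP table instead of the full matrix."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a, b):
    """Assumed normalization: 1.0 for identical strings,
    0.0 when the distance equals the longer string's length."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1.0 - levenshtein(a, b) / longest
```

Other normalizations exist (for example dividing by the sum of the lengths), so treat the score as one reasonable option rather than a standard.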

Find the similarity metric between two strings

There is a built-in: SequenceMatcher, in the standard library's difflib module.

from difflib import SequenceMatcher

def similar(a, b):
    return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
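
SequenceMatcher compares characters exactly, so a difference in case lowers the ratio. The sketch below assumes case should be ignored, and also shows difflib.get_close_matches for picking the nearest candidates from a list. The function names similar_ignore_case and best_matches are made up for illustration:

```python
from difflib import SequenceMatcher, get_close_matches


def similar_ignore_case(a, b):
    # Lowercase both inputs first so "Apple" and "apple" count as identical.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def best_matches(word, candidates, cutoff=0.6):
    # Returns up to three candidates whose ratio is at least `cutoff`,
    # ordered from most to least similar.
    return get_close_matches(word, candidates, n=3, cutoff=cutoff)
```

The cutoff of 0.6 is simply the library default; raise it to demand closer matches.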

Compare Similarity of two strings

You can try fuzzywuzzy to score each pair, then set a score threshold as the cutoff for what counts as a match.

from fuzzywuzzy import fuzz
df['score'] = df[['Name Left', 'Name Right']].apply(lambda x: fuzz.partial_ratio(*x), axis=1)
df
Out[134]:
   Match ID     Name Left           Name Right  score
0         1    LemonFarms      Lemon Farms Inc     90
1         2     Peachtree      PeachTree Farms     89
2         3  Tomato Grove  Orange Cheetah Farm     13
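
Applying the cut is a simple filter once the scores exist. The sketch below reuses the three scored rows from the output above as plain tuples, so it runs without pandas or fuzzywuzzy. The threshold of 80 and the names SCORE_CUTOFF and accepted are arbitrary assumptions and should be tuned on real data:

```python
# (match_id, name_left, name_right, score) rows copied from the output above
rows = [
    (1, "LemonFarms", "Lemon Farms Inc", 90),
    (2, "Peachtree", "PeachTree Farms", 89),
    (3, "Tomato Grove", "Orange Cheetah Farm", 13),
]

SCORE_CUTOFF = 80  # assumed threshold, not taken from the original answer


def accepted(scored_rows, cutoff=SCORE_CUTOFF):
    # Keep only the pairs whose score reaches the cutoff.
    return [row for row in scored_rows if row[3] >= cutoff]


matches = accepted(rows)
```

With the DataFrame from the answer, the equivalent filter would be df[df['score'] >= 80].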

How can I measure the similarity between 2 strings?

There are various different ways of doing this. Have a look at the Wikipedia "String similarity measures" page for links to other pages with algorithms.

I don't think any of those algorithms take sounds into consideration, however - so "staq overflow" would be as similar to "stack overflow" as "staw overflow" despite the first being more similar in terms of pronunciation.

I've just found another page which gives rather more options... in particular, the Soundex algorithm (Wikipedia) may be closer to what you're after.
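
Soundex reduces a word to its first letter plus three digits describing the consonant sounds that follow, so words that sound alike tend to share a code. Below is a minimal Python sketch of the commonly described American Soundex rules. The function name soundex and the choice to drop non-letter characters are assumptions made for illustration:

```python
def soundex(word):
    """Return a four-character American Soundex code, or "" for no letters."""
    codes = {}
    for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                           ("l", "4"), ("mn", "5"), ("r", "6")):
        for ch in letters:
            codes[ch] = digit

    chars = [ch for ch in word.lower() if ch.isalpha()]
    if not chars:
        return ""

    result = chars[0].upper()
    previous = codes.get(chars[0], "")
    for ch in chars[1:]:
        if ch in "hw":
            continue  # h and w do not separate consonants with the same code
        code = codes.get(ch, "")
        if code and code != previous:
            result += code
        previous = code  # a vowel resets this, so a repeated code counts again
    return (result + "000")[:4]  # pad or truncate to letter + three digits
```

With this sketch, "stack" and "staq" both map to S320 while "staw" maps to S300, which matches the intuition in the example above. Comparing whole phrases would require encoding each word separately.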

How can I calculate similarity between two strings in C#?

Try something like this:

double d = (LevenshteinDist(s, t) + LevenshteinDist(s.ToLower(), t.ToLower())) /
(2.0 * Math.Max(s.Length, t.Length));

If you want case differences to count less than letter differences, you can give the two terms different weights:

double d = (0.15*LevenshteinDist(s, t) + 
0.35*LevenshteinDist(s.ToLower(), t.ToLower())) /
Math.Max(s.Length, t.Length);

Note that the weights sum to 0.5, which makes the division by 2.0 unnecessary.
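
To see how the weighting behaves, here is a small Python sketch of the same formula. The name case_aware_distance, the parameter names, and the compact levenshtein helper are assumptions for illustration; the weights are the ones from the C# snippet:

```python
def levenshtein(a, b):
    # Compact two-row edit distance, used only to make the sketch runnable.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return previous[-1]


def case_aware_distance(s, t, exact_weight=0.15, folded_weight=0.35):
    longest = max(len(s), len(t))
    if longest == 0:
        return 0.0  # both strings empty: treat as no difference
    exact = levenshtein(s, t)                   # counts case and letter changes
    folded = levenshtein(s.lower(), t.lower())  # counts letter changes only
    return (exact_weight * exact + folded_weight * folded) / longest
```

Under these weights "Apple" versus "apple" scores 0.03, while "Apple" versus "Apply" scores 0.1, so a case-only change costs less than a letter change. The result ranges from 0 to 0.5 rather than 0 to 1, because the weights sum to 0.5.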

What is an efficient way to measure similarity between two strings? (Levenshtein Distance makes stack too deep)

Consider a non-recursive version to avoid the deep call stack. Seth Schroeder has an iterative implementation in Ruby that fills in a two-dimensional table instead (in the code below, a hash keyed by [row, column] pairs); it appears to follow the dynamic programming approach for Levenshtein distance outlined in the Wikipedia article's pseudocode. Seth's Ruby code is reproduced below:

def levenshtein(s1, s2)
  d = {}
  (0..s1.size).each do |row|
    d[[row, 0]] = row
  end
  (0..s2.size).each do |col|
    d[[0, col]] = col
  end
  (1..s1.size).each do |i|
    (1..s2.size).each do |j|
      cost = 0
      if (s1[i-1] != s2[j-1])
        cost = 1
      end
      d[[i, j]] = [d[[i - 1, j]] + 1,
                   d[[i, j - 1]] + 1,
                   d[[i - 1, j - 1]] + cost
                  ].min
      # @@damerau is a class variable that the enclosing class must define;
      # when it is true, adjacent transpositions are also considered.
      next unless @@damerau
      if (i > 1 and j > 1 and s1[i-1] == s2[j-2] and s1[i-2] == s2[j-1])
        d[[i, j]] = [d[[i,j]],
                     d[[i-2, j-2]] + cost
                    ].min
      end
    end
  end
  return d[[s1.size, s2.size]]
end

