Comparing Strings with Tolerance

Comparing strings with tolerance

You could use the Levenshtein Distance algorithm.

"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com

This one is from dotnetperls.com:

using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}

You may in fact prefer to use the Damerau-Levenshtein distance algorithm, which also allows characters to be transposed, which is a common human error in data entry. You'll find a C# implementation of it here.

Smart string comparison

Levenshtein is not appropriate in this case. "Good Company Ltd" and "GoodCompany" if trimmed have a distance = 3 while "Good Company Ltd" and "Food Company Ltd" have a distance of 1, but totally a different meaning. I suggest Metaphone or Double Metaphone algorithm.

Using online metaphone comparer the results are:

Good Company Ltd = KTKMPNLTT
GoodCompany = KTKMPN
Food Company Ltd = FTKMPNLTT
GoodCompanyLLC = KTKMPNLK

In this way you know that GoodCompany, Good Company Ltd and GoodCompanyLLC are similar, while Food Company is misspelled or totally not related (KTKMPN is contained both in KTKMPNLTT and KTKMPNLK but not in FTKMPNLTT).

Look here for other algorithms comparisons.

Compare string similarity

static class LevenshteinDistance
{
    public static int Compute(string s, string t)
    {
        if (string.IsNullOrEmpty(s))
        {
            if (string.IsNullOrEmpty(t))
                return 0;
            return t.Length;
        }

        if (string.IsNullOrEmpty(t))
        {
            return s.Length;
        }

        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // initialize the top and right of the table to 0, 1, 2, ...
        for (int i = 0; i <= n; d[i, 0] = i++);
        for (int j = 1; j <= m; d[0, j] = j++);

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                int min1 = d[i - 1, j] + 1;
                int min2 = d[i, j - 1] + 1;
                int min3 = d[i - 1, j - 1] + cost;
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }
        return d[n, m];
    }
}

efficient string comparison in python including numerical evaluation

If you can store the elements in file1list and file2list as numpy arrays, your comparison becomes a good bit more simple (and probably faster too):

for entry1 in file1list:
    for entry2 in file2list:
        if(np.all(np.abs(entry1[1:7] - entry2[1:7]) > tol)):
            print entry1,entry2

Converting to numpy arrays is painless:

file1list = [ np.array(entry) for entry in file1list ]

This gets even a little bit better for a sorted file2list

file2list=sorted(file2list,key=operator.itemgetter(1,2,3,4,5,6))
for entry1 in file1list:
    for entry2 in file2list:
        d=np.abs(entry1[1:7]-entry2[1:7])
        if(np.all(d < tol)):
            print entry1,entry2
        elif(d[0] > tol):
            break

Similarity String Comparison in Java

Yes, there are many well documented algorithms like:

Cosine similarity
Jaccard similarity
Dice's coefficient
Matching similarity
Overlap similarity
etc etc

A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)

Also check these projects:

Simmetrics
jtmt

How much two strings are similar?(90%,100%,40%)

For 'short' string differences the algorithm you are searching for is called:

Levenshtein distance

http://en.wikipedia.org/wiki/Levenshtein_distance

For seeking differences in sentences you may wish to check for algorithms that solve the 'longest common sequence' problem.

One tool that does that is the (originally unix) 'diff'

Compare 3D Coordinates with Tolerance

First you will need a function to compare distances. I'm going to use Vector3 instead of a value tuple since it is easier to write:

public bool IsAlmostEqual(Vector3 a, Vector3 b, float epsilon){
    return DistanceSquared(a, b) < (epsilon * epsilon)
}
public double DistanceSquared(Vector3 a, Vector3 b){
    var c = a - b;
    return c.x * c.x + c.y * c.y + c.z * c.z ;
}

If you just want to check the corresponding index for each coordinate you would simply write a loop:

for(var i = 0; i < coords.Count; i++){
    if(IsAlmostEqual(coords[i], reference[i], 0.1){
        ...
    }
    else{
        ...
    }

If you want to check each combination of coords and reference position you simply add another loop. This will not scale well, but 1000 items would result in only about 500000 iterations of the inner loop, and I would expect that to take about a millisecond.

If you need to process larger sets of coordinates you should look into some kind of search structure, like a KD-tree.

Comparing Strings with Tolerance