Comparing strings with tolerance
You could use the Levenshtein Distance algorithm.
"The Levenshtein distance between two strings is defined as the minimum number of edits needed to transform one string into the other, with the allowable edit operations being insertion, deletion, or substitution of a single character." - Wikipedia.com
This one is from dotnetperls.com:
using System;

/// <summary>
/// Contains approximate string matching
/// </summary>
static class LevenshteinDistance
{
    /// <summary>
    /// Compute the distance between two strings.
    /// </summary>
    public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }
        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }
        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            // Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }

        // Step 7
        return d[n, m];
    }
}

class Program
{
    static void Main()
    {
        Console.WriteLine(LevenshteinDistance.Compute("aunt", "ant"));
        Console.WriteLine(LevenshteinDistance.Compute("Sam", "Samantha"));
        Console.WriteLine(LevenshteinDistance.Compute("flomax", "volmax"));
    }
}
You may in fact prefer the Damerau-Levenshtein distance algorithm, which also counts a transposition of two adjacent characters as a single edit. Transpositions are a common human error in data entry. You'll find a C# implementation of it here.
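To make the transposition idea concrete, here is a minimal Python sketch of the restricted variant of Damerau-Levenshtein, usually called optimal string alignment. It is not the C# implementation linked above, and the function name `osa_distance` is only an illustrative choice.

```python
def osa_distance(s: str, t: str) -> int:
    """Optimal string alignment distance (restricted Damerau-Levenshtein)."""
    n, m = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and the first j chars of t
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i
    for j in range(m + 1):
        d[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # swap of two adjacent characters counts as one edit
            if i > 1 and j > 1 and s[i - 1] == t[j - 2] and s[i - 2] == t[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[n][m]
```

With this variant "flomax" and "volmax" are 2 edits apart (one substitution plus one swap of "lo" and "ol"), where plain Levenshtein counts 3.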
Smart string comparison
Levenshtein is not appropriate in this case. "Good Company Ltd" and "GoodCompany", once the spaces are stripped, have a distance of 3, while "Good Company Ltd" and "Food Company Ltd" have a distance of 1 but a completely different meaning. I suggest the Metaphone or Double Metaphone algorithm.
Using an online Metaphone comparer, the results are:
Good Company Ltd = KTKMPNLTT
GoodCompany = KTKMPN
Food Company Ltd = FTKMPNLTT
GoodCompanyLLC = KTKMPNLK
In this way you know that GoodCompany, Good Company Ltd and GoodCompanyLLC are similar, while Food Company is misspelled or totally not related (KTKMPN is contained both in KTKMPNLTT and KTKMPNLK but not in FTKMPNLTT).
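The containment check described above can be sketched in Python. The codes below are copied from the results listed above rather than computed, and the "one code contains the other" rule and the helper names are assumptions made for illustration only.

```python
# Metaphone codes copied from the results above (not computed here).
codes = {
    "Good Company Ltd": "KTKMPNLTT",
    "GoodCompany": "KTKMPN",
    "Food Company Ltd": "FTKMPNLTT",
    "GoodCompanyLLC": "KTKMPNLK",
}

def codes_match(a: str, b: str) -> bool:
    """Hypothetical rule: two names match when one code contains the other."""
    return a in b or b in a

def similar_names(name: str) -> list:
    """Names whose code contains, or is contained in, the code of `name`."""
    target = codes[name]
    return [other for other, code in codes.items()
            if other != name and codes_match(target, code)]

print(similar_names("GoodCompany"))
print(similar_names("Food Company Ltd"))
```

Note that under this rule the Ltd and LLC variants are linked only through the shorter GoodCompany code, since neither of their codes contains the other.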
Look here for comparisons of other algorithms.
Compare string similarity
using System;

static class LevenshteinDistance
{
    public static int Compute(string s, string t)
    {
        // Null or empty input: the distance is the length of the other string.
        if (string.IsNullOrEmpty(s))
        {
            if (string.IsNullOrEmpty(t))
                return 0;
            return t.Length;
        }
        if (string.IsNullOrEmpty(t))
        {
            return s.Length;
        }

        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // initialize the first column and the first row of the table to 0, 1, 2, ...
        for (int i = 0; i <= n; d[i, 0] = i++);
        for (int j = 1; j <= m; d[0, j] = j++);

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                int min1 = d[i - 1, j] + 1;        // deletion
                int min2 = d[i, j - 1] + 1;        // insertion
                int min3 = d[i - 1, j - 1] + cost; // substitution
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }
        return d[n, m];
    }
}
Efficient string comparison in Python including numerical evaluation
If you can store the elements in file1list and file2list as numpy arrays, your comparison becomes a good bit simpler (and probably faster too):
import numpy as np

for entry1 in file1list:
    for entry2 in file2list:
        # match when every one of fields 1..6 is within tol
        if np.all(np.abs(entry1[1:7] - entry2[1:7]) < tol):
            print(entry1, entry2)
Converting to numpy arrays is painless:
file1list = [ np.array(entry) for entry in file1list ]
This gets even a little bit better for a sorted file2list:
import operator

file2list = sorted(file2list, key=operator.itemgetter(1, 2, 3, 4, 5, 6))
for entry1 in file1list:
    for entry2 in file2list:
        d = np.abs(entry1[1:7] - entry2[1:7])
        if np.all(d < tol):
            print(entry1, entry2)
        elif entry2[1] - entry1[1] > tol:
            # file2list is sorted on field 1, so no later entry can match
            break
Similarity String Comparison in Java
Yes, there are many well documented algorithms like:
- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- etc.
A good summary ("Sam's String Metrics") can be found here (original link dead, so it links to Internet Archive)
Also check these projects:
- Simmetrics
- jtmt
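As a rough illustration of a few of the measures listed above, here is a Python sketch that compares strings by their character bigrams. Using bigrams as the tokens is an assumption (word tokens work the same way), and the function names are illustrative rather than taken from any of the projects mentioned.

```python
import math
from collections import Counter

def bigrams(text: str) -> list:
    """Lower-cased character bigrams, e.g. 'night' -> ni, ig, gh, ht."""
    text = text.lower()
    return [text[i:i + 2] for i in range(len(text) - 1)]

def jaccard(a: str, b: str) -> float:
    x, y = set(bigrams(a)), set(bigrams(b))
    return len(x & y) / len(x | y) if (x | y) else 1.0

def dice(a: str, b: str) -> float:
    x, y = set(bigrams(a)), set(bigrams(b))
    return 2 * len(x & y) / (len(x) + len(y)) if (x or y) else 1.0

def overlap(a: str, b: str) -> float:
    x, y = set(bigrams(a)), set(bigrams(b))
    return len(x & y) / min(len(x), len(y)) if (x and y) else 0.0

def cosine(a: str, b: str) -> float:
    # Bigram counts are treated as sparse vectors.
    x, y = Counter(bigrams(a)), Counter(bigrams(b))
    dot = sum(x[g] * y[g] for g in x)
    norm = (math.sqrt(sum(v * v for v in x.values()))
            * math.sqrt(sum(v * v for v in y.values())))
    return dot / norm if norm else 0.0
```

For "night" and "nacht" the only shared bigram is "ht", so Jaccard is 1/7 while Dice, overlap and cosine all come out at 0.25.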
How much two strings are similar?(90%,100%,40%)
For 'short' string differences the algorithm you are searching for is called:
Levenshtein distance
http://en.wikipedia.org/wiki/Levenshtein_distance
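One way to turn the distance into a percentage is to divide it by the length of the longer string. The Python sketch below does that; the normalisation is a common convention rather than part of the Levenshtein definition, and `similarity_percent` is a hypothetical helper name.

```python
def levenshtein(s: str, t: str) -> int:
    """Levenshtein distance, keeping only two rows of the table in memory."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def similarity_percent(s: str, t: str) -> float:
    """100 for identical strings, 0 when every character has to change."""
    longest = max(len(s), len(t))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(s, t) / longest)

print(similarity_percent("aunt", "ant"))
```

"aunt" and "ant" are one edit apart over four characters, which gives 75%.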
For finding differences between sentences you may wish to look at algorithms that solve the 'longest common subsequence' problem.
One tool that does that is the (originally Unix) 'diff'.
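For sentences, a minimal Python sketch of the longest common subsequence approach could compare word by word. Splitting on whitespace and dividing by the longer word count are assumptions chosen for illustration, and this is not how 'diff' itself is implemented.

```python
def lcs_length(a, b) -> int:
    """Length of the longest common subsequence of two sequences."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, 1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def sentence_similarity(s: str, t: str) -> float:
    """Share of words (in order) that the two sentences have in common."""
    a, b = s.split(), t.split()
    longest = max(len(a), len(b))
    return lcs_length(a, b) / longest if longest else 1.0

print(sentence_similarity("the quick brown fox", "the slow brown fox"))
```

The two example sentences share three of four words in the same order, so the score is 0.75.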
Compare 3D Coordinates with Tolerance
First you will need a function to compare distances. I'm going to use Vector3 instead of a value tuple since it is easier to write:
public bool IsAlmostEqual(Vector3 a, Vector3 b, float epsilon){
    // compare squared distances to avoid taking a square root
    return DistanceSquared(a, b) < (epsilon * epsilon);
}
public double DistanceSquared(Vector3 a, Vector3 b){
    var c = a - b;
    return c.x * c.x + c.y * c.y + c.z * c.z;
}
If you just want to check the corresponding index for each coordinate you would simply write a loop:
for(var i = 0; i < coords.Count; i++){
    if(IsAlmostEqual(coords[i], reference[i], 0.1f)){
        ...
    }
    else{
        ...
    }
}
If you want to check each combination of coords and reference position you simply add another loop. This will not scale well, but 1000 items would result in only about 500000 iterations of the inner loop, and I would expect that to take about a millisecond.
If you need to process larger sets of coordinates you should look into some kind of search structure, like a KD-tree.
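As a rough idea of what such a search structure buys you, here is a Python sketch that uses a uniform grid (a spatial hash) instead of a KD-tree, since it is shorter to write. Points are plain (x, y, z) tuples, and `build_grid` and `neighbours_within` are hypothetical names; the grid must be built with a cell size equal to the tolerance.

```python
import math
from collections import defaultdict
from itertools import product

def build_grid(points, cell):
    """Map each grid cell (ix, iy, iz) to the indices of the points inside it."""
    grid = defaultdict(list)
    for idx, p in enumerate(points):
        key = tuple(math.floor(c / cell) for c in p)
        grid[key].append(idx)
    return grid

def neighbours_within(grid, points, query, epsilon):
    """Indices of points closer than epsilon to query.

    Assumes the grid was built with cell == epsilon, so every candidate
    lies in the query's cell or one of its 26 neighbouring cells.
    """
    cx, cy, cz = (math.floor(c / epsilon) for c in query)
    eps2 = epsilon * epsilon
    hits = []
    for dx, dy, dz in product((-1, 0, 1), repeat=3):
        for idx in grid.get((cx + dx, cy + dy, cz + dz), ()):
            dist2 = sum((a - b) ** 2 for a, b in zip(points[idx], query))
            if dist2 < eps2:
                hits.append(idx)
    return hits

reference = [(0.0, 0.0, 0.0), (0.05, 0.0, 0.0), (1.0, 1.0, 1.0)]
grid = build_grid(reference, 0.1)
print(neighbours_within(grid, reference, (0.0, 0.0, 0.0), 0.1))
```

Each query then only looks at the points in 27 cells rather than the whole reference set, which is what makes larger inputs tractable.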