Are there any Fuzzy Search or String Similarity Functions libraries written for C#?
Levenshtein distance implementation:
- Using LINQ (not really, see comments)
- Not using LINQ
I have a .NET 1.1 project in which I use the latter. It's simplistic, but works perfectly for what I need. From what I remember it needed a bit of tweaking, but nothing that wasn't obvious.
Fuzzy match in C#
Current versions of .NET don't have fuzzy matching built in.
I have seen and used Soundex (a phonetic algorithm that can be used for fuzzy matching) for this in the past. Here's an article on how to implement Soundex in .NET.
http://www.codeproject.com/KB/aspnet/Soundex.aspx
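To give a rough idea of what a Soundex implementation involves, here is a minimal sketch of the classic American Soundex rules. It is not the code from the linked article: the class name SoundexSketch and the method name Encode are made up for illustration, and it assumes only the letters A-Z matter (everything else is skipped).

```csharp
using System;
using System.Text;

static class SoundexSketch
{
    // Soundex digit for each letter a..z; '0' marks letters that carry
    // no code of their own (vowels, h, w, y).
    private const string Codes = "01230120022455012623010202";

    public static string Encode(string word)
    {
        if (string.IsNullOrEmpty(word)) return string.Empty;

        var result = new StringBuilder(4);
        char previousCode = '0';

        foreach (char raw in word)
        {
            char c = char.ToLowerInvariant(raw);
            if (c < 'a' || c > 'z') continue;   // assumption: ignore non A-Z characters

            char code = Codes[c - 'a'];

            if (result.Length == 0)
            {
                // The first letter is kept as a letter, but its code still
                // suppresses an identical code that follows immediately.
                result.Append(char.ToUpperInvariant(c));
                previousCode = code;
                continue;
            }

            if (c == 'h' || c == 'w') continue; // h and w do not separate equal codes

            if (code != '0' && code != previousCode)
            {
                result.Append(code);
                if (result.Length == 4) break;
            }
            previousCode = code;                // a vowel resets this to '0'
        }

        if (result.Length == 0) return string.Empty;
        return result.ToString().PadRight(4, '0');
    }
}
```

Two names are then treated as a fuzzy match when their codes are equal, e.g. SoundexSketch.Encode("Robert") == SoundexSketch.Encode("Rupert"). Soundex was designed for English surnames, so expect it to be too coarse for general text.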
Compare string similarity
using System;

static class LevenshteinDistance
{
    // Returns the minimum number of single-character insertions, deletions
    // and substitutions needed to turn s into t.
    public static int Compute(string s, string t)
    {
        if (string.IsNullOrEmpty(s))
        {
            return string.IsNullOrEmpty(t) ? 0 : t.Length;
        }
        if (string.IsNullOrEmpty(t))
        {
            return s.Length;
        }

        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // initialize the first column and the first row of the table to 0, 1, 2, ...
        for (int i = 0; i <= n; i++) d[i, 0] = i;
        for (int j = 1; j <= m; j++) d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;   // substitution cost
                int min1 = d[i - 1, j] + 1;                  // deletion
                int min2 = d[i, j - 1] + 1;                  // insertion
                int min3 = d[i - 1, j - 1] + cost;           // match or substitution
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }
        return d[n, m];
    }
}
Fuzzy Text Matching C#
Let me introduce you to the Levenshtein distance formula. It is awesome:
http://en.wikipedia.org/wiki/Levenshtein_distance
In information theory and computer science, the Levenshtein distance is a string metric for measuring the amount of difference between two sequences. The term edit distance is often used to refer specifically to Levenshtein distance.
Personally I used this in a healthcare setting, where Provider names were checked for duplicates. Using the Levenshtein distance, we gave each potential match a confidence rating and let the users decide whether it was a true duplicate or something unique.
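The answer doesn't say how that confidence rating was calculated. One common way to get one, shown here purely as an assumption, is to divide the distance by the length of the longer string and subtract from 1. The names SimilaritySketch, Distance and Confidence below are hypothetical, and the distance function is a compact two-row variant of Levenshtein so the sketch stands on its own.

```csharp
using System;

static class SimilaritySketch
{
    // Levenshtein distance using two rolling rows instead of a full table.
    public static int Distance(string s, string t)
    {
        s = s ?? string.Empty;
        t = t ?? string.Empty;

        int[] previous = new int[t.Length + 1];
        int[] current = new int[t.Length + 1];
        for (int j = 0; j <= t.Length; j++) previous[j] = j;

        for (int i = 1; i <= s.Length; i++)
        {
            current[0] = i;
            for (int j = 1; j <= t.Length; j++)
            {
                int cost = s[i - 1] == t[j - 1] ? 0 : 1;
                current[j] = Math.Min(
                    Math.Min(previous[j] + 1, current[j - 1] + 1),
                    previous[j - 1] + cost);
            }
            // The row just filled becomes the "previous" row for the next pass.
            int[] swap = previous;
            previous = current;
            current = swap;
        }
        return previous[t.Length];
    }

    // Assumed scoring: 1.0 for identical strings, 0.0 when every position differs.
    public static double Confidence(string s, string t)
    {
        s = s ?? string.Empty;
        t = t ?? string.Empty;

        int longest = Math.Max(s.Length, t.Length);
        if (longest == 0) return 1.0;   // two empty strings are identical
        return 1.0 - (double)Distance(s, t) / longest;
    }
}
```

For example, "kitten" and "sitting" are 3 edits apart and the longer string has 7 characters, so the score is 1 - 3/7, roughly 0.57. Where to put the cut-off between "probably a duplicate" and "probably unique" depends on the data and has to be tuned.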
How can I check the input if it's nearly same or not?
Google shows me this
Approximate string matching
There are various string distance metrics you could use.
I would recommend Jaro-Winkler (JW). Unlike edit distance, where the result of a comparison is a discrete number of edits, JW gives you a score between 0 and 1. It is especially suited for proper names. Also look at this nice tutorial and this SO question.
I haven't worked with C# but here are some implementations of JW I found online:
Impl 1 (there is a .NET version too if you look at the file list)
Impl 2
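Since the question is about C#, here is a minimal sketch of Jaro-Winkler in C#. It is not taken from either implementation above. The class and method names are made up, the inputs are assumed to be non-null, and it uses the commonly quoted defaults (prefix scale 0.1, at most 4 prefix characters), which you may want to change.

```csharp
using System;

static class JaroWinklerSketch
{
    public static double Jaro(string s, string t)
    {
        if (s.Length == 0 && t.Length == 0) return 1.0;
        if (s.Length == 0 || t.Length == 0) return 0.0;

        // Characters only count as matching if they are at most this far apart.
        int window = Math.Max(0, Math.Max(s.Length, t.Length) / 2 - 1);
        bool[] sMatched = new bool[s.Length];
        bool[] tMatched = new bool[t.Length];
        int matches = 0;

        for (int i = 0; i < s.Length; i++)
        {
            int lo = Math.Max(0, i - window);
            int hi = Math.Min(t.Length - 1, i + window);
            for (int j = lo; j <= hi; j++)
            {
                if (tMatched[j] || s[i] != t[j]) continue;
                sMatched[i] = true;
                tMatched[j] = true;
                matches++;
                break;
            }
        }
        if (matches == 0) return 0.0;

        // Walk both strings over their matched characters and count the
        // positions where the matched characters disagree.
        int outOfOrder = 0;
        int k = 0;
        for (int i = 0; i < s.Length; i++)
        {
            if (!sMatched[i]) continue;
            while (!tMatched[k]) k++;
            if (s[i] != t[k]) outOfOrder++;
            k++;
        }

        double m = matches;
        double transpositions = outOfOrder / 2;   // each swap is counted twice above
        return (m / s.Length + m / t.Length + (m - transpositions) / m) / 3.0;
    }

    // Boosts the Jaro score for strings that share a common prefix.
    public static double Similarity(string s, string t, double prefixScale = 0.1)
    {
        double jaro = Jaro(s, t);
        int limit = Math.Min(4, Math.Min(s.Length, t.Length));
        int prefix = 0;
        while (prefix < limit && s[prefix] == t[prefix]) prefix++;
        return jaro + prefix * prefixScale * (1.0 - jaro);
    }
}
```

With the usual textbook pair, JaroWinklerSketch.Similarity("MARTHA", "MARHTA") comes out at about 0.961. The comparison is case-sensitive, so lowercase both inputs first if case shouldn't matter.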
If you want to do more sophisticated matching, you can add some custom normalization of word forms that commonly occur in company names, such as ltd/limited, inc/incorporated and corp/corporation, to account for case differences, abbreviations, etc. This way if you compute
    distance(normalize("foo corp."),
             normalize("FOO CORPORATION"))
you should get 0 rather than 14 (which is what you would get from a case-sensitive Levenshtein edit distance on the raw strings).
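A normalize step along those lines could look like the sketch below. The class name, the abbreviation table and the exact clean-up rules are all assumptions for illustration; a real list of abbreviations would have to come from your own data.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class CompanyNameNormalizer
{
    // Hypothetical abbreviation table; extend it for the names you actually see.
    private static readonly Dictionary<string, string> Expansions =
        new Dictionary<string, string>
        {
            { "ltd", "limited" },
            { "inc", "incorporated" },
            { "corp", "corporation" },
        };

    public static string Normalize(string name)
    {
        if (string.IsNullOrWhiteSpace(name)) return string.Empty;

        // Lowercase, then turn punctuation into spaces so "corp." becomes "corp".
        char[] cleaned = name.ToLowerInvariant()
            .Select(c => char.IsLetterOrDigit(c) ? c : ' ')
            .ToArray();

        // Split into words, expand known abbreviations, and rejoin with
        // single spaces so extra whitespace does not count as a difference.
        IEnumerable<string> words = new string(cleaned)
            .Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries)
            .Select(w => Expansions.TryGetValue(w, out string full) ? full : w);

        return string.Join(" ", words);
    }
}
```

Both "foo corp." and "FOO CORPORATION" normalize to "foo corporation", so any distance function applied afterwards sees identical strings. Names that still differ after normalization can then be compared with Levenshtein or Jaro-Winkler as usual.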