Similarity String Comparison in Java
Yes, there are many well-documented algorithms, such as:
- Cosine similarity
- Jaccard similarity
- Dice's coefficient
- Matching similarity
- Overlap similarity
- and others
A good summary is "Sam's String Metrics" (the original page is dead, so the link points to an Internet Archive copy).
Also check these projects:
- Simmetrics
- jtmt
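As a rough illustration of the set-based measures listed above, here is a minimal Python sketch. It assumes strings are tokenized into character bigrams, which is only one common choice; the helper names (bigrams, jaccard, dice, overlap, cosine) are made up for this example and are not taken from Simmetrics or jtmt.

```python
from collections import Counter
from math import sqrt


def bigrams(s):
    """Return the set of character bigrams of s (assumed tokenization)."""
    return {s[i:i + 2] for i in range(len(s) - 1)}


def jaccard(a, b):
    """|X & Y| / |X | Y| over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return len(x & y) / len(x | y)


def dice(a, b):
    """2 * |X & Y| / (|X| + |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x and not y:
        return 1.0
    return 2 * len(x & y) / (len(x) + len(y))


def overlap(a, b):
    """|X & Y| / min(|X|, |Y|) over bigram sets."""
    x, y = bigrams(a), bigrams(b)
    if not x or not y:
        return 1.0 if x == y else 0.0
    return len(x & y) / min(len(x), len(y))


def cosine(a, b):
    """Cosine of the angle between bigram count vectors."""
    x = Counter(a[i:i + 2] for i in range(len(a) - 1))
    y = Counter(b[i:i + 2] for i in range(len(b) - 1))
    dot = sum(x[g] * y[g] for g in x.keys() & y.keys())
    norm = sqrt(sum(v * v for v in x.values())) * sqrt(sum(v * v for v in y.values()))
    return dot / norm if norm else 0.0


# "night" and "nacht" share exactly one bigram ("ht") out of four each.
print(jaccard("night", "nacht"), dice("night", "nacht"))
```

All four functions return values in [0, 1], with higher meaning more similar. Other tokenizations (words, trigrams) would give different scores.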
Finding the most similar string among a set of millions of strings
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, or the BK-tree. These trees use the triangle inequality to prune branches that cannot contain a close enough match, which speeds up search.
If you're willing to accept an approximate solution, there are much faster algorithms. A leading approximate method is Hierarchical Navigable Small World (HNSW). The Non-Metric Space Library (nmslib) provides an efficient implementation of HNSW as well as several other approximate NNS methods.
(If Levenshtein distance is your metric, Hirschberg's algorithm computes the optimal alignment, and hence the distance, in linear space.)
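To make the metric-tree idea concrete, here is a minimal sketch of a BK-tree in Python, assuming Levenshtein distance as the metric. The class name BKTree and its methods are hypothetical and written for this example; a tuned implementation for millions of strings would need more care.

```python
def levenshtein(s, t):
    """Edit distance via the two-row dynamic program."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        curr = [i]
        for j, ct in enumerate(t, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (cs != ct)))  # substitution
        prev = curr
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree; each node is (word, {distance: child})."""

    def __init__(self, distance):
        self.distance = distance
        self.root = None

    def add(self, word):
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = self.distance(word, node[0])
            if d == 0:
                return  # word is already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return sorted (distance, word) pairs within max_dist of query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = self.distance(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: a match can only sit under an edge
            # whose label lies in [d - max_dist, d + max_dist].
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


tree = BKTree(levenshtein)
for w in ["book", "books", "cake", "boo", "cape", "cart", "boon", "cook"]:
    tree.add(w)
print(tree.search("bok", 1))  # [(1, 'boo'), (1, 'book')]
```

The search is exact: it returns every stored string within max_dist, but skips subtrees that the triangle inequality rules out.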
Find the similarity metric between two strings
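There is a built-in: SequenceMatcher, in the standard library's difflib module.
from difflib import SequenceMatcher

def similar(a, b):
    # ratio() returns a float in [0, 1]; 1.0 means the sequences are identical
    return SequenceMatcher(None, a, b).ratio()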
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
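To see where those numbers come from: the difflib documentation defines ratio() as 2.0*M / T, where M is the number of matched elements and T is the total length of both sequences. The sketch below recomputes it from get_matching_blocks(); the function name ratio_by_hand is made up for this illustration.

```python
from difflib import SequenceMatcher


def ratio_by_hand(a, b):
    """Recompute SequenceMatcher.ratio() as 2*M/T from the matching blocks."""
    matcher = SequenceMatcher(None, a, b)
    # Each block is a (a, b, size) triple; the final sentinel block has size 0.
    matches = sum(block.size for block in matcher.get_matching_blocks())
    total = len(a) + len(b)
    return 2.0 * matches / total if total else 1.0


# "Apple" vs "Appel": the blocks "App" and "l" match, so M = 4 and T = 10.
print(ratio_by_hand("Apple", "Appel"))  # 0.8
```

Note that the comparison is case-sensitive, which is why "Apple" and "Mango" score 0.0 even though one contains "A" and the other "a".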
What string similarity algorithms are there?
It seems you need some kind of fuzzy matching. A Java implementation of a set of similarity metrics is available at http://www.dcs.shef.ac.uk/~sam/stringmetrics.html, and a more detailed explanation of string metrics is at http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf. Which metric to choose depends on how fuzzy and how fast your implementation must be.
How to compare almost similar Strings in Java? (String distance measure)
The Levenshtein distance is a measure of how similar two strings are. More precisely, it is the minimum number of single-character insertions, deletions, or substitutions needed to turn one string into the other.
The algorithm is available in pseudo-code on Wikipedia. Converting that to Java shouldn't be much of a problem, but it is not built into the Java standard library.
Wikipedia also lists other algorithms that measure string similarity.
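For reference, the full-table dynamic program can be sketched as follows. This is written in Python rather than Java to match the other sketches on this page, and the similarity() normalization at the end is an assumption of this example, not part of the Levenshtein definition.

```python
def levenshtein(s, t):
    """Full-table edit distance between strings s and t."""
    n, m = len(s), len(t)
    # d[i][j] = distance between the first i chars of s and the first j chars of t
    d = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        d[i][0] = i  # delete all i characters
    for j in range(m + 1):
        d[0][j] = j  # insert all j characters
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[n][m]


def similarity(s, t):
    """Assumed normalization: 1.0 for identical strings, lower for more edits."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


print(levenshtein("kitten", "sitting"))  # 3
```

Dividing by the longer length keeps the score in [0, 1], since the distance can never exceed the length of the longer string.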
Compare string similarity
using System;

static class LevenshteinDistance
{
    // Returns the minimum number of single-character edits needed to turn s into t.
    public static int Compute(string s, string t)
    {
        if (string.IsNullOrEmpty(s))
        {
            if (string.IsNullOrEmpty(t))
                return 0;
            return t.Length;
        }

        if (string.IsNullOrEmpty(t))
        {
            return s.Length;
        }

        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // initialize the first column and first row of the table to 0, 1, 2, ...
        for (int i = 0; i <= n; i++)
            d[i, 0] = i;
        for (int j = 0; j <= m; j++)
            d[0, j] = j;

        for (int i = 1; i <= n; i++)
        {
            for (int j = 1; j <= m; j++)
            {
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;
                int min1 = d[i - 1, j] + 1;        // deletion
                int min2 = d[i, j - 1] + 1;        // insertion
                int min3 = d[i - 1, j - 1] + cost; // substitution or match
                d[i, j] = Math.Min(Math.Min(min1, min2), min3);
            }
        }

        return d[n, m];
    }
}