What string similarity algorithms are there?
It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.
What are some algorithms for comparing how similar two strings are?
What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:
- Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
- Hamming Distance : The number of characters that are different in two equal length strings.
- Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
- Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.
Have a look at these as well as others on the wiki page on the topic.
Finding the most similar string among a set of millions of strings
The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.
If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.
If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (hnsw). The Non-Metric Space Library (nmslib) provides an efficient implementation of hnsw as well as several other approximate NNS methods.
(You can compute the Levenshtein distance with Hirschberg's algorithm)
Similar String algorithm
I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But, it is worth mentioning j_random_hackers
answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein Distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.
Find the similarity metric between two strings
There is a built in.
from difflib import SequenceMatcher
def similar(a, b):
return SequenceMatcher(None, a, b).ratio()
Using it:
>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0
Finding groups of similar strings in a large set of strings
Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.
Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:
http://matpalm.com/resemblance/
Algorithm to find similar strings in a list of many strings
1st step, you must index your list with any fuzzy search indexing.
2nd, you needed iterate your list and search for neighbors by quick lookup in the pre-indexed list.
About fuzzy indexing:
Approx 15 years ago I wrote fuzzy search, which can found N closes neighbors. This is my modification of Wilbur's trigram algorithm, and this modification is named "Wilbur-Khovayko algorithm".
Basic idea: To split strings by trigrams, and search maximal intersection scores.
For example, lets we have string "hello world". This string is generates trigrams: hel ell llo "lo ", "o_w", and so on; Also, produces special prefix/suffix trigrams for each word, like $he $wo lo$ ld$.
Thereafter, for each trigram built index, in which term it is present.
So, this is list of term_ID for each trigram.
When user invoke some string - it also splits to trigrams, and program search maximal intersection score, and generates N-size list.
It works quick: I remember, on old Sun/Solaris, 256MB ram, 200MHZ CPU, it search 100 closest term in dictionary 5,000,000 terms, in 0.25s
You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz
Similar String Comparison Algorithm
Sorting both, and comparing the results for equality, is not a bad approach for strings of reasonable lengths.
Another approach is to use a map/dictionary/object (depending on language) from character to number-of-occurrences. You then iterate over the first string, incrementing the counts, and iterate over the second string, decrementing them. You can return false
as soon as you get a negative number.
And if your set of possible characters is small enough to be considered constant, you can use an array as the "map", resulting in O(n) worst-case complexity.
Is this already a string similarity algorithm?
What you describe somewhat resembles what the following paper calls the Longest Common Substring (LCS). For a brief description and comparison to other algorithms:
A Comparison of Personal Name Matching
This algorithm [11] repeatedly finds and removes the longest common
sub-string in the two strings compared, up to a minimum lengths
(normally set to 2 or 3)....
A similarity measure can be calculated by
dividing the total length of the common sub-strings by the minimum,
maximum or average lengths of the two original strings (similar to
Smith-Waterman)....
this algorithm is
suitable for compound names that have words (like given- and surname)
swapped.
Related Topics
Link Errors Using <Filesystem> Members in C++17
How to Convert a Char Array to a String
Return Value of Operator Overloading in C++
Skipping Incompatible Libraries at Compile
C++: When (And How) Are C++ Global Static Constructors Called
Understanding Double Dispatch C++
Why Do Un-Named C++ Objects Destruct Before the Scope Block Ends
How to Convert a Struct Tm (Expressed in Utc) to Time_T Type
Why Do Lambda Functions in C++11 Not Have Function<> Types
X86 Mul Instruction from VS 2008/2010
C++ Inherit from Multiple Base Classes with the Same Virtual Function Name
Benefits of Header-Only Libraries
Embed Resources (Eg, Shader Code; Images) into Executable/Library with Cmake
Why Does C++ Code Missing a Formal Argument Name in a Function Definition Compile Without Warnings
How to Update a Printed Message in Terminal Without Reprinting
Reading Line from Text File and Putting the Strings into a Vector