Similar String Algorithm

What string similarity algorithms are there?

It seems you are needing some kind of fuzzy matching. Here is java implementation of some set of similarity metrics http://www.dcs.shef.ac.uk/~sam/stringmetrics.html. Here is more detailed explanation of string metrics http://www.cs.cmu.edu/~wcohen/postscript/ijcai-ws-2003.pdf it depends on how fuzzy and how fast your implementation must be.

What are some algorithms for comparing how similar two strings are?

What you're looking for are called String Metric algorithms. There a significant number of them, many with similar characteristics. Among the more popular:

  • Levenshtein Distance : The minimum number of single-character edits required to change one word into the other. Strings do not have to be the same length
  • Hamming Distance : The number of characters that are different in two equal length strings.
  • Smith–Waterman : A family of algorithms for computing variable sub-sequence similarities.
  • Sørensen–Dice Coefficient : A similarity algorithm that computes difference coefficients of adjacent character pairs.

Have a look at these as well as others on the wiki page on the topic.

Finding the most similar string among a set of millions of strings

The problem you're describing is a Nearest Neighbor Search (NNS). There are two main methods of solving NNS problems: exact and approximate.

If you need an exact solution, I would recommend a metric tree, such as the M-tree, the MVP-tree, and the BK-tree. These trees take advantage of the triangle inequality to speed up search.

If you're willing to accept an approximate solution, there are much faster algorithms. The current state of the art for approximate methods is Hierarchical Navigable Small World (hnsw). The Non-Metric Space Library (nmslib) provides an efficient implementation of hnsw as well as several other approximate NNS methods.

(You can compute the Levenshtein distance with Hirschberg's algorithm)

Similar String algorithm

I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But, it is worth mentioning j_random_hackers answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein Distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.

Find the similarity metric between two strings

There is a built in.

from difflib import SequenceMatcher

def similar(a, b):
return SequenceMatcher(None, a, b).ratio()

Using it:

>>> similar("Apple","Appel")
0.8
>>> similar("Apple","Mango")
0.0

Finding groups of similar strings in a large set of strings

Another popular method is to associate the strings by their Jaccard index. Start with http://en.wikipedia.org/wiki/Jaccard_index.

Here's a article about using the Jaccard-index (and a couple of other methods) to solve a problem like yours:

http://matpalm.com/resemblance/

Algorithm to find similar strings in a list of many strings

1st step, you must index your list with any fuzzy search indexing.

2nd, you needed iterate your list and search for neighbors by quick lookup in the pre-indexed list.

About fuzzy indexing:

Approx 15 years ago I wrote fuzzy search, which can found N closes neighbors. This is my modification of Wilbur's trigram algorithm, and this modification is named "Wilbur-Khovayko algorithm".

Basic idea: To split strings by trigrams, and search maximal intersection scores.

For example, lets we have string "hello world". This string is generates trigrams: hel ell llo "lo ", "o_w", and so on; Also, produces special prefix/suffix trigrams for each word, like $he $wo lo$ ld$.

Thereafter, for each trigram built index, in which term it is present.

So, this is list of term_ID for each trigram.

When user invoke some string - it also splits to trigrams, and program search maximal intersection score, and generates N-size list.

It works quick: I remember, on old Sun/Solaris, 256MB ram, 200MHZ CPU, it search 100 closest term in dictionary 5,000,000 terms, in 0.25s

You can get my old source from: http://olegh.ftp.sh/wilbur-khovayko.tgz

Similar String Comparison Algorithm

Sorting both, and comparing the results for equality, is not a bad approach for strings of reasonable lengths.

Another approach is to use a map/dictionary/object (depending on language) from character to number-of-occurrences. You then iterate over the first string, incrementing the counts, and iterate over the second string, decrementing them. You can return false as soon as you get a negative number.

And if your set of possible characters is small enough to be considered constant, you can use an array as the "map", resulting in O(n) worst-case complexity.

Is this already a string similarity algorithm?

What you describe somewhat resembles what the following paper calls the Longest Common Substring (LCS). For a brief description and comparison to other algorithms:
A Comparison of Personal Name Matching

This algorithm [11] repeatedly finds and removes the longest common
sub-string in the two strings compared, up to a minimum lengths
(normally set to 2 or 3).

...

A similarity measure can be calculated by
dividing the total length of the common sub-strings by the minimum,
maximum or average lengths of the two original strings (similar to
Smith-Waterman).

...

this algorithm is
suitable for compound names that have words (like given- and surname)
swapped.



Related Topics



Leave a reply



Submit