Levenshtein Distance Algorithm Better Than O(N*M)

Levenshtein Distance Algorithm better than O(n*m)?

Are you interested in reducing the time complexity or the space complexity? The average time complexity can be reduced to O(n + d^2), where n is the length of the longer string and d is the edit distance. If you only need the edit distance and do not need to reconstruct the edit sequence, you only have to keep the last two rows of the matrix in memory, so the space will be O(n).
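To make the two-row idea concrete, here is a minimal Python sketch. The function name `levenshtein` is just illustrative, and swapping the strings so the shorter one indexes the row is my own choice to keep memory small. It does not lower the O(n*m) running time.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance that keeps only two rows of the DP matrix in memory."""
    # Keep the shorter string along the row so each row is as small as possible.
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to the empty prefix of b
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # delete ca
                current[j - 1] + 1,            # insert cb
                previous[j - 1] + (ca != cb),  # substitute, or match for free
            ))
        previous = current
    return previous[-1]
```

For example, `levenshtein("kitten", "sitting")` evaluates to 3. Because earlier rows are discarded, the edit sequence itself cannot be recovered from this version.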

If you can afford to approximate, there are poly-logarithmic approximations.

For the O(n + d^2) algorithm, look for Ukkonen's optimization or its enhancement, Enhanced Ukkonen. The best approximation that I know of is the one by Andoni, Krauthgamer, and Onak.

Does the Levenshtein (Edit Distance) algorithm perform faster than O(n*m) in a native graph database?

Since the implementations of apoc.text.levenshteinDistance and apoc.text.levenshteinSimilarity simply rely on org.apache.commons.text.similarity.LevenshteinDistance to do the calculation, the APOC library does not introduce any complexity improvements.

In any case, such a calculation should just compare 2 strings of text and should not in any way rely on the graphical nature of the DB.

And finally, it has been proven that the worst-case complexity cannot be substantially improved: no algorithm can run in O(n^(2-ε)) time for any constant ε > 0 unless the Strong Exponential Time Hypothesis is wrong.

Most efficient way to calculate Levenshtein distance

The Wikipedia entry on Levenshtein distance has useful suggestions for optimizing the computation. The most applicable one in your case is that if you can put a bound k on the maximum distance of interest (anything beyond that might as well be infinity!), you can reduce the computation to O(n times k) instead of O(n squared), basically by giving up as soon as the minimum possible distance becomes > k.

Since you're looking for the closest match, you can progressively decrease k to the distance of the best match found so far -- this won't affect the worst case behavior (as the matches might be in decreasing order of distance, meaning you'll never bail out any sooner) but average case should improve.
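A rough Python sketch of both ideas follows: a distance computation restricted to a band of width k around the diagonal, and a search that shrinks k to the best distance found so far. The names `bounded_levenshtein` and `closest_match` are hypothetical, and returning `None` for "more than k" is just one possible convention.

```python
from typing import Iterable, Optional, Tuple

def bounded_levenshtein(a: str, b: str, k: int) -> Optional[int]:
    """Return the edit distance if it is <= k, otherwise None."""
    if k < 0:
        return None
    if len(a) < len(b):
        a, b = b, a
    m, n = len(a), len(b)
    if m - n > k:
        return None  # the length difference alone already exceeds k
    inf = k + 1      # stands for "more than k"
    previous = [min(j, inf) for j in range(n + 1)]
    current = [inf] * (n + 1)
    for i in range(1, m + 1):
        # Only cells with |i - j| <= k can hold a value <= k.
        lo, hi = max(1, i - k), min(n, i + k)
        current[lo - 1] = min(i, inf) if lo == 1 else inf
        if hi == i + k:
            previous[hi] = inf  # this cell lay outside the previous row's band
        best = current[lo - 1]
        for j in range(lo, hi + 1):
            value = min(
                previous[j] + 1,                          # deletion
                current[j - 1] + 1,                       # insertion
                previous[j - 1] + (a[i - 1] != b[j - 1]),  # substitution or match
                inf,
            )
            current[j] = value
            best = min(best, value)
        if best > k:
            return None  # every cell in the band exceeds k, so give up early
        previous, current = current, previous
    return previous[n] if previous[n] <= k else None

def closest_match(query: str, candidates: Iterable[str]) -> Tuple[Optional[str], Optional[int]]:
    """Return (best candidate, its distance), tightening the bound as we go."""
    best_word, best_distance = None, None
    for word in candidates:
        if best_distance is None:
            k = max(len(query), len(word))  # no bound yet: largest possible distance
        else:
            k = best_distance - 1           # only a strictly better match matters
        d = bounded_levenshtein(query, word, k)
        if d is not None:
            best_word, best_distance = word, d
            if d == 0:
                break  # cannot do better than an exact match
    return best_word, best_distance
```

With `closest_match("kitten", ["sitting", "mitten", "kitchen"])`, the bound drops from 7 to 2 after "sitting" and to 0 after "mitten", so "kitchen" is rejected after a single row and the result is `("mitten", 1)`.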

I believe that, if you need to get substantially better performance, you may have to accept some strong compromise that computes a more approximate distance (and so gets "a reasonably good match" rather than necessarily the optimal one).

Is this Levenshtein Distance algorithm correct?

Was a comment, but I feel it is probably suitable as an answer:

Short answer is "no", if you want the true shortest distance for any given inputs.

The reason your code appears more efficient (and the reason that other implementations create a matrix instead of doing what you're doing) is that your stepwise implementation ignores a lot of potential solutions.

The examples @BenVoigt gave illustrate this. Another, perhaps clearer, illustration is that ("aaaardvark", "aardvark") returns 8 when it should return 2: the code gets tripped up because it matches the first a and assumes it can move on, when in fact a better solution is to treat the first two characters as insertions.

Are there any string comparison algorithms out there that are better than Levenshtein Distance?

I think the idea is that you tokenize the input before applying Levenshtein. As an alternative, there is the Jaro-Winkler distance too.

There's a .NET library, SimMetrics, which seems to cover a few alternatives.

O(n) or faster algorithm for sorting a list by levenshtein distance?

The discussion below is my long-winded way of saying that your idea (as I understand it) cannot work in the general case. The reason? The Levenshtein distance between two strings of length N can be N even though the strings have identical checksums. A reversed string, for example. Furthermore, the checksum difference between two strings with a Levenshtein distance of 1 can be 255 (or 65,535 with 16-bit Unicode characters). With a range like that, "almost sorting," even if you could do it somehow (see below), isn't going to gain you much.

So you've noticed a correlation between a simple checksum and Levenshtein distance. It's an obvious relationship. If the Levenshtein distance between two strings is small, then those two strings contain mostly the same characters. So computation of the simple checksum will result in very similar values. Sometimes.

As somebody else pointed out, though, the reverse isn't true. The strings abcdef and fedcba have identical checksums, but their Levenshtein distance is fairly large for such a short string.

This isn't true only of reversals. Consider, for example, the string 00000000. Compared with it, the string 0000000~ has a much larger checksum difference than 11111111 does, even though its Lev. distance is much smaller (1 versus 8).
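The numbers are easy to check. The sketch below assumes the "simple checksum" is the sum of the character codes, which is my guess at what the question meant; `simple_checksum` is a made-up name.

```python
def simple_checksum(text: str) -> int:
    """Assumed 'simple checksum': the sum of the character codes."""
    return sum(ord(ch) for ch in text)

base = simple_checksum("00000000")

# A reversal keeps the checksum identical although the strings differ everywhere.
print(simple_checksum("abcdef") == simple_checksum("fedcba"))  # True

# One substituted character (distance 1) moves the checksum a long way...
print(simple_checksum("0000000~") - base)  # 78

# ...while eight substitutions (distance 8) barely move it.
print(simple_checksum("11111111") - base)  # 8
```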

I think you'll find in the general case that the relationship between checksum and Lev. distance is ... sometimes coincidental. But let's ignore that particular problem and move on to your hypothesis about the sorting.

As I understand it (and, truthfully, your question isn't entirely clear on this point), you want to sort a list of strings based on their Levenshtein distance. You don't say distance from what, but I'll assume that somewhere you have a starting string, S, a bunch of other strings [S1, S2, S3, etc.], and you want to sort that list of other strings by Lev. distance from S.

Your hypothesis appears to be that computing a simple checksum for each string will allow you to do that sort more quickly.

The problem is that once you've computed the checksums, you have to sort them. And that's going to take O(n log n) time with a traditional comparison sort (and in any case, at least O(n) time if you have a special-purpose sort). And once you've got that supposedly-almost-ordered list, you have to compute the Lev. distances anyway, and then rearrange the list order to reflect the real distances. But what's the point?

You have to compute the Lev. distances anyway, and you will spend at least O(n) time sorting something. Why go to the extra trouble of computing and sorting checksums when you can more quickly just compute the Lev. distances and sort those?
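The direct approach is only a few lines. This sketch assumes a plain dynamic-programming helper; `levenshtein` and `sort_by_distance` are illustrative names. It relies on `sorted` calling the key function once per element, so each distance is computed exactly once.

```python
from typing import List

def levenshtein(a: str, b: str) -> int:
    """Plain dynamic-programming edit distance, one row at a time."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,                # deletion
                               current[j - 1] + 1,             # insertion
                               previous[j - 1] + (ca != cb)))  # substitution or match
        previous = current
    return previous[-1]

def sort_by_distance(target: str, strings: List[str]) -> List[str]:
    """Sort strings by their edit distance from target (ties keep input order)."""
    return sorted(strings, key=lambda s: levenshtein(target, s))

print(sort_by_distance("kitten", ["sitting", "kitten", "mitten", "kitchen"]))
# ['kitten', 'mitten', 'kitchen', 'sitting']
```

Since the distances are small integers bounded by the string lengths, a counting or bucket sort on them would also work if the comparison sort ever became the bottleneck.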
