Blazingly fast levenshtein distance function
This is the fastest implementation of the Levenshtein distance function I have seen.
It gets its speed from using very little memory, often keeping the buffer entirely in cache, and from reducing the amount of work by skipping any prefix and postfix that won't add to the cost.
What's up with the funky name?
The Levenshtein distance function was first described by Vladimir Iosifovich Levenshtein. Vladimir would be a pretty bland name to base a project on, but Iosifovich is very special, and anyone who knows Vladimir Iosifovich Levenshtein and his distance function has a chance to recognize the purpose of the project just by looking at its name.
To test, you need Catch2 (I guess Catch1 will work too, but I haven't tested it). Compiling the test program takes a fairly long time because of the size of Catch2.
If you can think of a test case that is missing, let me know.
If you decide to use this function, let me know. I don't really have any need for it myself and I would love to hear about people who actually do use it.
I've implemented the classic recursive algorithm as well as the most common optimization, based on dynamic programming. Neither of these implementations has had any optimization done to it.
I've benchmarked the three versions of the Levenshtein distance function against each other by running through /usr/shared/dict/linux.words on my computer, running Fedora 28 on an Intel Core i7-4790 off a Samsung 840 Evo SSD.
Each benchmark is done by reading the word list into a vector of strings and then comparing each string in the vector to every string in the same vector. To get more than just a single datapoint, I've added a limit on the number of words and run a set of benchmarks for 10, 50, 100, 500, 1000, 5000, and 10000 words.
I compiled the benchmark programs with GCC 8.0.1.
The classic version was unable to handle more than 100 words before getting too slow to have any practical value. On the other hand, on (basically) random input, iosifovich outperformed the dynamic version by a factor of nearly 5. With many strings that share a common prefix or postfix, this difference will only become greater.
How does it differ from other implementations?
The classic function as described is an exponentially growing function that takes forever to do anything.
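For reference, the classic recursion can be sketched like this (a minimal C++17 illustration; the function name is my own and not part of this project):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>

// Classic recursive definition: exponential time, for illustration only.
std::size_t lev_recursive(std::string_view a, std::string_view b) {
    if (a.empty()) return b.size();
    if (b.empty()) return a.size();
    if (a.front() == b.front())
        return lev_recursive(a.substr(1), b.substr(1));
    return 1 + std::min({lev_recursive(a.substr(1), b),             // deletion
                         lev_recursive(a, b.substr(1)),             // insertion
                         lev_recursive(a.substr(1), b.substr(1))}); // substitution
}
```

Each call can branch three ways, which is where the exponential blowup comes from.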
Using dynamic programming, this can easily be reduced to O(l1 * l2), where l1 and l2 are the lengths of the strings to compare, by creating an (l1 + 1) * (l2 + 1) matrix, filling it out, and returning the lower-right value of the matrix.
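A sketch of that full-matrix version (my own illustration, not this project's code):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// Full (l1 + 1) x (l2 + 1) dynamic-programming table; the answer is
// the lower-right cell. O(l1 * l2) time and space.
std::size_t lev_matrix(std::string_view a, std::string_view b) {
    std::vector<std::vector<std::size_t>> d(
        a.size() + 1, std::vector<std::size_t>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = i;
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
            d[i][j] = std::min({d[i - 1][j] + 1,      // deletion
                                d[i][j - 1] + 1,      // insertion
                                d[i - 1][j - 1] +     // substitution (or match)
                                    (a[i - 1] != b[j - 1] ? 1u : 0u)});
    return d[a.size()][b.size()];
}
```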
A few people figured out that you only need two lines of data; the one you're updating now and the previous one.
That's the best I have seen anyone do before I made this implementation.
The (l1 + 1) * (l2 + 1) matrix can be reduced to a single line, rather than the two lines that others have used. This greatly improves performance; on my machine, it doubled the throughput. The point is that you only need three values to update a cell in the matrix, and you can keep two of them in the buffer while keeping the third in a fixed location.
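The single-line trick can be sketched like this (again my own illustration; `diag` is the fixed location holding the third value that the buffer would otherwise overwrite):

```cpp
#include <algorithm>
#include <cstddef>
#include <string_view>
#include <vector>

// One-row DP: `row` holds the previous row of the matrix, and `diag`
// carries the single diagonal value that updating row[j] would destroy.
std::size_t lev_one_row(std::string_view a, std::string_view b) {
    std::vector<std::size_t> row(b.size() + 1);
    for (std::size_t j = 0; j <= b.size(); ++j) row[j] = j;
    for (std::size_t i = 1; i <= a.size(); ++i) {
        std::size_t diag = row[0];  // d[i-1][0]
        row[0] = i;
        for (std::size_t j = 1; j <= b.size(); ++j) {
            std::size_t next_diag = row[j];  // d[i-1][j], the next diagonal
            row[j] = std::min({row[j] + 1,       // deletion
                               row[j - 1] + 1,   // insertion
                               diag +            // substitution (or match)
                                   (a[i - 1] != b[j - 1] ? 1u : 0u)});
            diag = next_diag;
        }
    }
    return row[b.size()];
}
```

The buffer is small enough that, for ordinary words, it stays entirely in cache.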
In reality, this is often far more than you actually need. By checking for common prefixes and postfixes, you can reduce the size of the strings you need to check by a fair amount. If the strings are identical, or near identical, the work required tends towards O(n).
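The prefix/postfix trimming can be sketched as follows (a hypothetical helper of my own, not this project's API):

```cpp
#include <cstddef>
#include <string_view>

// Strip the longest common prefix and postfix before running the DP;
// only the differing infix remains, so identical strings cost nothing.
void trim_common_affixes(std::string_view& a, std::string_view& b) {
    // Longest common prefix.
    std::size_t p = 0;
    while (p < a.size() && p < b.size() && a[p] == b[p]) ++p;
    a.remove_prefix(p);
    b.remove_prefix(p);
    // Longest common postfix (of what remains).
    std::size_t s = 0;
    while (s < a.size() && s < b.size() &&
           a[a.size() - 1 - s] == b[b.size() - 1 - s]) ++s;
    a.remove_suffix(s);
    b.remove_suffix(s);
}
```

After trimming, the DP only runs on the (usually much shorter) infixes.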
Observing that the cost is at least the difference between the lengths of the strings, the cost can be reduced even further.
The total cost can thus be transformed from an awful, exponential function to O(l1 * l2), and further to O(n), where n is the length of the infix which doesn't include the longest common pre- or postfix.
There are three further optimizations for the case when we only care about strings within some edit distance k of a particular string. The first is obvious: If during the computation the intermediate value for the edit distance reaches k, we can return immediately without further computation.
The second is more subtle: If the edit distance is at most k, then any path in the dynamic programming table that defines a global alignment of distance at most k cannot contain any cell further away from (i, i) than k. That is, a misalignment of more than k requires that there be more than k insertions or deletions. Therefore, it suffices to compute the cells of the dynamic programming table only for cells between (i - k, i) and (i + k, i). As there are 2k+1 cells in this range, the complexity is reduced to O(n * (2k+1)).
Finally, when the edit distance is at most k, the number of insertions and deletions is no more than k. (In our coordinate system deletions correspond to shifting left and insertions to shifting right.) Also, left shifts "cancel out" right shifts, so there can be no more than k/2 left shifts if the alignment path ends on (n, n) (for n <= m), as there would then need to be k/2 right shifts to return to (n, n), giving more than k/2 + k/2 = k edits. It therefore suffices to compute the cells of the dynamic programming table for cells between (i - k/2, i) and (i + k/2, i). This reduces the complexity to O(n * (k+1)), where n is the length of the infix which doesn't include the longest common pre- or postfix and k is the largest edit distance we care about.
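A bounded, banded computation might be sketched as follows (my own illustration, combining the diagonal band with the early exit described above; `lev_bounded` is a hypothetical name and the band here is the simpler +-k one, not the tighter k/2 variant):

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>
#include <string_view>
#include <utility>
#include <vector>

// One-row DP restricted to the band |i - j| <= k, with an early exit
// once every cell in the band exceeds k. Returns nullopt if the
// distance is greater than k.
std::optional<std::size_t> lev_bounded(std::string_view a,
                                       std::string_view b, std::size_t k) {
    if (a.size() > b.size()) std::swap(a, b);
    if (b.size() - a.size() > k) return std::nullopt;  // length gap alone > k

    const std::size_t inf = k + 1;  // sentinel: "more than k edits"
    std::vector<std::size_t> row(b.size() + 1, inf);
    for (std::size_t j = 0; j <= std::min(b.size(), k); ++j) row[j] = j;

    for (std::size_t i = 1; i <= a.size(); ++i) {
        const std::size_t lo = (i > k) ? i - k : 1;
        const std::size_t hi = std::min(b.size(), i + k);
        std::size_t diag = row[lo - 1];          // d[i-1][lo-1]
        if (lo == 1) row[0] = std::min(i, inf);  // first column of the table
        std::size_t best = inf;
        for (std::size_t j = lo; j <= hi; ++j) {
            const std::size_t up = row[j];  // d[i-1][j]
            const std::size_t left =        // d[i][j-1]; out of band at j == lo
                (j == lo && lo > 1) ? inf : row[j - 1];
            const std::size_t subst = diag + (a[i - 1] != b[j - 1] ? 1u : 0u);
            row[j] = std::min({up + 1, left + 1, subst, inf});
            diag = up;
            best = std::min(best, row[j]);
        }
        if (best >= inf) return std::nullopt;  // early exit: whole band past k
    }
    return row[b.size()] <= k ? std::optional<std::size_t>(row[b.size()])
                              : std::nullopt;
}
```

Cells outside the band are treated as "more than k", which is why backtracking through the full table is no longer possible.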
Some of these optimizations eliminate the ability to do backtracking, so backtracking is not supported.
Comparisons Between the Versions
iosifovitch           Total number of words: 10000   user 0m15.907s
iosifovitch2          Total number of words: 10000   user 0m41.614s
iosifovitch3          Total number of words: 10000   user 0m16.134s
iosifovitch_original  Total number of words: 10000   user 0m28.565s
dynamic               Total number of words: 10000   user 2m25.505s