HTTPS SSH

Blazingly fast levenshtein distance function

This is the fastest implementation of the levenshtein distance function I have seen.

It gets it's speed from using very little memory, often keeping the buffer entirely in cache, and reducing the amount of work by skipping any prefix and postfix that won't add to the cost.

What's up with the funky name?

The levenshtein distance function was first described by Vladimir Iosifovich Levenshtein. Vladimir would be a pretty bland name to base a project on, but Iosifovich is very special and anyone who knows Vladimir Iosifovich Levenshtein, and his distance function, have a chance to regonize the purpose of the project just by looking at the name of the project.

Testing

To test, you need catch2 (I guess catch1 will work to, but I haven't tested it). Compiling the test program takes a fairly long time because of the size of catch2.

If you can think of a test case that is missing, let me know.

Acknowledgement

If you decide to use this function, let me know. I don't really have any need for it myself and I would love to hear about people who actually does use it.

Benchmarking

I've implemented the classic recursive algorithm as well as the most common opitmization, based on dynamic programming. Neither of these implementations have had any optimization done to them.

I've benchmarked the three versions of the Levenshtein distance function against each other by running through /usr/shared/dict/linux.words on my computer, running Fedora 28 on an Intel Core i7-4790 of a Samsung 840 Evo SSD.

Each benchmark is done by reading the word list into a vector of strings and then comparing each string in the vector to every string in the same vector. To get more than just a single datapoint, I've made a test with a limit on the number of words. I've run a set of benchmarks for 10, 50, 100, 500, 1000, 5000, 10000.

Compiling the benchmarks programs, I used GCC 8.0.1 and compiled it with -O3 -std=c++17.

#words classic dynamic iosifovich
10 0.029s 0.002s 0.001s
50 0.101s 0.001s 0.001s
100 0.390s 0.003s 0.002s
500 DNF 0.098s 0.016s
1000 DNF 0.372s 0.077s
5000 DNF 11.353s 2.204s
10000 DNF 42.931s 8.755s
50000 DNF 1106.787s 236.294s

The classic version was unable to handle more than 100 words before getting too slow have any practical value. On the other hand, on random input (basically), iosifovich outperformed the dynamic version by a factor of nearly 5. With many strings that share a common prefix or common suffix, this difference will only become greater.

How does it differ from other implementations ?

The classic function as described is an exponentialy growing function that takes forever to do anything.

Using dynamic programming this can easily be reduced to O(l_1 * l_2), where l1 and l2 are the lengths of the strings to compare, by creating an (l1 + 1) * (l2 + 1) matrix, filling it out and returning the lower right value of the matrix.

A few people figured out that you only need two lines of data; the one you're updating now and the previous one.

That's the best I have seen anyone do before I made this implementation.

The (l1 + 1) * (l2 + 1) matrix can be reduced to a single line -- rather than the two lines that others have used. This greatly improves performance -- on my machine, this doubled the throughput. The point is, you only really need to know the three values when you want to update a cell in the matrix and you can keep two of them in a buffer, while keeping the third value in a fixed location.

In reality, this is often far more than you actually need. By checking for common prefix and postfixes, you can reduce the size of the strings you need to check by a fair amount. If the strings are identical, or near identical, the work required tends towards O(n).

Observing that the cost is at least the distance between the lengths of the strings, the cost can be reduced even further.

The total cost can thus be transformed from an awful, expoential function, to O(l1 * l2) and further to O(n), where n is the length of the infix which doesn't include the longest common pre- or postfix.

Some of these optimizations eleminates the ability to do backtracking, which means it's not supported.