Sorting strings is a necessary feature if you want to display text data, and sooner or later you will, as I did. Here is what I found about it.
ASCII? OK, simple enough. With more characters? You're using UTF-8, aren't you? Lucky you! UTF-8 can be lexically sorted byte by byte! (^_^)/
But what if you want to be as fast as possible, e.g. in a profiler? Sorting numbers is done fast with radix-sort, 8bit at a time, or maybe 11. Strings are array of bytes. Yes! We can apply a radix-sort with an arbitrary amount of 8bit blocks!
If you're sorting your data by multiple keys you need your sorting to be stable. Radix-sort can be made stable with the help of some memory.
In the details
Radix-sort is not so good per se with low cardinality arrays. But insertion-sort is! So, let's run some tests and discover the turn point and let's put together our radix-insertion-sort à la Frankenstein!
It seems to be around 150 elements.
And here we can admire the fabulous prodige!
Notes on the tests
- All the graphs have:
- on the Y axis time in ns unit;
- on the X axis the sorted strings count.
- I have generated the strings with a xorshift128plus, byte by byte, so they are not UTF-8 correct. Real cases would have UTF-8 correct strings, so a different distribution on the radix buckets.
- The strings have a random byte length between 5 and 1023 and are null terminated.
- I have not timed the copying of data into the arrays, nor the length computing, together with the sorting call.
- The graphed times are the minimum obtained through 16 tries, on the same data.
- The strings are generated once, at the beginning, so all the calls are on the same data.
Do what you want with the code: it's under the zlib license! LICENSE