HTTPS SSH

String Sorting

Sorting strings is a necessary feature if you want to display text data, and sooner or later you will, as I did. Here is what I found about it.

Which encoding?

ASCII? OK, simple enough. With more characters? You're using UTF-8, aren't you? Lucky you! UTF-8 can be lexically sorted byte by byte! (^_^)/

Performance

But what if you want to be as fast as possible, e.g. in a profiler? Sorting numbers is done fast with radix-sort, 8bit at a time, or maybe 11. Strings are array of bytes. Yes! We can apply a radix-sort with an arbitrary amount of 8bit blocks!

Stability

If you're sorting your data by multiple keys you need your sorting to be stable. Radix-sort can be made stable with the help of some memory.

In the details

Radix-sort is not so good per se with low cardinality arrays. But insertion-sort is! So, let's run some tests and discover the turn point and let's put together our radix-insertion-sort à la Frankenstein!

log scale graph

It seems to be around 150 elements. graph close up near the turn point

And here we can admire the fabulous prodige! log scale graph

Notes on the tests

  1. All the graphs have:
    • on the Y axis time in ns unit;
    • on the X axis the sorted strings count.
  2. I have generated the strings with a xorshift128plus, byte by byte, so they are not UTF-8 correct. Real cases would have UTF-8 correct strings, so a different distribution on the radix buckets.
  3. The strings have a random byte length between 5 and 1023 and are null terminated.
  4. I have not timed the copying of data into the arrays, nor the length computing, together with the sorting call.
  5. The graphed times are the minimum obtained through 16 tries, on the same data.
  6. The strings are generated once, at the beginning, so all the calls are on the same data.

License

Do what you want with the code: it's under the zlib license! LICENSE