Locale-aware sorting

Comments (2)

Andrey Golovizin

Hi, I've been thinking about adding support for the Unicode Collation Algorithm. It would put "Édouard" just before "Edward", I believe. Additionally, it is locale-independent so the output of Pybtex stays the same regardless of the locale settings. What do you think?

2018-05-22T13:55:28+00:00
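
To make the ordering in the comment above concrete, here is a minimal sketch of a locale-independent, accent-insensitive sort using only the standard library. It is not the Unicode Collation Algorithm, only a crude stand-in for its first (base letter) comparison level; the helper name `accent_insensitive_key` is hypothetical and is not part of Pybtex.

```python
import unicodedata

def accent_insensitive_key(name):
    """Crude approximation of a collation key (NOT the real UCA).

    Hypothetical helper for illustration only.
    """
    # Decompose accented letters into base letter + combining marks
    decomposed = unicodedata.normalize("NFD", name)
    # Drop the combining marks so that "É" compares like "E"
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Compare base letters first; the original string only breaks ties
    return (base.casefold(), name)

names = ["Zoe", "Edward", "Édouard"]
print(sorted(names, key=accent_insensitive_key))
# → ['Édouard', 'Edward', 'Zoe']
```

Because the key does not consult any locale setting, the result is the same on every machine, which is the property described in the comment. A real UCA implementation also handles contractions, expansions and multi-level weights that this sketch ignores.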

Eric Marsden reporter

I'm not a specialist on these issues, but it seems that there are only limited differences between ISO 14651 (as implemented by Python's locale-aware sorting) and the Unicode Collation Algorithm:

http://unicode.org/faq/collation.html#13

Using the Unicode Collation Algorithm also probably implies a dependency on PyICU, which is a dependency of more than 30 MB.

And it seems that the Unicode Collation Algorithm does have some locale-specific functionality, through the tailorings in the Common Locale Data Repository (CLDR).

A final point: if users are expecting exact replication of historical bibtex output, this functionality needs to be conditional on a command-line option. My personal preference is for a tool that works in a more modern manner (for example, accepting UTF-8 encoded inputs and respecting LC_* environment variables for collation); I don't see locale-independence as a feature (but I expect there are different opinions on this).

2018-05-22T15:20:44+00:00
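
For comparison, the locale-aware behaviour described in the comment above could look roughly like the sketch below, which relies on Python's standard `locale` module. The function name `locale_sorted` and its fallback behaviour are assumptions made for illustration; this is not how Pybtex sorts, and the resulting order depends on which locales are installed on the system.

```python
import locale

def locale_sorted(names, locale_name=""):
    """Sort names using the collation rules of the given locale.

    Hypothetical helper. An empty locale_name means "take the
    locale from the LC_* / LANG environment variables".
    """
    # Remember the current collation setting so it can be restored
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        # The requested locale is not installed; keep the current one
        pass
    try:
        # strxfrm turns each string into a key that sorts per LC_COLLATE
        return sorted(names, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

print(locale_sorted(["Zoe", "Edward", "Édouard"]))
```

No expected output is shown because it varies: under the "C" locale the accented name typically sorts last, while under a locale such as `fr_FR.UTF-8` (if installed) it sorts next to "Edward". That variability is exactly the trade-off the two comments disagree about.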
