Locale-aware sorting

Issue #115 new
Eric Marsden created an issue

In bst mode with a bibliography style that sorts references, the sorting uses Python's built-in sort without configuring it to be locale-aware. As a consequence, accented surnames are placed at the end of the bibliography rather than where bibliographic convention puts them. For instance, with a UTF-8 locale, "Édouard" should come just before "Edward", but it is currently relegated to the end of the bibliography list.

The attached patch fixes this. It uses Python 3 functionality, which will most likely need to be modified if you intend to retain Python 2 compatibility.
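
For readers without the attachment, here is a minimal sketch of the general technique described above. It is not the attached patch. The helper name `locale_sorted` and the sample names are made up for illustration; only `locale.setlocale` and `locale.strxfrm` from the standard library are assumed.

```python
import locale

names = ["Edward", "Zola", "Édouard", "Adams"]

# The built-in sort compares Unicode code points, so "É" (U+00C9)
# lands after every unaccented ASCII letter, including "Z".
codepoint_order = sorted(names)
# codepoint_order == ["Adams", "Edward", "Zola", "Édouard"]


def locale_sorted(items, locale_name=""):
    """Sort strings using the collation rules of a locale (hypothetical helper).

    An empty locale_name selects the locale from the environment
    (LC_ALL, LC_COLLATE, LANG). locale.setlocale raises locale.Error
    if that locale is not installed on the system.
    """
    saved = locale.setlocale(locale.LC_COLLATE)  # query the current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
        # strxfrm maps each string to a key that compares in collation order.
        return sorted(items, key=locale.strxfrm)
    finally:
        # Restore the previous setting; setlocale changes process-wide state.
        locale.setlocale(locale.LC_COLLATE, saved)
```

Calling `locale_sorted(names)` under a UTF-8 locale such as `en_US.UTF-8` should place "Édouard" next to "Edward", although the exact order depends on the platform's collation tables. Because `setlocale` affects the whole process, the sketch restores the previous setting before returning.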

Comments (2)

  1. Andrey Golovizin

    Hi, I've been thinking about adding support for the Unicode Collation Algorithm. It would put "Édouard" just before "Edward", I believe. Additionally, it is locale-independent, so the output of Pybtex would stay the same regardless of the locale settings. What do you think?

  2. Eric Marsden reporter

    I'm not a specialist on these issues, but it seems that there are only limited differences between ISO 14651 (as implemented by Python's locale-aware sorting) and the Unicode Collation Algorithm:

    http://unicode.org/faq/collation.html#13

    Using the Unicode Collation Algorithm also probably implies a dependency on PyICU, which is more than 30 MB in size.

    And it seems that the Unicode Collation Algorithm does have some locale-specific functionality, in the Common Locale Data Repository.

    A final point: if users expect exact replication of historical BibTeX output, this functionality needs to be conditional on a command-line option. My personal preference is for a tool that works in a more modern manner (for example, accepting UTF-8 encoded input and respecting the LC_* environment variables for collation). I don't see locale-independence as a feature, but I expect there are different opinions on this.
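
The command-line switch suggested above could look roughly like the following sketch. The option name `--collation`, its values, and the helper names are hypothetical and are not existing Pybtex options. The default keeps code-point ordering, so existing output would not change unless the user opts in.

```python
import argparse
import locale


def build_parser():
    """Build a parser with a hypothetical collation switch."""
    parser = argparse.ArgumentParser(description="Collation option sketch")
    parser.add_argument(
        "--collation",
        choices=["codepoint", "locale"],
        default="codepoint",  # assumed default: keep the historical ordering
        help="how to compare sort keys when ordering references",
    )
    return parser


def make_sort_key(collation):
    """Return a key function for sorted(), or None for code-point order."""
    if collation == "locale":
        # Honour LC_ALL / LC_COLLATE / LANG; raises locale.Error if the
        # locale named in the environment is not installed.
        locale.setlocale(locale.LC_COLLATE, "")
        return locale.strxfrm
    return None  # sorted(..., key=None) compares Unicode code points


def sort_names(names, argv):
    """Sort names according to the collation chosen on the command line."""
    args = build_parser().parse_args(argv)
    return sorted(names, key=make_sort_key(args.collation))
```

For example, `sort_names(names, ["--collation", "locale"])` would follow the user's locale, while `sort_names(names, [])` keeps the current behaviour.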
