
CLDR – a Python library for the Unicode Common Locale Data Repository

CLDR features

  • CLDR will provide a low-level API that covers the entire Unicode Common Locale Data Repository and a high-level API for the most commonly used CLDR data.

    Exposing all of CLDR is important because it removes the need for alternative CLDR parser implementations. The CLDR database contains a lot of information, some of it only used in specialized applications (e.g. «what is the minimum number of days to consider a week the first week of a year?»). Providing a complete high-level API is likely too much effort right now, but at least a user should never have to implement a CLDR XML parser.

  • In addition, the Python CLDR library will allow users to update and change the repository data manually. This supports the following use cases:

    • Rapid fixing of errors in the repository.

      For example, in /locales/core/common/main/de.xml the German localisation uses identical strings for the era names and the era abbreviations:

      <eraNames>
          <era type="0">v. Chr.</era>
          <era type="1">n. Chr.</era>
      </eraNames>
      <eraAbbr>
          <era type="0">v. Chr.</era>
          <era type="1">n. Chr.</era>
      </eraAbbr>
      

      There is another data bug in colCaseLevel, where the types yes and no map to the same string:

      <types>
          …
          <type type="no" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
          …
          <type type="yes" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
          …
      </types>
      

      These errors show that users need a way to fix the data locally. Otherwise developers will build a custom layer to change the data afterwards, or use CLDR only as a source for other formats such as PO files. Ideally users can configure the data location (so they don't have to change their system Python) and update the data without depending on a C compiler or header files, which eases deployment on restricted hosts such as shared web hosting.
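
The local-fix scenario above can be sketched in a few lines. This is only an illustration under stated assumptions, not the library's API: the helper names load_eras and identical_names_and_abbreviations, the LOCAL_FIXES mapping and its replacement strings are all hypothetical, and the XML is the era excerpt quoted above wrapped in an assumed root element.

```python
import xml.etree.ElementTree as ET

# Era excerpt quoted above, wrapped in a root element (the wrapper is an assumption).
DE_ERAS = """
<eras>
  <eraNames>
    <era type="0">v. Chr.</era>
    <era type="1">n. Chr.</era>
  </eraNames>
  <eraAbbr>
    <era type="0">v. Chr.</era>
    <era type="1">n. Chr.</era>
  </eraAbbr>
</eras>
"""

# Hypothetical local corrections, keyed by (element name, era type).
# The strings are illustrative only, not vetted CLDR data.
LOCAL_FIXES = {
    ("eraNames", "0"): "vor Christus",
    ("eraNames", "1"): "nach Christus",
}


def load_eras(xml_text, fixes=None):
    """Return {(element name, era type): text} with local fixes layered on top."""
    root = ET.fromstring(xml_text)
    data = {}
    for group in root:
        for era in group.findall("era"):
            data[(group.tag, era.get("type"))] = era.text
    # Local overrides win over the shipped data; the XML itself is not modified.
    data.update(fixes or {})
    return data


def identical_names_and_abbreviations(data):
    """List the era types whose full name equals the abbreviation."""
    return sorted(
        era_type
        for (tag, era_type), text in data.items()
        if tag == "eraNames" and data.get(("eraAbbr", era_type)) == text
    )


suspicious = identical_names_and_abbreviations(load_eras(DE_ERAS))
after_fix = identical_names_and_abbreviations(load_eras(DE_ERAS, LOCAL_FIXES))
```

Keeping the corrections in a separate overlay rather than editing the shipped XML is one possible design: the fixes survive a data update and need nothing beyond the Python standard library.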

Evaluation of alternatives

ICU

ICU is the C++ reference implementation which provides (high-level) access to CLDR data. CLDR data is stored inside a shared library.

  • Pros:
    • well known, established library
    • reasonably RAM efficient because the OS will share libicudata.so between all processes
    • efficient tools to strip down the data
  • Cons:
    • does not provide complete access to CLDR, e.g. script orientation (right-to-left) was not exposed until ICU 51 (uscript_isRightToLeft(script))
    • local data updates are complicated: users need to rebuild ICU (and possibly Python)
    • wrapping all classes takes a lot of effort (if you want to provide a PEP8-compatible API)

PyICU

PyICU inherits the limitations of ICU.

In addition, PyICU wraps only a subset of ICU, and its method names are not PEP8-compatible.

Own ICU implementation

Some parts are already available in Python:

  • the Unicode database
  • regular expressions for Unicode strings
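
As a rough illustration of what the standard library already covers, the sketch below uses the unicodedata and re modules. It assumes Python 3, where str patterns are Unicode-aware by default; the helper is_rtl is a hypothetical name, and it only checks the bidirectional class of a single character, which is much narrower than the script-level orientation data in CLDR.

```python
import re
import unicodedata

text = "Größe שלום"


def is_rtl(char):
    """True if the character has a strong right-to-left bidirectional class."""
    # "R" is right-to-left (e.g. Hebrew), "AL" is right-to-left Arabic.
    return unicodedata.bidirectional(char) in ("R", "AL")


# \w matches Unicode word characters for str patterns in Python 3.
words = re.findall(r"\w+", text)

# General category from the Unicode database: "Ll" means lowercase letter.
category = unicodedata.category("ß")
```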

Performance

Initial performance benchmarks for different CLDR-based libraries.
