CLDR – a Python library for the Unicode Common Locale Data Repository
Exposing all of CLDR is important so that this implementation eliminates the need for alternative CLDR parser implementations. The CLDR database contains a lot of information, some of it used only in specialized applications (e.g. «what is the minimum number of days to consider a week the first week of a year?»). Providing a complete high-level API is likely too much effort right now, but at the very least a user should never have to implement a CLDR XML parser.
In addition, the Python CLDR library will allow users to update and change the repository data manually. This will support the following scenarios:
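The week-of-year question quoted above is the kind of data that plain low-level access could expose without a dedicated high-level API. The following is a minimal sketch, not the interface of this library: it assumes the supplemental week data uses minDays elements with count and territories attributes, and it parses a shortened inline sample instead of the real file. The function names min_days_by_territory and min_days are hypothetical.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative stand-in for CLDR's supplemental week data.
# The territory lists are abbreviated and not copied from a release.
SAMPLE = """
<supplementalData>
  <weekData>
    <minDays count="1" territories="001 US"/>
    <minDays count="4" territories="DE FR GB"/>
  </weekData>
</supplementalData>
"""


def min_days_by_territory(xml_text):
    """Map each territory code to its minDays value."""
    root = ET.fromstring(xml_text)
    result = {}
    for element in root.iter("minDays"):
        count = int(element.get("count"))
        for territory in element.get("territories", "").split():
            result[territory] = count
    return result


def min_days(table, territory):
    """Look up a territory, falling back to the world default "001"."""
    return table.get(territory, table["001"])


table = min_days_by_territory(SAMPLE)
print(min_days(table, "DE"))  # value listed explicitly in the sample
print(min_days(table, "JP"))  # not listed, so the "001" default applies
```

A caller who needs this one value gets it with a dictionary lookup and never touches the XML.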
Rapid fixing of errors in the repository.
For example, /locales/core/common/main/de.xml makes no difference between the name and the abbreviation of the era in the German localisation:
<eraNames>
  <era type="0">v. Chr.</era>
  <era type="1">n. Chr.</era>
</eraNames>
<eraAbbr>
  <era type="0">v. Chr.</era>
  <era type="1">n. Chr.</era>
</eraAbbr>
Another data bug affects colCaseLevel, where the types yes and no map to the same string:
<types>
  …
  <type type="no" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
  …
  <type type="yes" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
  …
</types>
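Data bugs like the colCaseLevel one can be found mechanically once the XML is accessible from Python. The sketch below is only an illustration under stated assumptions: it works on an inline fragment reduced to the two affected entries, and the helper name duplicate_type_strings is hypothetical.

```python
from collections import defaultdict
import xml.etree.ElementTree as ET

# Inline fragment reduced to the two entries discussed above.
FRAGMENT = """
<types>
  <type type="no" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
  <type type="yes" key="colCaseLevel" draft="contributed">Nach Groß-/Kleinschreibung sortieren</type>
</types>
"""


def duplicate_type_strings(xml_text):
    """Return {(key, text): [type, ...]} for strings shared by several types."""
    root = ET.fromstring(xml_text)
    seen = defaultdict(list)
    for element in root.iter("type"):
        seen[(element.get("key"), element.text)].append(element.get("type"))
    # Keep only the strings that more than one type value maps to.
    return {k: v for k, v in seen.items() if len(v) > 1}


duplicates = duplicate_type_strings(FRAGMENT)
for (key, text), type_values in duplicates.items():
    print(key, type_values, text)
```

Not every shared string is an error, so a check like this can only list candidates for manual review.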
These errors show that users need a way to fix the data locally. Otherwise developers will build a custom layer to change the data later on, or use CLDR only as a source for other formats such as PO files. Ideally users can configure the data location (so they don't have to change their system Python) and update the data without depending on a C compiler or header files, which eases deployment on restricted hosts such as shared web hosting.
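A configurable data location combined with a pure-Python local fix could look roughly like the sketch below. Everything named here is an assumption made for illustration: the environment variable PYCLDR_DATA_DIR, the default path, the helpers data_dir and patch_text, the simplified element nesting of the sample file, and the replacement string, which is a placeholder and not a proposed correction.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path

# Hypothetical environment variable naming a user-writable data directory.
ENV_VAR = "PYCLDR_DATA_DIR"

# Simplified sample; real locale files nest these elements more deeply.
SAMPLE = '<ldml><eraNames><era type="0">v. Chr.</era></eraNames></ldml>'
ERA_PATH = "eraNames/era[@type='0']"


def data_dir(environ, default):
    """Return the configured data directory, or the default if unset."""
    return Path(environ.get(ENV_VAR, default))


def patch_text(xml_path, element_path, new_text):
    """Replace the text of one element and save the file in place."""
    tree = ET.parse(str(xml_path))
    element = tree.getroot().find(element_path)
    if element is None:
        raise LookupError(f"no element matches {element_path!r}")
    element.text = new_text
    tree.write(str(xml_path), encoding="utf-8", xml_declaration=True)


# Demo in a temporary directory standing in for the user's data location.
with tempfile.TemporaryDirectory() as tmp:
    directory = data_dir({ENV_VAR: tmp}, default="/usr/share/cldr")
    locale_file = directory / "de.xml"
    locale_file.write_text(SAMPLE, encoding="utf-8")
    patch_text(locale_file, ERA_PATH, "vor Christus")  # placeholder value
    patched = ET.parse(str(locale_file)).getroot().find(ERA_PATH).text

print(patched)
```

Because only the standard library is involved, a fix like this needs neither a compiler nor write access to the system Python installation.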
Evaluation of alternatives
ICU is the C++ reference implementation which provides (high-level) access to CLDR data. CLDR data is stored inside a shared library.
- well known, established library
- reasonably RAM efficient because the OS will share libicudata.so between all processes
- efficient tools to strip down the data
- does not provide complete access to CLDR, e.g. orientation (right-to-left) was not exposed until ICU 51 (uscript_isRightToLeft(script))
- local data updates are complicated: users need to rebuild ICU (and possibly Python)
- wrapping all classes takes a lot of effort (if you want to provide a PEP 8-compatible API)
PyICU inherits the limitations of ICU.
In addition, PyICU implements only a subset of ICU, and its method names are not PEP 8 compatible.
Own ICU implementation
Some parts are already available in Python:
- the Unicode database
- regular expressions for Unicode strings
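Both building blocks listed above ship with the standard library as the unicodedata and re modules. The short sketch below only shows what they offer out of the box; the sample strings are chosen for illustration.

```python
import re
import unicodedata

text = "Groß-/Kleinschreibung"

# Character properties from the bundled Unicode database.
sharp_s_name = unicodedata.name("ß")
sharp_s_category = unicodedata.category("ß")
alef_direction = unicodedata.bidirectional("\u05d0")  # HEBREW LETTER ALEF

# str patterns are Unicode-aware by default, so \w also matches "ß".
words = re.findall(r"\w+", text)

print(sharp_s_name)      # LATIN SMALL LETTER SHARP S
print(sharp_s_category)  # Ll
print(alef_direction)    # R
print(words)             # ['Groß', 'Kleinschreibung']
```

Locale-specific data such as translated names or collation settings is not covered by these modules and would still have to come from CLDR.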
Initial performance benchmarks for different CLDR-based libraries.