Optimized Unicode Representation

Issue #2309 new
Philip Jenvey
created an issue

PyPy needs PEP 393-like optimized unicode strings. This is for Python 3.3 compat and the potential space savings/perf, but it should work on PyPy2 as well.

The plan is to use UTF-8 strings as the underlying storage.

wayedt made a prototype during GSoC in the utf8-unicode2 branch:

https://bitbucket.org/pypy/pypy/branches/compare/utf8-unicode2%0Ddefault
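
To make the space trade-off concrete, here is a rough sketch in plain Python that compares the payload size of a PEP 393-style layout (a fixed 1, 2, or 4 bytes per code point, chosen by the widest code point in the string) with the size of the same text encoded as UTF-8. The function names are made up for illustration, object headers are ignored, and nothing here is taken from the utf8-unicode2 branch.

```python
def pep393_payload_bytes(s: str) -> int:
    """Approximate PEP 393 payload size: every code point uses the width
    required by the largest code point in the string (header excluded)."""
    if not s:
        return 0
    top = max(map(ord, s))
    if top < 0x100:
        width = 1      # Latin-1 range
    elif top < 0x10000:
        width = 2      # Basic Multilingual Plane
    else:
        width = 4      # anything above the BMP
    return width * len(s)


def utf8_payload_bytes(s: str) -> int:
    """Size of the same text stored as UTF-8 (1 to 4 bytes per code point)."""
    return len(s.encode("utf-8"))


if __name__ == "__main__":
    for sample in ["hello", "h\u00e9llo", "\u65e5\u672c\u8a9e", "hello \U0001F600"]:
        print(repr(sample), pep393_payload_bytes(sample), utf8_payload_bytes(sample))
```

Neither layout wins everywhere: a single non-BMP character widens every character of a PEP 393-style string to 4 bytes, while pure CJK text takes 3 bytes per character in UTF-8 versus 2 in the fixed-width layout.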

Comments (1)

  1. Marti R.

    Sorry to spam this, but +1 for using UTF-8 as the internal Unicode representation. Nearly all modern protocols and APIs transmit Unicode data in UTF-8, and I think CPython's approach results in too many unnecessary data conversions and copies, not to mention that it is inefficient when the string contains higher code points.

    The advantage of CPython's approach, having O(1) access to individual characters by index, is rarely useful.
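
For reference, this is roughly what indexing by code point costs when the storage is raw UTF-8: without an auxiliary index, the lookup has to walk the bytes from the start, so it is O(n) rather than O(1). The sketch below is a naive illustration only; `codepoint_at` is a hypothetical helper, it assumes the input is valid UTF-8, and it does not reflect how PyPy or the prototype branch handles indexing.

```python
def codepoint_at(data: bytes, index: int) -> str:
    """Return the code point at position `index` in valid UTF-8 `data`
    by scanning from the beginning (linear time)."""
    if index < 0:
        raise IndexError("negative indices are not handled in this sketch")
    pos = 0      # byte offset of the current code point
    count = 0    # number of code points skipped so far
    while pos < len(data):
        lead = data[pos]
        # The lead byte determines how many bytes the code point occupies.
        if lead < 0x80:
            size = 1
        elif lead >= 0xF0:
            size = 4
        elif lead >= 0xE0:
            size = 3
        else:
            size = 2
        if count == index:
            return data[pos:pos + size].decode("utf-8")
        pos += size
        count += 1
    raise IndexError("code point index out of range")
```

Sequential iteration over such a buffer stays cheap, since each step only needs the current lead byte; it is random access by index that pays for the scan.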
