unicodedata does not protect against larger-than-0x10ffff values

Issue #2481 new
Armin Rigo created an issue

You can get nonsense results, and potentially crashes, from unicodedata by doing this:

>>>> import array
>>>> u = array.array('u', 'abcd').tounicode()
>>>> u
u'\U64636261'
>>>> u.isspace()
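
To see where the value in the session above comes from, here is a minimal sketch that reinterprets the four bytes of 'abcd' as a single 32-bit integer. It assumes a little-endian machine and a 4-byte 'u' item size (neither is stated in the report), and it is written for Python 3, so it only illustrates the arithmetic and is not the PyPy code path. MAX_CODEPOINT is a name introduced here for illustration.

```python
import struct

# Highest valid Unicode code point (name chosen for this sketch).
MAX_CODEPOINT = 0x10FFFF

# The four bytes of 'abcd', read as one little-endian unsigned 32-bit integer.
# Assumption: 4-byte array items on a little-endian machine.
raw = b"abcd"
(code,) = struct.unpack("<I", raw)

print(hex(code))             # the value that ends up stored as one "character"
print(code > MAX_CODEPOINT)  # whether it lies outside the Unicode range
```

Under those assumptions the integer matches the \U64636261 escape shown above, far beyond 0x10ffff.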

Comments (6)

  1. Carl Friedrich Bolz-Tereick

    The unicodedata module itself handles this correctly; it seems to be only the unicode methods that cause problems.

  2. Armin Rigo reporter

    I'm unsure what you mean. For example, unicodedata/unicodedb_5_2_0.py will crash with an IndexError (a segfault after translation) if you call isspace(x) with a value of x outside the official Unicode range. So I would say instead that unicodedata does nothing about this by itself.

  3. Carl Friedrich Bolz-Tereick

    I am saying that many of the functions in the applevel unicodedata module handle out-of-range characters without crashing:

    >>>> import unicodedata, array
    >>>> u = array.array('u', 'abcd').tounicode()
    >>>> u
    u'\U64636261'
    >>>> unicodedata.name(u)
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
    ValueError: no such name
    
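
The kind of guard discussed in this thread could look roughly like the sketch below: check that the value is inside the Unicode range before consulting any lookup table. This is not PyPy's actual fix. The names in_unicode_range and checked_isspace are hypothetical, the code is written for Python 3, and returning False for out-of-range input (rather than raising an error) is just one possible design choice.

```python
# Highest valid Unicode code point (name chosen for this sketch).
MAX_CODEPOINT = 0x10FFFF


def in_unicode_range(code):
    """Return True if code is an integer inside the Unicode code space."""
    return 0 <= code <= MAX_CODEPOINT


def checked_isspace(code):
    """Hypothetical guarded wrapper around a whitespace lookup.

    Out-of-range values are reported as "not whitespace" instead of
    being passed on to a table lookup that could index out of bounds.
    """
    if not in_unicode_range(code):
        return False
    # Stand-in for the database lookup: Python 3's own str.isspace().
    return chr(code).isspace()


print(checked_isspace(0x20))        # ordinary space character
print(checked_isspace(0x64636261))  # the out-of-range value from this issue
```

Raising ValueError instead, as unicodedata.name() does in the session above, would be an equally reasonable choice for the out-of-range branch.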