PyPy should set `errors="surrogatepass"` when decoding numpy arrays to unicode scalars

Create issue
Issue #3165 new
Eric Wieser created an issue

See https://github.com/numpy/numpy/issues/15363, where the bug was reported in numpy.

The fix in numpy was to replace calls to PyUnicode_DecodeUTF32 with calls to PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, ...).

However, it seems that PyPI does not use our code there, and has their own implementation. Assuming PyUnicode_FromKindAndData is not available, the equivalent python code would be some_bytes.decode('utf-32-le', errors='surrogatepass').

Comments (9)

  1. mattip

    Not clear what you mean by “does not use our code there”. This is a call to a c-api function, which PyPy implements via an emulation layer. In particular, this is a call to PyUnicode_FromKindAndData(PyUnicode_4BYTE_KIND, ...), which goes to unicodehelper.str_decode_utf_32(s, errors, …) where errors by default is ‘strict’. The cpython code does not allow decoding errors at all (it assumes the UCS4 unicode string is valid), so I guess the PyPy implementation should as well.

  2. Eric Wieser reporter

    I must have been looking at the wrong branch, I wasn’t able to find that code anywhere. Your diagnosis looks correct to me, thanks.

  3. Eric Wieser reporter

    Looking at that code, the 1-byte kind looks ok, but 2-byte and 4-byte kinds are wrong for PEP383 surrogate escapes (and maybe different for illegal characters beyond maxunicode)

  4. mattip

    Truthfully, this is what scares me about using hypothesis for NumPy tests. This time it is PyPy, but we will run into cases where NumPy tests expose dark corner cases in CPython that, unless they cause segfaults, should just remain in the dark until actual users complain. If someone is creating illegal characters beyond maxunicode, they probably do not really mean to index them as scalars.

  5. Eric Wieser reporter

    Right, but there’s a real use-case here of the PEP383 characters that can enter an application from sys.argv. Agreed that hypothesis can explore the wrong dark corners though.

  6. mattip

    Should be fixed in c86c14b9e40b, which will be available as tomorrow’s nightly build, which is used by the NumPy CI test. Thanks for the report.

  7. Log in to comment