Fix for SL-13073, UnicodeDecodeError when formatting llsd strings containing non-ascii characters to xml in python 2.

Merged pull request #3 · 36cffef · 2020-04-29

Description

I believe this regression was caused by a change in the llsd XML formatter that made both str and unicode types be encoded to UTF-8 in Python 2.
Compare the xml_esc() method in the current llsd.py to the old version of llsd.py.

In older versions of llsd, only unicode types were encoded to UTF-8. encode() is only meant to be used on the unicode type, and decode() only on the str type. This means that when we call encode() on a str in xml_esc(), it is first converted to unicode through an implicit decode() using the default encoding ("ascii"), and then encoded back to a str using the requested encoder. Any str that cannot be decoded with the ascii codec therefore raises a UnicodeDecodeError.
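The failure mode and the shape of the fix can be sketched in Python 3 terms. The xml_esc() below is a simplified stand-in for illustration, not the library's actual implementation:

```python
# Python 3 reproduction of the error mechanism: Python 2's implicit
# str -> unicode conversion used the ASCII codec, which fails on any
# byte >= 0x80 -- exactly what happened with non-ASCII llsd strings.
raw = 'café'.encode('utf-8')          # b'caf\xc3\xa9'
try:
    raw.decode('ascii')               # what Python 2 did implicitly
except UnicodeDecodeError:
    pass                              # the SL-13073 failure

# Simplified stand-in for xml_esc(): encode only *text*, never byte
# strings, so no implicit ASCII decode can occur.
def xml_esc(v):
    if isinstance(v, str):            # unicode text
        v = v.encode('utf-8')         # explicit, lossless encode
    # escape & first so the later replacements aren't double-escaped
    return v.replace(b'&', b'&amp;').replace(b'<', b'&lt;').replace(b'>', b'&gt;')
```

Escaping `&` before `<` and `>` matters: doing it last would corrupt the already-inserted entities.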

I've added enough tests to the Regression() grouping in llsd_test.py to feel confident that this particular issue is fixed, but we could still be exposed to other issues, since we aren't testing XML formatting against a wider variety of unusual LLSD inputs. All tests currently pass on Python 2.7 and 3.4.

In the regression test, I had to work around the fact that in Python 3 a bytes literal b"string" is formatted by llsd as a "binary" type, whereas in Python 2 b"string" is treated as a string. Hopefully this is intentional and I didn't discover another bug?
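The version-dependent behavior follows from the type system itself, as this sketch shows. llsd_tag_for() is a hypothetical helper, not the real llsd API:

```python
# In Python 3, b"string" is bytes, a type distinct from str, so a
# type-driven formatter maps it to LLSD "binary". In Python 2,
# b"string" IS str, so the same dispatch yields "string".
def llsd_tag_for(value):  # hypothetical helper, not the real llsd API
    if isinstance(value, bytes) and not isinstance(value, str):
        return 'binary'   # Python 3 only: bytes is not str there
    if isinstance(value, str):
        return 'string'
    raise TypeError(type(value))

print(llsd_tag_for(b'string'))  # 'binary' on Python 3, 'string' on Python 2
```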

I was also struggling to find a way to write the test in syntax that both Python 2 and 3 accept while still covering all the cases, which is why, for example, I write out a string byte by byte and then decode it to unicode.
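The byte-by-byte trick can be illustrated like this (the 'café' value is just an example, not the actual test data): a bytes literal with each byte spelled out, followed by an explicit decode, parses identically under both interpreters.

```python
# Valid under both Python 2.7 and Python 3: escape each byte in a bytes
# literal, then decode explicitly (to unicode on py2, to str on py3),
# avoiding any non-ASCII source literal that the two versions parse
# differently.
text = b'\x63\x61\x66\xc3\xa9'.decode('utf-8')
print(text)  # café
```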
