Erik van Zijst / dogslow

Pull requests

#10: fixed string concatenation (Declined)

  olevinsky

I get this error when the stack includes UTF-8 encoded symbols:

Dogslow failed

Type: UnicodeDecodeError
Value: 'ascii' codec can't decode byte 0xd0 in position 28432: ordinal not in range(128)
Location: dogslow/__init__.py in peek, line 134

Stacktrace (most recent call last):

  File "dogslow/__init__.py", line 134, in peek
    output += stack(frame, with_locals=True)

Fix issue: https://bitbucket.org/evzijst/dogslow/issue/3/unicode-safety
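The failure is easy to reproduce in isolation. The sketch below is not dogslow's actual code, just a standalone illustration of why the byte 0xd0 trips up the default codec:

```python
# Minimal reproduction sketch (not dogslow's code): a stack report that
# contains UTF-8 bytes such as 0xd0 (the lead byte of many Cyrillic
# characters) cannot be decoded with Python's default 'ascii' codec.
raw = "Привет".encode("utf-8")
try:
    raw.decode("ascii")
except UnicodeDecodeError as exc:
    print(exc.encoding, hex(raw[exc.start]))  # -> ascii 0xd0
```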


Comments (3)

  Erik van Zijst (repo owner)

    I don't think we should simply assume the stack information is encoded in utf-8. The file names that appear in the backtrace, for instance, will be the raw byte strings as they are stored on the file system, and most file systems do not enforce any particular encoding.

    That means a file name in a directory block could be written in utf-16 (and so contain NULL bytes); decoding it as utf-8 on line 86 would then fail. The same goes for many machines in Asian countries.

    Maybe a better approach would be to decode the string based on the system's locale?
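A locale-based decode could be sketched like this (`decode_with_locale` is a hypothetical helper for illustration, not part of dogslow):

```python
import locale

# Hypothetical helper: decode raw bytes using the system's preferred
# encoding rather than hard-coding utf-8. The 'replace' error handler
# keeps it from raising when the guess is wrong, at the cost of
# mangled characters in the output.
def decode_with_locale(raw: bytes) -> str:
    encoding = locale.getpreferredencoding(False)
    return raw.decode(encoding, "replace")

print(decode_with_locale(b"dogslow/__init__.py"))
```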

      Erik van Zijst (repo owner)

      I'm afraid it's actually more tricky than that.

      The stack function combines data from different sources:

      • the line of source code from the source file (line 51)
      • the module's filename (line 47)
      • plain ASCII (line 64)
      • whatever repr on the stack's variables returns (line 69)

      Since none of those produce unicode objects, and since the filename on disk may use a different encoding than the module's source file, we could be screwing ourselves over: we concatenate everything into a single raw byte string, and a later attempt to decode that byte string to unicode can fail for any single encoding we try.

      I don't know what the right way to handle this would be, but since each of the sources could be using a different character encoding, we might try to decode all the individual constituents independently, prior to concatenating them.
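A hedged sketch of that per-part approach (`decode_part` is hypothetical, and the encoding list is just an example):

```python
# Hypothetical sketch: decode each constituent of the stack report on
# its own, trying a few likely encodings, and only join the results once
# they are all unicode. latin-1 accepts any byte sequence, so the loop
# always produces *some* text, though not necessarily the right one.
def decode_part(raw, encodings=("utf-8", "latin-1")):
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    return raw.decode("utf-8", "replace")  # only reached without latin-1

parts = [b"dogslow/__init__.py", "x = '\u041f'".encode("utf-8")]
report = " ".join(decode_part(p) for p in parts)
print(report)
```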

      Since we also don't know what encoding the individual parts might be in, we should maybe use something like chardet.

      This is all becoming fairly heavy-handed, and it still offers no guarantee of success. chardet cannot always unambiguously detect the right encoding for short strings, and there may even be byte patterns that are valid in multiple encodings yet decode to different unicode code points. Likewise, it's possible that a string cannot be decoded at all.
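That ambiguity is real. A standalone illustration:

```python
# The same bytes can be valid in more than one encoding and decode to
# different text: the utf-8 encoding of the Cyrillic 'П' also decodes
# cleanly as latin-1, just to different code points.
raw = "\u041f".encode("utf-8")   # b'\xd0\x9f'
print(raw.decode("utf-8"))       # the single character 'П'
print(raw.decode("latin-1"))     # 'Ð' followed by a control character
```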

      A crude, quick workaround for now might be to force a certain encoding, replacing any byte that couldn't be decoded:

      s.decode('utf-8', 'replace')

      But of course that wouldn't make the output any more readable, or even guaranteed to be correct.
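To show what the 'replace' handler does in practice (a standalone illustration, not dogslow's code):

```python
# 'replace' substitutes U+FFFD (the replacement character) for byte
# sequences that are invalid in the chosen encoding, so decoding never
# raises -- but the original bytes are lost.
garbled = b"caf\xe9"                        # latin-1 bytes, invalid as utf-8
print(garbled.decode("utf-8", "replace"))   # -> 'caf\ufffd'
```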