Issue #12 wontfix

UnicodeDecodeError: 'ascii' codec can't decode byte

Anonymous created an issue

replaygain fails on every file whose name contains characters outside the ASCII range.

File used in this test case: magu?ro.mp3 (The ? represents 0x82)

$ unset LC_ALL
$ replaygain magu�ro.mp3
Traceback (most recent call last):
  File "/usr/lib/python-exec/python2.7/replaygain", line 7, in <module>
    replaygain()
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 213, in replaygain
    opts.mp3_format)
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 67, in do_gain
    files = [un(filename, sys.getfilesystemencoding()) for filename in files]
  File "/usr/lib/python2.7/site-packages/rgain/script/init.py", line 37, in un
    return arg.decode(encoding)
UnicodeDecodeError: 'ascii' codec can't decode byte 0x82 in position 4: ordinal not in range(128)

$ export LC_ALL=en_US.UTF-8
$ replaygain magu�ro.mp3
Traceback (most recent call last):
  File "/usr/lib/python-exec/python2.7/replaygain", line 7, in <module>
    replaygain()
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 213, in replaygain
    opts.mp3_format)
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 67, in do_gain
    files = [un(filename, sys.getfilesystemencoding()) for filename in files]
  File "/usr/lib/python2.7/site-packages/rgain/script/init.py", line 37, in un
    return arg.decode(encoding)
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0x82 in position 4: invalid start byte

And the last test is even stranger, because position 99 doesn't make any sense.

$ export LC_ALL=en_US
$ replaygain magu�ro.mp3
Checking for Replay Gain information ...
magu�ro.mp3: none
Calculating Replay Gain information ...
magu�ro.mp3:
Traceback (most recent call last):
  File "/usr/lib/python-exec/python2.7/replaygain", line 7, in <module>
    replaygain()
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 213, in replaygain
    opts.mp3_format)
  File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 119, in do_gain
    raise Error(u"Error while calculating gain - %s" % exc)
  File "/usr/lib/python2.7/site-packages/rgain/init.py", line 51, in unicode
    return u"GST error: %s (%s)" % (self.message, self.debug)
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 99: ordinal not in range(128)

BTW: I installed rgain today for the very first time, for evaluation. The file above was the first file, more or less randomly chosen by collectiongain. So, by coincidence, rgain failed on the first file I ever tried.

cu John

Comments (4)

  1. Felix Krull repo owner

    That's rather strange; non-ASCII file names work fine for me (on Ubuntu with UTF-8 everywhere). I can produce similar errors, but only by lying about what my system encoding is. What error message do you get if you don't change LC_ALL at all?

    To be honest, the only thing I can think of here is that you have inconsistent or plain wrong locale/encoding settings. Whatever your system encoding is, are you sure it matches the encoding of that file name?

    What operating system (incl. distribution and version) are you using; what is the output of locale and python -c 'import sys; print sys.getfilesystemencoding()' in a fresh shell?

    I've made some changes to Unicode handling in the code; it probably won't fix your problem, but please re-try the final example (LC_ALL=en_US) in particular since it seems a Unicode issue is masking another exception there.
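
The first two tracebacks in the report can be reproduced without rgain. The following is only a minimal Python 3 sketch (the thread itself runs Python 2.7); `decode_filename` and `describe_decode` are hypothetical names standing in for the strict `arg.decode(encoding)` call shown in the traceback, not rgain's actual code.

```python
# Bytes of the reported file name: "magu", 0x82, "ro.mp3".
RAW_NAME = b"magu\x82ro.mp3"


def decode_filename(raw_name: bytes, encoding: str) -> str:
    """Strict decode, mirroring the `arg.decode(encoding)` call in the traceback."""
    return raw_name.decode(encoding)


def describe_decode(raw_name: bytes, encoding: str) -> str:
    """Report whether the name decodes and, if not, where and why it fails."""
    try:
        decode_filename(raw_name, encoding)
    except UnicodeDecodeError as exc:
        return "%s fails at position %d: %s" % (encoding, exc.start, exc.reason)
    return "%s decodes without error" % encoding


if __name__ == "__main__":
    # ASCII and UTF-8 both reject the byte 0x82; Latin-1 accepts any byte.
    for candidate in ("ascii", "utf-8", "latin-1"):
        print(describe_decode(RAW_NAME, candidate))
```

Both failures point at position 4, the 0x82 byte, which suggests the problem lies in the file name's bytes rather than in the locale setup.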

  2. John Black

    Hi Felix

    It's a Gentoo system. To make sure that it's not a problem with some old library, I recompiled/re-emerged everything related to Python. To be honest, I updated more or less every package, so it took some time to answer again.

    And the result? Exactly the same as before. And the version at tip from 09.01. does the same, too, as you guessed.

    $ locale
    LANG=en_US.UTF-8
    LC_CTYPE="en_US.UTF-8"
    LC_NUMERIC="en_US.UTF-8"
    LC_TIME="en_US.UTF-8"
    LC_COLLATE="en_US.UTF-8"
    LC_MONETARY="en_US.UTF-8"
    LC_MESSAGES="en_US.UTF-8"
    LC_PAPER="en_US.UTF-8"
    LC_NAME="en_US.UTF-8"
    LC_ADDRESS="en_US.UTF-8"
    LC_TELEPHONE="en_US.UTF-8"
    LC_MEASUREMENT="en_US.UTF-8"
    LC_IDENTIFICATION="en_US.UTF-8"
    LC_ALL=en_US.UTF-8

    $ python -c 'import sys; print sys.getfilesystemencoding()'
    UTF-8

    > Whatever your system encoding is, are you sure it matches the encoding of that file name?

    I'm absolutely sure that it does NOT match.

    Replaygain chokes on every file name that is not encoded in UTF-8 or plain ASCII. So it fails on everything encoded in ISO 8859. And the first file I reported isn't even ISO 8859. I have no idea about its encoding; maybe some DOS/Windows encoding.

    Remember? magu?ro.mp3

    In Hex: 6d 61 67 75 82 72 6f 2e 6d 70 33

    0x82 isn't defined in ISO 8859.

    Let's rename the file to ISO 8859-1: 1ü.mp3

    In Hex: 31 fc 2e 6d 70 33
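
To see what these two byte sequences look like under different codecs, they can be decoded directly from the hex dumps. This is a small Python 3 sketch; `decode_attempts` is a hypothetical helper, and `cp437` is included only as a guess at a "DOS encoding", not something the thread confirms.

```python
# File names exactly as given in the hex dumps above.
RAW_NAMES = {
    "original": bytes.fromhex("6d 61 67 75 82 72 6f 2e 6d 70 33"),
    "renamed": bytes.fromhex("31 fc 2e 6d 70 33"),
}

# cp437 is an assumption: one plausible DOS code page, chosen for illustration.
CANDIDATE_CODECS = ("utf-8", "latin-1", "cp437")


def decode_attempts(raw_name: bytes) -> dict:
    """Map each candidate codec to the decoded name, or None if decoding fails."""
    results = {}
    for codec in CANDIDATE_CODECS:
        try:
            results[codec] = raw_name.decode(codec)
        except UnicodeDecodeError:
            results[codec] = None
    return results


if __name__ == "__main__":
    for label, raw_name in RAW_NAMES.items():
        for codec, text in decode_attempts(raw_name).items():
            # ascii() keeps control characters such as U+0082 visible.
            print(label, codec, "undecodable" if text is None else ascii(text))
```

Neither name is valid UTF-8. Latin-1 decodes both, but maps 0x82 to the control character U+0082 rather than to a printable letter.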

    $ LC_ALL=en_US.UTF-8
    $ replaygain 1ü.mp3
    Traceback (most recent call last):
      File "/usr/lib/python-exec/python2.7/replaygain", line 7, in <module>
        replaygain()
      File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 213, in replaygain
        opts.mp3_format)
      File "/usr/lib/python2.7/site-packages/rgain/script/replaygain.py", line 67, in do_gain
        files = [un(filename, getfilesystemencoding()) for filename in files]
      File "/usr/lib/python2.7/site-packages/rgain/script/init.py", line 52, in un
        return arg.decode(encoding)
      File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xfc in position 1: invalid start byte

    $ LC_ALL=en_US (ISO 8859-1 or ISO Latin 1)
    $ replaygain 1ü.mp3
    Checking for Replay Gain information ...
    1�.mp3: none
    Calculating Replay Gain information ...
    1�.mp3:
    Error while calculating gain - GST error: Resource not found. (gstfilesrc.c(508): gst_file_src_start (): /GstPipeline:pipeline0/GstFileSrc:src:
    No such file "1�.mp3")

    So replaygain chokes on files even when the encoding is okay.

    I wonder: As replaygain must not change the filename, why does it need to understand the encoding? Why not just interpret it as bytes and use them that way?

    BTW: UTF-8 encoded filenames work very well.

    John

  3. Felix Krull repo owner

    Oh, I see, your system setup is correct, you just have weird file names.

    I'll be putting my foot down here: rgain doesn't support file names that don't match the current system encoding (i.e. are not round-trip safe with it).

    > I wonder: As replaygain must not change the filename, why does it need to understand the encoding? Why not just interpret it as bytes and use them that way?

    True, looking back at it, I could probably rewrite the file name handling to treat file names as byte sequences, up until display. Thing is, I don't think it's worth the effort. Conceptually, file names are text and IMO should be treated as such. I feel it's reasonable to not support file names that basically are opaque byte sequences instead of text.
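
For illustration, the "bytes up until display" approach could look roughly like the following. This is a Python 3 sketch under stated assumptions, not rgain's code (which targets Python 2.7): `list_raw_names`, `display_name` and `label_files` are hypothetical names, and the only library behaviour relied on is that `os.listdir` returns `bytes` names when given a `bytes` path.

```python
import os


def list_raw_names(directory: bytes) -> list:
    """List directory entries as raw bytes; no decoding happens here."""
    return os.listdir(directory)


def display_name(raw_name: bytes, encoding: str = "utf-8") -> str:
    """Decode for display only; undecodable bytes become U+FFFD."""
    return raw_name.decode(encoding, errors="replace")


def label_files(raw_names, encoding: str = "utf-8") -> list:
    """Pair each untouched byte name with a printable label for messages."""
    return [(raw_name, display_name(raw_name, encoding)) for raw_name in raw_names]


if __name__ == "__main__":
    for raw_name, label in label_files([b"magu\x82ro.mp3", b"1\xfc.mp3"]):
        # The raw bytes are what would be handed to the file system;
        # the label is only ever shown to the user.
        print(repr(raw_name), "->", label)
```

The trade-off is the one described above: every function that touches a file name has to be consistent about whether it holds bytes or text, which is where the rewrite effort would go.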

    $ LC_ALL=en_US (ISO 8859-1 or ISO Latin 1)
    $ replaygain 1ü.mp3
    Checking for Replay Gain information ...
    1�.mp3: none
    Calculating Replay Gain information ...
    1�.mp3:
    Error while calculating gain - GST error: Resource not found. (gstfilesrc.c(508): gst_file_src_start (): /GstPipeline:pipeline0/GstFileSrc:src:
    No such file "1�.mp3")
    

    Alright, so I guess this is a supported use case then. There was actually a bug there; I was encoding all file names passed to GStreamer as UTF-8 since I assumed that's what it wanted, being GLib-based, but it seems it's happy with any byte sequence. I've fixed that in tip.
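
The effect of that bug can be sketched in a few lines of Python 3: decoding the on-disk name with the system encoding and re-encoding it as UTF-8 produces different bytes, so the file is no longer found. The function names are hypothetical, and `DEBUG_PREFIX` is an assumption that the debug text for the first file had the same shape as the one quoted above. Under that assumption the re-encoded 0x82 (now 0xc2 0x82) lands at byte offset 99, which is consistent with the "position 99" in the third traceback of the report.

```python
# On-disk bytes of the two file names from the thread.
RENAMED = bytes.fromhex("31 fc 2e 6d 70 33")  # "1ü.mp3" in ISO 8859-1
ORIGINAL = b"magu\x82ro.mp3"

# Assumption: debug text shaped like the GStreamer message quoted above.
DEBUG_PREFIX = (
    b"gstfilesrc.c(508): gst_file_src_start (): "
    b"/GstPipeline:pipeline0/GstFileSrc:src:\n"
    b'No such file "'
)


def reencode_as_utf8(raw_name: bytes, system_encoding: str) -> bytes:
    """Decode with the system encoding, then encode as UTF-8 (the bug described)."""
    return raw_name.decode(system_encoding).encode("utf-8")


def keep_system_encoding(raw_name: bytes, system_encoding: str) -> bytes:
    """Decode and encode with the same codec, which preserves the bytes."""
    return raw_name.decode(system_encoding).encode(system_encoding)


def first_non_ascii(message: bytes) -> int:
    """Offset of the first byte an ASCII decoder would reject."""
    return next(i for i, byte in enumerate(message) if byte >= 0x80)


if __name__ == "__main__":
    print(reencode_as_utf8(RENAMED, "latin-1"))      # differs from the on-disk name
    print(keep_system_encoding(RENAMED, "latin-1"))  # identical to the on-disk name
    message = DEBUG_PREFIX + reencode_as_utf8(ORIGINAL, "latin-1") + b'")'
    print(first_non_ascii(message))
```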

    (Bonus rambling: GLib assumes all file names are UTF-8; there's now a G_FILENAME_ENCODING environment variable to override this behaviour, but originally, one had to set G_BROKEN_FILENAMES. It's rather blunt, but in retrospect, I appreciate the sentiment. There's more detail in the GLib docs.)

    In conclusion: if you want to use non-UTF-8 file names, you'll need to use tip (68582f7, I'm not going to make a release just for that); if you want to use file names that don't match your system encoding, you'll need to use tip and set LC_ALL to something that matches the file name encoding. You can probably squeeze by using latin1 even for file names where you don't know the encoding since it should properly round-trip any byte sequence.
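
The round-trip claim for latin1 can be checked directly. A minimal Python 3 sketch, with `round_trips` as a hypothetical helper: Latin-1 maps each of the 256 byte values to exactly one code point, so decoding and re-encoding returns the original bytes, while stricter codecs reject some inputs.

```python
# Every possible byte value, 0x00 through 0xff.
ALL_BYTES = bytes(range(256))


def round_trips(raw: bytes, encoding: str) -> bool:
    """True if decoding and re-encoding with `encoding` gives back `raw`."""
    try:
        return raw.decode(encoding).encode(encoding) == raw
    except UnicodeError:
        # The codec cannot represent some byte, so no round trip is possible.
        return False


if __name__ == "__main__":
    for codec in ("latin-1", "utf-8", "ascii", "cp1252"):
        print(codec, round_trips(ALL_BYTES, codec))
```

The decoded text may still be wrong for display (0x82 becomes a control character, for example), but the bytes handed back to the file system are unchanged.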

    I guess the short version is "won't fix"; if you insist on your file names, you'll have to resort to some locale and system encoding trickery.
