Issue #14 new

Handle file names that don't match the system encoding

Anonymous created an issue

On non-ASCII filenames, a UnicodeEncodeError exception is thrown by os.walk in collect_files() from collectiongain.py.

I don't know Python at all, but after running some tests, it looks like the culprit is: music_dir = un(music_dir, sys.getfilesystemencoding()), where the encoding is "UTF-8" in my case on Linux. The filesystem is NFS.

Comments (8)

  1. Felix Krull repo owner

    Oh, encodings with Python 2 and Unix, such a constant source of joy. Could you please post the entire error output?

    Just from looking at it however, this somewhat reminds me of another report from a couple months ago, issue #12. The issue there was with file names that didn't match the system encoding (or rather, that weren't decodable in the system encoding). With that in mind, I have a question: I guess you ran collectiongain from some terminal window; did you type that path to your music entirely manually or did you use Tab completion? And whichever you did, can you try the other way and see if it behaves differently? (I'm asking because that's actually relevant to the encoding, see http://stackoverflow.com/questions/5408730/what-is-the-encoding-of-argv.)

    In any case, my suspicion is that the relevant paths on your NFS file system are not proper UTF-8 (i.e. not matching your system encoding), for whatever reason (as with #12).

  2. Bruno Jacquet

    Hi,

    OP here.

    Collecting files ...
    Traceback (most recent call last):
      File "/usr/bin/collectiongain", line 7, in <module>
        collectiongain()
      File "/usr/lib/python2.7/site-packages/rgain/script/collectiongain.py", line 342, in collectiongain
        opts.mp3_format, opts.ignore_cache, opts.jobs)
      File "/usr/lib/python2.7/site-packages/rgain/script/collectiongain.py", line 276, in do_collectiongain
        rgio.BaseFormatsMap(mp3_format).is_supported_format)
      File "/usr/lib/python2.7/site-packages/rgain/script/collectiongain.py", line 105, in collect_files
        for dirpath, dirnames, filenames in os.walk(music_dir):
      File "/usr/lib/python2.7/os.py", line 294, in walk
        for x in walk(new_path, topdown, onerror, followlinks):
      File "/usr/lib/python2.7/os.py", line 294, in walk
        for x in walk(new_path, topdown, onerror, followlinks):
      File "/usr/lib/python2.7/os.py", line 284, in walk
        if isdir(join(top, name)):
      File "/usr/lib/python2.7/posixpath.py", line 80, in join
        path += '/' + b
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xda in position 13: ordinal not in range(128)
    

    Using tab completion or typing in the whole path makes no difference. I do indeed have some files with characters that will not convert correctly to ASCII. But is converting to ASCII really mandatory here? It also happens on a local ext4 mountpoint.

    $ locale
    LANG=fr_FR.UTF-8
    LC_CTYPE="fr_FR.UTF-8"
    LC_NUMERIC="fr_FR.UTF-8"
    LC_TIME="fr_FR.UTF-8"
    LC_COLLATE="fr_FR.UTF-8"
    LC_MONETARY="fr_FR.UTF-8"
    LC_MESSAGES="fr_FR.UTF-8"
    LC_PAPER="fr_FR.UTF-8"
    LC_NAME="fr_FR.UTF-8"
    LC_ADDRESS="fr_FR.UTF-8"
    LC_TELEPHONE="fr_FR.UTF-8"
    LC_MEASUREMENT="fr_FR.UTF-8"
    LC_IDENTIFICATION="fr_FR.UTF-8"
    LC_ALL=
    
  3. Felix Krull repo owner

    This error message is just a symptom of an earlier problem: Your file names don't need to be ASCII-compatible, but they do need to be decodable as UTF-8, because that's your system encoding and collectiongain (and replaygain) require that your file names are in your system's character encoding.

    I did some checks and I think this error can happen if you have a file (or several) with a name that's not proper UTF-8. Guessing a bit, the usual encoding for Western European locales back in the olden times was latin1; maybe you still have files from before the UTF-8 transition with old latin1 file names? (For reference, the first UTF-8 Debian release was Etch in 2007.)
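
    To illustrate the diagnosis, here is a minimal sketch (written for Python 3, not code from rgain). It shows that a name stored as latin1 bytes generally fails to decode as UTF-8. The sample file name and the helper decodable() are made up for illustration; the byte 0xda is the one reported in the traceback.

```python
# Hypothetical file name, encoded the way a latin1 system would store it.
name = u"\u00daltimo.mp3"       # "Último.mp3"
raw = name.encode("latin-1")     # b'\xdaltimo.mp3'


def decodable(raw_name, encoding="utf-8"):
    """Return True if the raw bytes form valid text in the given encoding."""
    try:
        raw_name.decode(encoding)
        return True
    except UnicodeDecodeError:
        return False


print(decodable(raw))                    # latin1 bytes read as UTF-8
print(decodable(name.encode("utf-8")))   # the same name stored as UTF-8
print(raw.decode("latin-1") == name)     # decoding with the original codec
```

    In UTF-8, 0xda announces a two-byte sequence, so the plain ASCII letter that follows it makes the name invalid.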

    Basically, you have two options:

    1) The right way: convert your file names to UTF-8. The convmv utility can do that (Manpage, should be easily available on most distributions); I've only quickly checked that, but something like this should convert your file names to UTF-8:

    $ convmv -f latin1 -t utf8 -r --notest <path to your music directory>
    

    (You should inspect the command's output to make sure everything looks ok.)

    2) As described in issue #12, you can temporarily switch your system character encoding (which still needs the latest version from the repository), but I don't recommend that.

    Ultimately, unless I mis-diagnosed your case, this isn't strictly a bug in collectiongain. Still, giving in to the Posix mis-feature seems more appealing by the minute.

  4. Duane Griffin

    I see the same error and your diagnosis seems correct, as does the proposed solution.

    Note that it is easy to get into this situation if, like me, you download files from somewhere like emusic in Windows, then copy them onto your Linux filesystem directly from a mounted NTFS drive.

    Given the terrible mess that the typical large-ish music collection is in, I wouldn't be surprised if this sort of thing is quite common. It might be worth catching the exception and giving a nice error message.

  5. Felix Krull repo owner

    I see the problem. I might add a somewhat nicer error message in the short term.

    I think I'll keep this bug open for now though (and change the title). IMO it's a horrible aspect of Unix that file names are simply arbitrary sequences of bytes, but I have to agree: it's also a thing people have in practice, accidentally or otherwise. I'll probably switch to treating all file names as plain byte sequences at some point; that should make all problems with mis-encoded file names go away (yes, that's the opposite of what I wrote in #12). I don't know when I'll have the time for that though.
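
    As a rough sketch of what treating file names as plain byte sequences could look like (Python 3, not the actual rgain implementation; the function names collect_files_bytes and display_name are hypothetical): os.walk returns byte paths when it is given a byte path, so no name has to be decoded until it is shown to the user.

```python
import os


def collect_files_bytes(music_dir):
    """Walk music_dir using byte paths so no file name needs decoding."""
    root = os.fsencode(music_dir)  # str -> bytes; bytes pass through unchanged
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        for filename in filenames:
            found.append(os.path.join(dirpath, filename))
    return found


def display_name(raw_path, encoding="utf-8"):
    """Decode for display only; undecodable bytes become U+FFFD."""
    return raw_path.decode(encoding, errors="replace")
```

    The byte paths can be passed straight back to open() and the os functions; only display_name() ever decodes, and it cannot raise on a mis-encoded name.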

  6. Bruno Jacquet

    Thanks for the heads-up, I'll give it a try.

    I agree with Duane, this may not be a bug in collectiongain, but when this happens we have no output about what file caused the error, we get an ugly stack trace and finding the problematic file is very difficult.
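
    One way to locate the offending files, sketched under the assumption of Python 3 (the helpers undecodable_names and report_bad_names are hypothetical, not part of rgain): walk the tree in bytes mode and list every name that does not decode in the expected encoding.

```python
import os
import sys


def undecodable_names(raw_names, encoding):
    """Return the byte strings in raw_names that are not valid in encoding."""
    bad = []
    for raw in raw_names:
        try:
            raw.decode(encoding)
        except UnicodeDecodeError:
            bad.append(raw)
    return bad


def report_bad_names(music_dir, encoding=None):
    """Yield full byte paths whose directory or file name fails to decode."""
    encoding = encoding or sys.getfilesystemencoding()
    for dirpath, dirnames, filenames in os.walk(os.fsencode(music_dir)):
        for raw in undecodable_names(dirnames + filenames, encoding):
            yield os.path.join(dirpath, raw)
```

    Printing repr(path) for each path yielded by report_bad_names("/path/to/music") shows the raw bytes of every name that would need renaming.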

  7. Simon Chopin

    I'm not sure the diagnosis above is correct. Sadly, when decoding a string, Python 2 doesn't really care about your locale and will consider it as ASCII unless told otherwise. To get the proper local encoding name, one has to use locale.nl_langinfo(locale.CODESET) IIRC.
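
    The encodings involved can be compared with a short sketch like the following (the function name encoding_report is hypothetical). It is written so that it also runs on Python 3, where sys.getdefaultencoding() returns "utf-8"; on Python 2 that call returns "ascii", which is the codec used for implicit conversions such as the one in the traceback.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python may use for string conversions."""
    try:
        locale.setlocale(locale.LC_ALL, "")  # adopt the user's locale settings
    except locale.Error:
        pass  # unsupported locale in the environment; keep the "C" locale
    report = {
        "default": sys.getdefaultencoding(),
        "filesystem": sys.getfilesystemencoding(),
        "preferred": locale.getpreferredencoding(False),
    }
    if hasattr(locale, "nl_langinfo"):  # not available on every platform
        report["codeset"] = locale.nl_langinfo(locale.CODESET)
    return report


for key, value in sorted(encoding_report().items()):
    print("%s: %s" % (key, value))
```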

    In any case, a better error output would still be nice of course :-)

    Cheers, Simon
