fuse-zip displays file names with invalid UTF-8 sequences

Issue #68 open
François Degros created an issue

When mounting some ZIP archives containing files whose names have non-ASCII characters, fuse-zip generates invalid UTF-8 sequences.

Example with https://www.hueber.de/shared/audio/schritte-neu/011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip

$ fuse-zip -V
fuse-zip version: 0.7.1
libzip version: 1.5.2
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19

$ echo $LANG
en_US.utf8

$ fuse-zip -r '011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip' mnt

$ ls -l mnt
total 0
drwxrwxr-x 60 root root 0 Feb  2  2017 '011081 Schritte Neu '$'\232''bungsgrammatik V2'

$ ls mnt | od -c -tx1
0000000   0   1   1   0   8   1       S   c   h   r   i   t   t   e    
         30  31  31  30  38  31  20  53  63  68  72  69  74  74  65  20
0000020   N   e   u     232   b   u   n   g   s   g   r   a   m   m   a
         4e  65  75  20  9a  62  75  6e  67  73  67  72  61  6d  6d  61
0000040   t   i   k       V   2  \n
         74  69  6b  20  56  32  0a
0000047

Note that the U with umlaut (Ü) is stored as the single byte <0x9A> (octal 232 in the od output above), which is an invalid UTF-8 sequence. 0x9A is the CP437 code for Ü. The correct UTF-8 sequence would be <0xC3 0x9C>.
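As a quick check of the byte values, here is a minimal Python sketch. It assumes the archive stores its names in CP437, which the issue itself does not state; the variable names are illustrative only.

```python
# Byte seen in the directory listing (octal 232 in the od output).
raw = b"\x9a"

# A lone 0x9A is a UTF-8 continuation byte, so strict decoding fails.
try:
    raw.decode("utf-8")
    is_valid_utf8 = True
except UnicodeDecodeError:
    is_valid_utf8 = False

# Assumption: the name was written in CP437.
decoded = raw.decode("cp437")
utf8_bytes = decoded.encode("utf-8")

print(is_valid_utf8)       # False
print(decoded == "Ü")      # True
print(utf8_bytes.hex())    # c39c
```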

In fuse-zip’s code, calls to zip_get_name use the ZIP_FL_ENC_RAW flag. Changing this flag to ZIP_FL_ENC_GUESS fixes the issue. With the attached patch:

$ ls -l mnt
total 0
drwxrwxr-x 60 root root 0 Feb  2  2017 '011081 Schritte Neu Übungsgrammatik V2'

$ ls mnt | od -c -tx1
0000000   0   1   1   0   8   1       S   c   h   r   i   t   t   e    
         30  31  31  30  38  31  20  53  63  68  72  69  74  74  65  20
0000020   N   e   u     303 234   b   u   n   g   s   g   r   a   m   m
         4e  65  75  20  c3  9c  62  75  6e  67  73  67  72  61  6d  6d
0000040   a   t   i   k       V   2  \n
         61  74  69  6b  20  56  32  0a
0000050
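The effect of encoding guessing can be approximated in a few lines of Python. This is only a sketch of the idea (keep the name if it is already valid UTF-8, otherwise treat it as CP437), not libzip's actual implementation; `guess_decode` is a hypothetical helper.

```python
def guess_decode(raw: bytes) -> str:
    """Decode a ZIP entry name: UTF-8 if valid, otherwise assume CP437."""
    try:
        return raw.decode("utf-8")
    except UnicodeDecodeError:
        return raw.decode("cp437")


# Raw bytes as stored in the example archive (0x9A for the umlaut).
legacy = b"Neu \x9abungsgrammatik"
# The same name as a UTF-8-aware archiver would store it.
modern = "Neu Übungsgrammatik".encode("utf-8")

# Both inputs end up as the same text.
print(guess_decode(legacy) == guess_decode(modern))  # True
```

A fallback like this can misidentify names from other single-byte code pages, since almost any byte string decodes without error as CP437.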

Comments (5)

  1. Alexander Galanin repo owner

    Unfortunately, the ZIP file format has only two file name encodings: UTF-8 (bit 11 of the “general purpose bit flag” is set) and an unspecified encoding (bit 11 is not set). Many applications today, and even more applications from the 1990s and 2000s, do not care about file name encodings and store names as is, using the system default character set. For example, all DOS applications and the Windows NT built-in archiver use the CP866 charset for Cyrillic and do not save a UTF-8 name. This has resulted in a large set of legacy archives and archiver applications in the ex-USSR that use the single-byte CP866 encoding to store Cyrillic file names. So I want to support them.
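The bit 11 flag can be inspected with Python's standard zipfile module. The sketch below builds a small archive in memory (the entry names are made up for illustration) and reports which entries carry the UTF-8 flag. It relies on CPython's zipfile behaviour of setting the flag only for names that are not pure ASCII.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag

# Write two empty entries: one ASCII name, one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"")
    zf.writestr("Übung.txt", b"")

# Read the archive back and test bit 11 for each entry.
with zipfile.ZipFile(buf) as zf:
    flags = {
        info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
        for info in zf.infolist()
    }

print(flags["plain.txt"])   # False
print(flags["Übung.txt"])   # True
```

When the flag is not set, a reader has no reliable way to know the encoding, which is the situation described above for legacy CP866 archives.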

    fuse-zip supports the built-in FUSE module iconv, which converts file names to and from non-UTF-8 encodings. Try mounting the archive with the following command:

    fuse-zip -r -omodules=iconv,from_code=cp437 011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip mnt
    

    The negative side effect of using the iconv module is a double conversion of file names when ZIP_FL_ENC_GUESS and -omodules=iconv are used together. So I can’t use ZIP_FL_ENC_GUESS by default. Perhaps we should add an option that enables encoding guessing.
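The double conversion can be illustrated with a short Python sketch. It models the iconv layer as a plain CP437-to-UTF-8 re-encoding, which is an assumption about how the two steps combine rather than a trace of fuse-zip itself.

```python
name = "Übungsgrammatik"
expected = name.encode("utf-8")

# Raw names plus iconv: the CP437 bytes are converted to UTF-8 once.
raw_cp437 = name.encode("cp437")
single_converted = raw_cp437.decode("cp437").encode("utf-8")

# Guessing plus iconv: the name is already UTF-8, but the iconv layer
# (from_code=cp437) treats those bytes as CP437 and converts them again.
already_utf8 = name.encode("utf-8")
double_converted = already_utf8.decode("cp437").encode("utf-8")

print(single_converted == expected)   # True
print(double_converted == expected)   # False
```

In the second case each byte of the UTF-8 sequence for Ü is reinterpreted as a separate CP437 character, so the mounted name shows mojibake instead of the umlaut.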
