fuse-zip displays file names with invalid UTF-8 sequences
When mounting some ZIP archives containing files whose names have non-ASCII characters, fuse-zip generates invalid UTF-8 sequences.
Example with https://www.hueber.de/shared/audio/schritte-neu/011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip
$ fuse-zip -V
fuse-zip version: 0.7.1
libzip version: 1.5.2
FUSE library version: 2.9.9
fusermount version: 2.9.9
using FUSE kernel interface version 7.19
$ echo $LANG
en_US.utf8
$ fuse-zip -r '011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip' mnt
$ ls -l mnt
total 0
drwxrwxr-x 60 root root 0 Feb 2 2017 '011081 Schritte Neu '$'\232''bungsgrammatik V2'
$ ls mnt | od -c -tx1
0000000 0 1 1 0 8 1 S c h r i t t e
30 31 31 30 38 31 20 53 63 68 72 69 74 74 65 20
0000020 N e u 232 b u n g s g r a m m a
4e 65 75 20 9a 62 75 6e 67 73 67 72 61 6d 6d 61
0000040 t i k V 2 \n
74 69 6b 20 56 32 0a
0000047
Note that the U with umlaut (Ü) is represented by the single byte <0x9A> (octal 232 in the dump above), which is an invalid UTF-8 sequence. 0x9A is the code for Ü in CP437. The correct UTF-8 sequence would be <0xC3 0x9C>.
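The byte-level claim can be checked with a short Python sketch. It assumes the raw name stored in the archive is CP437-encoded, which matches the dump above but is not stated by the archive itself; `is_valid_utf8` is a hypothetical helper written for this illustration.

```python
# Raw directory name exactly as fuse-zip exposes it (see the od dump).
raw = b"011081 Schritte Neu \x9abungsgrammatik V2"


def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# A lone 0x9A is a continuation byte with no lead byte, so this is False.
print(is_valid_utf8(raw))

# Assumption: the name was written by a tool that used CP437.
name = raw.decode("cp437")
fixed = name.encode("utf-8")

# Hex dump of the re-encoded name; the 0x9A byte becomes the pair c3 9c.
print(fixed.hex(" "))
```
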
In fuse-zip’s code, calls to zip_get_name use the ZIP_FL_ENC_RAW flag. Changing this flag to ZIP_FL_ENC_GUESS fixes the issue. With the attached patch:
$ ls -l mnt
total 0
drwxrwxr-x 60 root root 0 Feb 2 2017 '011081 Schritte Neu Übungsgrammatik V2'
$ ls mnt | od -c -tx1
0000000 0 1 1 0 8 1 S c h r i t t e
30 31 31 30 38 31 20 53 63 68 72 69 74 74 65 20
0000020 N e u 303 234 b u n g s g r a m m
4e 65 75 20 c3 9c 62 75 6e 67 73 67 72 61 6d 6d
0000040 a t i k V 2 \n
61 74 69 6b 20 56 32 0a
0000050
Comments (5)
-
reporter -
repo owner - changed status to open
-
repo owner Unfortunately, the ZIP file format has only two file name encodings: UTF-8 (bit 11 in the “general purpose bit flag” is set) and unspecified encoding (bit 11 is not set). Many applications today, and even more applications from the 1990s and 2000s, do not care about file name encodings and store names as is, using the system default character set. For example, all DOS applications and the Windows NT internal archiver use the CP866 charset for Cyrillic and do not save a UTF-8 name. This has produced a large set of legacy archives and archiver applications in the ex-USSR that use the single-byte CP866 encoding to store Cyrillic file names. So I want to support them.
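For illustration, bit 11 can be inspected per entry with Python's standard zipfile module. This is only a sketch: the archive is built in memory, the entry names are made up, and `name_encoding_flags` is a hypothetical helper. It relies on Python's zipfile writer setting bit 11 only for names that are not pure ASCII.

```python
import io
import zipfile

UTF8_NAME_FLAG = 0x800  # bit 11 of the general purpose bit flag


def name_encoding_flags(archive: bytes) -> dict:
    """Map each entry name to True if its UTF-8 flag (bit 11) is set."""
    with zipfile.ZipFile(io.BytesIO(archive)) as zf:
        return {
            info.filename: bool(info.flag_bits & UTF8_NAME_FLAG)
            for info in zf.infolist()
        }


# Build a small test archive with one ASCII and one non-ASCII name.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("plain.txt", b"")
    zf.writestr("\u00dcbung.txt", b"")

flags = name_encoding_flags(buf.getvalue())
for entry_name, is_utf8 in flags.items():
    # ascii() keeps the output printable on non-UTF-8 terminals.
    print(ascii(entry_name), "UTF-8 flag set" if is_utf8 else "encoding unspecified")
```

An entry with the flag unset, such as the directory in the reported archive, gives a reader no reliable way to know which legacy code page was used.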
fuse-zip supports the built-in FUSE module iconv to convert file names from and to non-UTF-8 encodings. Try mounting the archive with the following command:
$ fuse-zip -r -omodules=iconv,from_code=cp437 011081_Schritte_Neu_Uebungsgrammatik_V2_Audiodateien.zip mnt
The negative side effect of using the iconv module is a double conversion of file names when ZIP_FL_ENC_GUESS and -omodules=iconv are used together. So I can’t use ZIP_FL_ENC_GUESS by default. Perhaps we should add a flag that enables encoding guessing.
-
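The double conversion can be reproduced on plain byte strings. The sketch below does not call libzip or FUSE; it assumes that encoding guessing yields a UTF-8 name from the CP437 bytes and that the iconv module with from_code=cp437 then interprets those UTF-8 bytes as CP437 a second time.

```python
# Raw CP437 bytes for a shortened example name starting with U-umlaut.
raw = b"\x9abung"

# Single conversion: what either mechanism produces when used alone.
single = raw.decode("cp437")

# Step 1 (assumed effect of ZIP_FL_ENC_GUESS): CP437 bytes become UTF-8.
guessed_utf8 = raw.decode("cp437").encode("utf-8")

# Step 2 (assumed effect of -omodules=iconv,from_code=cp437): the bytes,
# which are already UTF-8, are treated as CP437 once more.
double = guessed_utf8.decode("cp437")

# The two UTF-8 bytes of the umlaut turn into two unrelated characters.
print(ascii(single))
print(ascii(double))
```
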
repo owner - changed title to fuse-zip displays file names with invalid UTF-8 sequences
-
repo owner - changed milestone to 0.8