Make sure the manifest only ever contains UTF-8

#28 Merged
  1. Stefan H. Holek

I would like to:

  • Back out the 'surrogateescape' error handler introduced in 0.6.29. It turned out to be ill-advised when dealing with the manifest. Surrogates also broke later parts of the toolchain (zipfile).

  • Instead, skip files whose names cannot be encoded to UTF-8. This may seem drastic but is the only way to ensure that all metadata is in fact UTF-8 encoded.

  • Attempt to support UTF-8 filenames even when LANG=C. At least read_manifest now produces a good file list. I do not intend to explore this further, but the work was necessary to expose real problems further down the road.
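
The skipping behavior described above can be sketched roughly as follows. This is a minimal illustration, not the code in this pull request: the helper name `utf8_encodable` and the warning text are hypothetical, and the sample names are made up. It assumes Python 3, where the 'surrogateescape' handler exists.

```python
import warnings


def utf8_encodable(filenames):
    """Return only the names that encode cleanly to UTF-8; warn about the rest.

    Hypothetical helper for illustration only.
    """
    kept = []
    for name in filenames:
        try:
            name.encode("utf-8")
        except UnicodeEncodeError:
            # Lone surrogates (e.g. from 'surrogateescape') cannot be encoded.
            warnings.warn("%r is not UTF-8 encodable, skipping" % name)
            continue
        kept.append(name)
    return kept


# A valid name, and one decoded with 'surrogateescape' from a stray
# Latin-1 byte (0xE9), which leaves a lone surrogate in the string.
good = "caf\u00e9.txt"
bad = b"caf\xe9.txt".decode("utf-8", "surrogateescape")

result = utf8_encodable([good, bad])
```

With these inputs only the valid name survives, so nothing that fails to encode as UTF-8 would reach the manifest.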

I plan to merge this in the coming days. Comments welcome. Silence is consent. ;-)

Comments (4)

  1. Lennart Regebro

    "skip files whose names cannot be encoded to UTF-8" - It should be possible to encode every valid filename to UTF-8. If it isn't, the name was probably incorrectly encoded in the first place. I'm completely open to explicitly failing if that happens.

  2. Stefan H. Holek author

    Yes, only surrogates make filenames unencodable. It is, however, easy to contract surrogates, e.g. by setting LANG=C. I was hoping the warning was "explicit" enough in that case. It is modeled after a similar warning in distutils.
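
To illustrate how surrogates can be "contracted" under LANG=C, the sketch below simulates the effect explicitly instead of changing the locale. It assumes that the filesystem encoding under the C locale is ASCII with the 'surrogateescape' handler; newer Python versions may treat the C locale differently, so this is an approximation. The file name is made up.

```python
# UTF-8 bytes of a file name as stored on disk.
utf8_bytes = "caf\u00e9.txt".encode("utf-8")

# What the name looks like when decoded as ASCII with 'surrogateescape':
# each non-ASCII byte becomes a lone surrogate in the range U+DC80..U+DCFF.
as_seen_under_c = utf8_bytes.decode("ascii", "surrogateescape")

has_surrogates = any(0xDC80 <= ord(ch) <= 0xDCFF for ch in as_seen_under_c)

# Such a string cannot be encoded to UTF-8 directly.
try:
    as_seen_under_c.encode("utf-8")
    encodable = True
except UnicodeEncodeError:
    encodable = False

# The original bytes can still be recovered with the same handler,
# and then decoded properly as UTF-8.
recovered = as_seen_under_c.encode("ascii", "surrogateescape").decode("utf-8")
```

Here `has_surrogates` is true and `encodable` is false, while `recovered` equals the original name. One plausible way to handle UTF-8 file names in this situation is to go back to the raw bytes like this, though that is not necessarily what the change here does.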