[unicode] encoding error with hg repo and umlaut

Issue #141 new
Adi Kriegisch created an issue

The error can be triggered either by running 'paster make-index production.ini' or by browsing the files in the repo:

Traceback (most recent call last):
  File "paster", line 9, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 143, in invoke
    exit_code = runner.run(args)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/utils.py", line 753, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "(...)/lib/python2.7/site-packages/paste/script/command.py", line 238, in run
    result = self.command()
  File "(...)/lib/python2.7/site-packages/kallithea/lib/paster_commands/make_index.py", line 84, in command
    .run(full_index=self.options.full_index)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 451, in run
    self.update_indexes()
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 443, in update_indexes
    self.update_file_index()
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 390, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 175, in add_doc
    node = self.get_node(repo, path, index_rev)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/indexers/daemon.py", line 163, in get_node
    node = cs.get_node(node_path)
  File "(...)/lib/python2.7/site-packages/kallithea/lib/vcs/backends/hg/changeset.py", line 352, in get_node
    % (path, self.short_id))
kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: '�berblick_Machbarkeitsstudie.doc' at revision XXX

The filename itself decodes fine with either latin-1 or latin-2:

>>> l=os.listdir(".")
>>> l
['.hg', '\xdcberblick_Machbarkeitsstudie.doc']
>>> print l[1]
berblick_Machbarkeitsstudie.doc
>>> chardet.detect(l[1])
{'confidence': 0.8991773543668901, 'encoding': 'ISO-8859-2'}
>>> print l[1].decode('ISO-8859-2')
Überblick_Machbarkeitsstudie.doc
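
The same check can be reproduced without the repository. This is a minimal sketch that hard-codes the byte string from the listing above; the variable names are illustrative only. It shows that the name decodes identically as ISO-8859-1 and ISO-8859-2 but is not valid UTF-8:

```python
# Raw filename bytes as returned by os.listdir() in the session above.
raw = b'\xdcberblick_Machbarkeitsstudie.doc'

# Byte 0xDC maps to U+00DC (capital U with diaeresis) in both ISO-8859-1
# and ISO-8859-2, so both codecs produce the same text.
as_latin1 = raw.decode('iso-8859-1')
as_latin2 = raw.decode('iso-8859-2')

# The same bytes are not valid UTF-8: 0xDC announces a two-byte sequence,
# but the following byte ('b') is not a continuation byte.
try:
    raw.decode('utf-8')
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False

print(as_latin1 == as_latin2, valid_utf8)
```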

Is there anything else you need that would help with debugging?

Comments (8)

  1. Mads Kiilerich

    I guess the best way to make it work is to manually set the HGENCODING environment variable to the right encoding before launching Kallithea.

  2. Adi Kriegisch reporter

    I don't think so: the system is a Debian Wheezy and uses UTF-8. The repository itself has been created on some kind of Windows machine (with XP) or an older version of Mac OS X. The filename encoding is definitely "strange". ;-)

    My point is: whatever Kallithea does, it should not crash. Creating a broken repo and pushing invalid file names to Kallithea is easy and can be abused to "DoS" the file indexer (as in the above example).

  3. Mads Kiilerich

    Mercurial stores filenames in whatever encoding the OS uses. On Windows that means latin1 because it uses the 'A' system calls. Someone has to tell Mercurial to use latin1 when decoding them to unicode for internal web-ready use.

    The actual encoding on linux systems is pretty much irrelevant to Mercurial and ignored. It doesn't do much text processing and everything is byte streams.

    Sure, Kallithea shouldn't crash. But there is also no way it can work correctly unless you tell it what encoding to use. (Some Mercurial developers have talked about implementing some 'guessing' of encoding. I'm not sure how that will work for web roundtrips.)
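
The "don't crash, but can't guess correctly" trade-off described above can be sketched as follows. This is an illustration only: `decode_path` and its candidate list are hypothetical names, not part of Kallithea or Mercurial. The helper tries each candidate encoding in turn and falls back to a lossy decode instead of raising:

```python
def decode_path(raw, encodings=('utf-8', 'latin-1')):
    """Decode filename bytes to text without ever raising.

    Hypothetical helper: tries each candidate encoding in order and
    falls back to UTF-8 with replacement characters if none fits.
    """
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Lossy fallback: undecodable bytes become U+FFFD.
    return raw.decode('utf-8', 'replace')


# A Windows-style (latin-1) name and a Linux-style (UTF-8) name
# both come out as the same text.
from_windows = decode_path(b'\xdcber.doc')
from_linux = decode_path(b'\xc3\x9cber.doc')

# With only UTF-8 allowed, the latin-1 name degrades instead of crashing.
degraded = decode_path(b'\xdcber.doc', encodings=('utf-8',))
```

Note that latin-1 accepts every byte value, so once it is in the candidate list the lossy fallback is never reached; the result is then readable but not necessarily the name the author intended.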

  4. Adi Kriegisch reporter

    ok... the behaviour improved (kind of):

    kallithea.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'Überblick_Machbarkeitsstudie.doc' at revision XXX
    

    after I installed chardet in the venv. This btw. might also have an effect on #9: the unknown character symbol vanished from the web view.

    edit: ah, and specifying HGENCODING when running paster does not have any effect at all (tried with utf-8, latin-1 and latin-2).

  5. Adi Kriegisch reporter

    minor update with a hack that works here (tm). The fix is in /kallithea/lib/vcs/backends/hg/changeset.py in function get_node:

    path = self._fix_path(path)
    # FIX for Überblick_Machbarkeitsstudie.doc:
    # in filesystem 'Ü' is \xdc (as byte string)
    # in variable path 'Ü' is \xc3\x9c (as byte string after conversion)
    path = path.decode('utf-8').encode('raw_unicode_escape')
    

    This only works when chardet is installed, because some other conversions then take place beforehand. I am pretty sure this is an ugly hack and should most probably go into either _fix_path or even safe_str (from utils). I have no idea how big the impact on other parts of the code would be...
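
What the one-line hack does to the bytes can be shown in isolation. The sketch below assumes, as the code comments in the hack state, that `path` arrives as UTF-8 bytes while the repository stores the latin-1 form; the variable names are illustrative. It also shows where `raw_unicode_escape` differs from a plain latin-1 round trip:

```python
# Byte strings taken from the comments in the hack above.
utf8_path = b'\xc3\x9cberblick_Machbarkeitsstudie.doc'  # value of `path`
fs_path = b'\xdcberblick_Machbarkeitsstudie.doc'        # name on disk

# The hack: decode as UTF-8, then re-encode with raw_unicode_escape.
hacked = utf8_path.decode('utf-8').encode('raw_unicode_escape')

# For code points up to U+00FF this is identical to encoding as latin-1.
via_latin1 = utf8_path.decode('utf-8').encode('latin-1')

# Above U+00FF the two codecs diverge: raw_unicode_escape emits a literal
# backslash-u escape, whereas latin-1 raises UnicodeEncodeError.
euro_name = u'\u20ac.txt'
escaped = euro_name.encode('raw_unicode_escape')
try:
    euro_name.encode('latin-1')
    latin1_ok = True
except UnicodeEncodeError:
    latin1_ok = False

print(hacked == fs_path, via_latin1 == fs_path, escaped, latin1_ok)
```

So the hack recovers the on-disk name only for characters that exist in latin-1; any other character would silently become a literal `\uXXXX` sequence in the path rather than an error.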

  6. Adi Kriegisch reporter

    To make it work with repos containing files with umlauts in their names from both Linux and Windows, I made the line above conditional:

    if path not in self._file_paths and path not in self._dir_paths:
        path = path.decode('utf-8').encode('raw_unicode_escape')
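
The conditional lookup can be sketched as a standalone function. Everything here is an assumption made for illustration: `resolve_path` is a hypothetical helper, `known_paths` stands in for the union of `_file_paths` and `_dir_paths`, and latin-1 is swapped in for `raw_unicode_escape` so that names outside latin-1 are rejected instead of being turned into escape sequences:

```python
def resolve_path(path, known_paths):
    """Return the byte path present in known_paths, or None.

    Hypothetical helper: the path is tried as given first, and only
    re-encoded from UTF-8 to latin-1 if that lookup fails.
    """
    if path in known_paths:
        return path
    try:
        candidate = path.decode('utf-8').encode('latin-1')
    except (UnicodeDecodeError, UnicodeEncodeError):
        # Not UTF-8, or not representable in latin-1: nothing to try.
        return None
    return candidate if candidate in known_paths else None


# A repo holding one UTF-8 name (from Linux) and one latin-1 name (from Windows).
known = {b'\xc3\x9cbersicht.txt', b'\xdcberblick.doc'}

linux_hit = resolve_path(b'\xc3\x9cbersicht.txt', known)    # found as given
windows_hit = resolve_path(b'\xc3\x9cberblick.doc', known)  # found after re-encoding
missing = resolve_path(b'nothing.txt', known)               # not in the repo
```

Returning `None` leaves it to the caller to raise `NodeDoesNotExistError` as before, so a path that matches under neither encoding still produces the normal error.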
    