make-index does not work with files with "ñ" characters

Issue #563 resolved
Ricardo Cardona Ramirez created an issue

When I run make-index I get an encoding error, because the repository has files with non-ASCII characters such as 'ñ' in their names.

This is the traceback:

{{{
#!python
Traceback (most recent call last):
  File "/usr/bin/paster", line 8, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 143, in invoke
    exit_code =
  File "/var/www/rhodecode/rhodecode/lib/", line 649, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 238, in run
    result = self.command()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 129, in command
    .run(full_index=self.options.full_index)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 413, in run
    self.update_indexes()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 405, in update_indexes
    self.update_file_index()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 352, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 141, in add_doc
    node = self.get_node(repo, path)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 129, in get_node
    node = repo.get_changeset().get_node(n_path)
  File "/var/www/rhodecode/rhodecode/lib/vcs/backends/hg/", line 334, in get_node
    % (path, self.short_id))
rhodecode.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'GUI/Controls/DatosDa\xc5\x84oApp.cs' at revision '9d9211f829be'
}}}


The real path is 'GUI/Controls/DatosDañoApp.cs', so this looks like an encoding problem.

Comments (10)

  1. Ricardo Cardona Ramirez reporter



    In function get_node (line 334):

    The value in self._file_paths, 'GUI/Controls/DatosDa\xf1oApp.cs', differs from the value of the function parameter "path", 'GUI/Controls/DatosDa\xc5\x84oApp.cs'.

    Note that:

    1. \xf1 is encoded in cp1252
    2. \xc5\x84 is encoded in utf8

    It looks like the indexer loads the path names in utf8, while the changeset object uses the repository encoding, cp1252.
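
The two byte strings quoted above can be compared directly. The sketch below uses Python 3 syntax (the traceback comes from Python 2.6), and the list of candidate codecs is an assumption chosen for illustration, not something taken from RhodeCode. It decodes the repository bytes with each candidate and re-encodes the result as UTF-8. The outcome suggests that `\xc5\x84` is not the UTF-8 form of 'ñ' (that would be `\xc3\xb1`) but of 'ń', which is what the byte `\xf1` means in cp1250 or ISO-8859-2. The path in comment 5 fits the same pattern.

```python
# Byte strings reported in this issue.
repo_bytes = b"GUI/Controls/DatosDa\xf1oApp.cs"         # value in self._file_paths
indexer_bytes = b"GUI/Controls/DatosDa\xc5\x84oApp.cs"  # value of the "path" argument

# Candidate codecs: an assumption for illustration only.
CANDIDATES = ["cp1252", "latin-1", "cp1250", "iso-8859-2"]


def codecs_explaining(raw, seen, candidates=CANDIDATES):
    """Return the codecs for which raw -> unicode -> UTF-8 reproduces `seen`."""
    matches = []
    for name in candidates:
        try:
            text = raw.decode(name)
        except UnicodeDecodeError:
            continue  # this codec cannot decode the bytes at all
        if text.encode("utf-8") == seen:
            matches.append(name)
    return matches


print(codecs_explaining(repo_bytes, indexer_bytes))
print("\u00f1".encode("utf-8"), "\u0144".encode("utf-8"))  # ñ versus ń
```

If that reading is right, the path bytes are being decoded with a Central European codec somewhere before they reach the lookup, which would also explain why forcing utf8 alone may not help.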

  2. Marcin Kuzminski repo owner

    Yes, this is because Mercurial stores paths as bytestrings. I'll try to look into the issue.

  3. Ricardo Cardona Ramirez reporter

    To avoid these problems, the indexer and the changeset should use the same encoding.

  4. Marcin Kuzminski repo owner

    I think it's a mixed-encoding problem. Are you using utf8 for non-ASCII characters? If not, try changing the default encoding in the .ini file.

  5. Keats .

    same here...

    rhodecode.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'fla/docs/Requ\xc4\x99tes Jeu Toilokdo.xls' at revision '427c72eca3ce'

  6. Ricardo Cardona Ramirez reporter

    I do not understand why the indexer and the changeset have different encodings, since both use the same methods to get the data from the repository.

  7. Marcin Kuzminski repo owner

    Whoosh requires everything to be unicode, so somewhere there is a transformation with an encoding mismatch. Generally, if you stick to utf8 it works; the problems start when you mix encodings. That's why I asked whether that is the case here.
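
One way to picture the transformation described here: a path read from the repository as bytes is decoded to unicode for the index, then encoded back to bytes for the repository lookup. The sketch below (Python 3, with hypothetical helper names; it is not RhodeCode or Whoosh code) shows that the lookup key only survives when both steps use the same encoding, and that forcing utf8 on bytes that are not utf8 fails outright.

```python
def to_index_text(raw_path, encoding):
    """Decode repository bytes to unicode, as a full-text index requires."""
    return raw_path.decode(encoding)


def to_repo_key(text_path, encoding):
    """Encode unicode back to the bytes used as the repository lookup key."""
    return text_path.encode(encoding)


def survives_roundtrip(raw_path, decode_with, encode_with):
    """True if bytes -> unicode -> bytes gives back the original bytes."""
    try:
        text = to_index_text(raw_path, decode_with)
        return to_repo_key(text, encode_with) == raw_path
    except (UnicodeDecodeError, UnicodeEncodeError):
        return False


stored = b"GUI/Controls/DatosDa\xf1oApp.cs"  # cp1252 bytes from this issue

print(survives_roundtrip(stored, "cp1252", "cp1252"))  # same codec both ways
print(survives_roundtrip(stored, "cp1250", "utf-8"))   # mixed codecs
print(survives_roundtrip(stored, "utf-8", "utf-8"))    # bytes are not valid utf8
```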

  8. Ricardo Cardona Ramirez reporter

    I set "default_encoding = utf8" in the ini file, but the problem persists.

    Another test was setting the environment variable "HGENCODING=UTF8", with the same bad result.

  9. Ricardo Cardona Ramirez reporter

    For now my choice is to index everything it can: if an error occurs on a file, that file is skipped and the process moves on to the next one, so the index will at least contain most of the repository's information.
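
The workaround described in this comment could look roughly like the sketch below. It is not the RhodeCode indexer: `index_all`, `fake_add_doc` and the local `NodeDoesNotExistError` class are hypothetical stand-ins (the real exception lives in `rhodecode.lib.vcs.exceptions`, as the traceback shows). The idea is simply to catch the lookup error per file, record the path, and keep going.

```python
import logging

log = logging.getLogger("indexer_sketch")


class NodeDoesNotExistError(Exception):
    """Local stand-in for rhodecode.lib.vcs.exceptions.NodeDoesNotExistError."""


def index_all(paths, add_doc):
    """Call add_doc(path) for every path; skip and record the ones that fail."""
    indexed, skipped = [], []
    for path in paths:
        try:
            add_doc(path)
        except (NodeDoesNotExistError, UnicodeError) as exc:
            # Do not abort the whole run because of one bad path.
            log.warning("skipping %r: %s", path, exc)
            skipped.append(path)
        else:
            indexed.append(path)
    return indexed, skipped


def fake_add_doc(path):
    """Hypothetical callback: pretend non-ASCII paths cannot be looked up."""
    if any(ord(ch) > 127 for ch in path):
        raise NodeDoesNotExistError(path)


indexed, skipped = index_all(
    ["README.txt", "GUI/Controls/DatosDa\u00f1oApp.cs"], fake_add_doc
)
print(indexed, skipped)
```

Returning the skipped paths makes it possible to report which files are missing from the index instead of losing them silently.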
