make-index does not work with files with "ñ" characters

Issue #563 resolved
Ricardo Cardona Ramirez
created an issue

When I run make-index, I get an encoding error because the repository has files with non-ASCII characters such as 'ñ' in their names.

This is the traceback:



Traceback (most recent call last):
  File "/usr/bin/paster", line 8, in <module>
    load_entry_point('PasteScript==1.7.5', 'console_scripts', 'paster')()
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 104, in run
    invoke(command, command_name, options, args[1:])
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 143, in invoke
    exit_code =
  File "/var/www/rhodecode/rhodecode/lib/", line 649, in run
    return super(BasePasterCommand, self).run(args[1:])
  File "/usr/lib/python2.6/site-packages/PasteScript-1.7.5-py2.6.egg/paste/script/", line 238, in run
    result = self.command()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 129, in command
    .run(full_index=self.options.full_index)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 413, in run
    self.update_indexes()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 405, in update_indexes
    self.update_file_index()
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 352, in update_file_index
    i, iwc = self.add_doc(writer, path, repo, repo_name)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 141, in add_doc
    node = self.get_node(repo, path)
  File "/var/www/rhodecode/rhodecode/lib/indexers/", line 129, in get_node
    node = repo.get_changeset().get_node(n_path)
  File "/var/www/rhodecode/rhodecode/lib/vcs/backends/hg/", line 334, in get_node
    % (path, self.short_id))
rhodecode.lib.vcs.exceptions.NodeDoesNotExistError: There is no file nor directory at the given path: 'GUI/Controls/DatosDa\xc5\x84oApp.cs' at revision '9d9211f829be'


The real path is 'GUI/Controls/DatosDañoApp.cs', so this looks like an encoding problem.

Comments (10)

  1. Ricardo Cardona Ramirez reporter



    In the function get_node (line 334):

    The variable self._file_paths holds the value 'GUI/Controls/DatosDa\xf1oApp.cs', which differs from the value of the function parameter "path", 'GUI/Controls/DatosDa\xc5\x84oApp.cs'.

    Note that:

    1. \xf1 is encoded in cp1252
    2. \xc5\x84 is encoded in utf8

    It looks like the indexer loads path names as utf8, while the changeset object uses the repository's encoding, cp1252.
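
The two byte strings quoted above can be checked directly. The following is a minimal sketch in Python 3 syntax (the installation in the traceback runs Python 2.6); the variable names are illustrative only. It shows that \xf1 is indeed 'ñ' in cp1252, but that \xc5\x84 decodes as UTF-8 to 'ń' (U+0144), not 'ñ', whose UTF-8 form is \xc3\xb1. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that the repository bytes were decoded with a code page such as cp1250, where 0xF1 maps to 'ń', and then re-encoded as UTF-8.

```python
# The real file name from the report; 'ñ' is U+00F1.
real_name = u"GUI/Controls/DatosDa\xf1oApp.cs"

# How the name looks in the two encodings mentioned in the comment.
repo_bytes = real_name.encode("cp1252")   # contains b"\xf1"
utf8_bytes = real_name.encode("utf-8")    # contains b"\xc3\xb1"

# The path that actually appears in the NodeDoesNotExistError message.
seen_bytes = b"GUI/Controls/DatosDa\xc5\x84oApp.cs"
seen_text = seen_bytes.decode("utf-8")    # contains U+0144, not U+00F1

# ASSUMPTION: the bytes were decoded with cp1250 somewhere along the way.
# This reproduces the path from the traceback, but it is only a guess.
guess_bytes = repo_bytes.decode("cp1250").encode("utf-8")

print(repr(repo_bytes))
print(repr(utf8_bytes))
print(repr(seen_text))
print(guess_bytes == seen_bytes)
```

If the last line prints True on a given setup, the mismatch is a wrong decode followed by a re-encode, not a plain cp1252 versus utf8 difference.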

  2. Marcin Kuzminski repo owner

    whoosh requires everything to be unicode, so somewhere there is a transformation with an encoding mismatch. Generally, if you stick to utf8 it works; problems start when you mix encodings. That's why I asked whether that is the case here.
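
One common way to avoid mixing encodings is to decode raw bytes to unicode once, at the boundary, trying a fixed list of encodings in order. The helper below is a hypothetical sketch written for this thread in Python 3 syntax; the name to_unicode and the default encoding list are assumptions and are not taken from RhodeCode or whoosh.

```python
def to_unicode(raw, encodings=("utf-8", "cp1252")):
    """Decode bytes to text, trying each encoding in order.

    Hypothetical helper: the name and the default encoding list are
    assumptions made for illustration.
    """
    if isinstance(raw, str):
        # Already text, nothing to decode.
        return raw
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going, marking undecodable bytes with U+FFFD.
    return raw.decode(encodings[0], errors="replace")


print(to_unicode(b"DatosDa\xc3\xb1oApp.cs"))  # UTF-8 input
print(to_unicode(b"DatosDa\xf1oApp.cs"))      # cp1252 input
```

The order of the list matters: utf-8 goes first because it rejects most byte strings that are not valid UTF-8, whereas cp1252 accepts almost anything.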

  3. Ricardo Cardona Ramirez reporter

    For now, my choice is to index everything that can be indexed: if an error occurs on a file, that file is skipped and the process moves on to the next one, so the index will at least contain most of the repository's information.
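
The skip-and-continue approach described above could look roughly like the sketch below. It is not the actual change made to the indexer: index_all and the add_doc callback are hypothetical names, and NodeDoesNotExistError is a local stand-in for rhodecode.lib.vcs.exceptions.NodeDoesNotExistError so that the example runs on its own.

```python
import logging

log = logging.getLogger(__name__)


class NodeDoesNotExistError(Exception):
    """Local stand-in for the RhodeCode exception seen in the traceback."""


def index_all(paths, add_doc):
    """Call add_doc(path) for every path, skipping the ones that fail.

    Returns (indexed, skipped) so the caller can report what was left out.
    """
    indexed, skipped = [], []
    for path in paths:
        try:
            add_doc(path)
        except (NodeDoesNotExistError, UnicodeError) as exc:
            # Log and move on instead of aborting the whole index run.
            log.warning("skipping %r: %s", path, exc)
            skipped.append(path)
        else:
            indexed.append(path)
    return indexed, skipped
```

Returning the skipped paths keeps the failures visible, so the affected files can be listed and fixed later.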
