Commit messages with Cyrillic characters are shown incorrectly

Issue #249 wontfix
Ruslan Yushchenko
created an issue

Commit messages that contain Cyrillic characters are displayed as unknown-character placeholders. Please see the attached picture.

My Configuration: Windows+Python26+RhodeCode-1.1.8

Comments (10)

  1. Marcin Kuzminski repo owner

    That's because the messages are not in UTF-8. RhodeCode 1.1.8 only accepts non-ASCII characters when they are encoded as UTF-8. In 1.2 we will approach this problem so that users can fall back to another encoding if UTF-8 decoding fails.
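A small sketch of why the placeholders appear, written for Python 3 and assuming the commit message reached the decoder as cp1251 bytes. The sample message and the helper name `strict_utf8` are made up for illustration and are not taken from RhodeCode.

```python
# Hypothetical sample: a Cyrillic commit message stored as cp1251 bytes.
message = "Исправлена ошибка"
raw = message.encode("cp1251")


def strict_utf8(data):
    """Return the decoded text, or None if data is not valid UTF-8."""
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        return None


# Strict UTF-8 decoding rejects the cp1251 bytes outright.
as_utf8 = strict_utf8(raw)

# Lenient decoding succeeds but substitutes U+FFFD for the invalid bytes,
# which is what shows up as "unknown character" in a web page.
lossy = raw.decode("utf-8", "replace")

# Decoding with the encoding the bytes were written in recovers the text.
correct = raw.decode("cp1251")
```

Only the last call recovers the original message, which is why a fallback encoding is needed when the input is not UTF-8.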

  2. Ruslan Yushchenko reporter

    OK, I got it. But Mercurial can convert commit messages to UTF-8 on the fly. For example,

    hg log --encoding UTF-8

    works perfectly on UTF-8 terminals. And

    hg log --encoding 1251

    works well on Windows using the same repository.

  3. Ruslan Yushchenko reporter

    It doesn't, sorry. I'll try to hack some more tomorrow. The RhodeCode project is great and I definitely want to try it out. I'll try the beta version first. If it still has that problem, I'll try to fix it and propose a patch. Thank you!

  4. Ruslan Yushchenko reporter

    Mercurial does indeed store all metadata in UTF-8; it uses the local encoding only to display messages. So all I have to do is tell Mercurial that my local encoding is UTF-8 instead of cp1251. There is an environment variable that does just that: "HGENCODING".

    So, before RhodeCode is started, the variable needs to be set like this:

    SET HGENCODING=UTF-8
    paster serve production.ini
    

    That solves the problem with all commit messages, user names, branches, tags, etc. All of them can now use Cyrillic characters.
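The same setup can be sketched as a small Python launcher, which avoids depending on the shell's `SET` syntax. This is only an illustration for Python 3: the helper names `env_with_hgencoding` and `serve` are hypothetical, and it assumes `paster` is on the PATH and `production.ini` is in the current directory, as in the commands above.

```python
import os
import subprocess


def env_with_hgencoding(base_env=None, encoding="UTF-8"):
    """Return a copy of the environment with HGENCODING set."""
    env = dict(os.environ if base_env is None else base_env)
    env["HGENCODING"] = encoding
    return env


def serve(ini_path="production.ini"):
    """Start RhodeCode via paster with HGENCODING forced to UTF-8."""
    # Assumption: "paster" is on the PATH, as in the commands above.
    return subprocess.call(["paster", "serve", ini_path],
                           env=env_with_hgencoding())
```

Calling `serve()` blocks until the server process exits; the variable is set only for that child process and does not change the parent environment.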

    But one problem still remains: files in a repository are stored in the local encoding, and Mercurial doesn't care what it is. So all diffs and raw files are displayed incorrectly. This issue is more confusing, since neither Mercurial nor RhodeCode has a way to specify the encoding of repository contents. Maybe adding a configuration option like "content_encoding=windows-1251" would be a good enough solution? Or maybe the content encoding would be better specified on a per-repository basis in RhodeCode's database?

  5. Ruslan Yushchenko reporter

    Now I have finally found why diffs and files are displayed incorrectly. The function 'safe_unicode' is defined twice: once in the rhodecode library and once in the vcs/utils library. To work correctly, the local encoding has to be set in both of these functions.

    Now everything works fine. I hope my little research will be useful in the future.

  6. Marcin Kuzminski repo owner

    In the latest 1.2 I added detection via the chardet library (if it is installed) for safe_unicode when the default UTF-8 decoding fails. I'm also thinking of adding a variable to the .ini file so that the server admin can define which encoding they are using, but in my opinion this is only good if you're using one and only one encoding...

    I cannot think of a good solution that gives a single point of control for both vcs and rhodecode, but a VCSENCODING variable might be an option here.
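A minimal sketch of the fallback chain described in this comment, written for Python 3. It is not RhodeCode's actual `safe_unicode`: the name `safe_unicode_sketch`, the `fallback_encodings` and `use_chardet` parameters, and the order of the steps (UTF-8, then chardet, then admin-configured encodings) are assumptions made for illustration.

```python
def safe_unicode_sketch(data, fallback_encodings=("cp1251",), use_chardet=True):
    """Decode bytes to text: UTF-8 first, then chardet, then fallbacks."""
    if isinstance(data, str):
        return data

    # 1. The default: strict UTF-8.
    try:
        return data.decode("utf-8")
    except UnicodeDecodeError:
        pass

    # 2. Optional detection, only if the chardet package is installed.
    if use_chardet:
        try:
            import chardet
        except ImportError:
            chardet = None
        if chardet is not None:
            guess = chardet.detect(data).get("encoding")
            if guess:
                try:
                    return data.decode(guess)
                except (UnicodeDecodeError, LookupError):
                    pass

    # 3. Encodings configured by the server admin (hypothetical setting).
    for encoding in fallback_encodings:
        try:
            return data.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            continue

    # 4. Last resort: keep what is readable, mark the rest with U+FFFD.
    return data.decode("utf-8", "replace")
```

Detection on short strings such as commit messages is only a guess, so an explicit encoding list is more predictable when a server really does use a single encoding.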

  7. Ruslan Yushchenko reporter

    Thank you! I just want to point out that Mercurial already stores its metadata (commit messages, tag and branch names, etc.) in UTF-8, so there is no need to convert anything at all. It may require using a somewhat lower-level interface to Mercurial, though.

    On the other hand, files are stored in repositories as is, without any changes, no matter what encoding is used. So the configurable option should, in my opinion, specify the encoding only for files and diffs.
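The split proposed here could look roughly like the following Python 3 sketch: metadata is always decoded as UTF-8, while file contents and diffs use a configurable encoding. The function names, the `REPO_CONTENT_ENCODINGS` mapping, and the repository name in it are hypothetical; `content_encoding` echoes the option name suggested earlier in this thread and is not an existing RhodeCode setting.

```python
# Mercurial stores metadata as UTF-8, so this is fixed rather than configured.
METADATA_ENCODING = "utf-8"

# Hypothetical per-repository setting, e.g. loaded from the database.
REPO_CONTENT_ENCODINGS = {"legacy-project": "windows-1251"}
DEFAULT_CONTENT_ENCODING = "utf-8"


def decode_metadata(raw):
    """Decode commit messages, tag names, and branch names."""
    return raw.decode(METADATA_ENCODING, "replace")


def decode_content(raw, repo_name):
    """Decode file contents or diffs using the repository's encoding."""
    content_encoding = REPO_CONTENT_ENCODINGS.get(
        repo_name, DEFAULT_CONTENT_ENCODING)
    return raw.decode(content_encoding, "replace")
```

Keeping the two paths separate means a wrong content setting can garble files and diffs but never commit messages or branch names.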
