Accented characters improperly rendered (appear as `?`) on hg repositories

Issue #310 resolved
Romain DEP.
created an issue

Changesets whose description contains accented characters are shown with question marks in place of those characters.

  • repro url for hg: link
  • non-repro for git: link
  • kallithea version: 522cfb2be9e1
  • os & env: rpm -qa|grep "wsgi\|httpd" → httpd-2.4.29-1.fc27.x86_64 mod_wsgi-4.5.15-4.fc27.x86_64
  • python --version → Python 2.7.14
  • hg version → Mercurial Distributed SCM (version 4.4.2)

Comments (15)

  1. Thomas De Schampheleire

    What is the output of the locale command in the terminal where you start Kallithea? Kallithea expects to be run in a UTF-8 environment.

    I cannot reproduce this problem; I see accented characters just fine in an hg repo.

    I have the following output from locale:

    LANG=en_US.utf8
    LC_CTYPE="en_US.utf8"
    LC_NUMERIC="en_US.utf8"
    LC_TIME="en_US.utf8"
    LC_COLLATE="en_US.utf8"
    LC_MONETARY="en_US.utf8"
    LC_MESSAGES="en_US.utf8"
    LC_PAPER="en_US.utf8"
    LC_NAME="en_US.utf8"
    LC_ADDRESS="en_US.utf8"
    LC_TELEPHONE="en_US.utf8"
    LC_MEASUREMENT="en_US.utf8"
    LC_IDENTIFICATION="en_US.utf8"
    LC_ALL=
    

    There is probably also a way to not use utf-8 if you really don't want to, but I guess it's no problem for you?

  2. Romain DEP. reporter

    @Thomas De Schampheleire Hi! Thanks. I run Kallithea through a WSGI script, which I amended with the following lines:

    import subprocess

    # Dump the locale seen by the WSGI process to a file for inspection.
    with open('/path/to/kallithea-src/data/ktenv.txt', 'w+') as f:
        f.write(subprocess.check_output(['locale']))
    

    which writes:

    LANG=C
    LC_CTYPE="C"
    LC_NUMERIC="C"
    LC_TIME="C"
    LC_COLLATE="C"
    LC_MONETARY="C"
    LC_MESSAGES="C"
    LC_PAPER="C"
    LC_NAME="C"
    LC_ADDRESS="C"
    LC_TELEPHONE="C"
    LC_MEASUREMENT="C"
    LC_IDENTIFICATION="C"
    LC_ALL=
    

    The WSGI script is spun-up by apache with the following conf:

        WSGIDaemonProcess kallithea threads=2
        WSGIProcessGroup kallithea
        WSGIScriptAlias / /path/to/kallithea-src/dispatch.wsgi process-group=kallithea
        WSGIPassAuthorization On
    

    But, hey, it seems that this change fixes it:

    -    WSGIDaemonProcess kallithea threads=2
    +    WSGIDaemonProcess kallithea threads=2 lang='en_US.UTF-8' locale='en_US.UTF-8'
    

    so, problem solved.

    I'll turn this change into a documentation PR if you think it would help future users. But as a general, future-proof fix, wouldn't it be better if Kallithea set the encoding to UTF-8 itself when it is missing or improperly specified?

    @domruf : hg log is fine, that was purely a WSGI/env issue it seems.
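
A minimal sketch of why a C locale can end up producing question marks, assuming the description is held as Unicode text and converted to the process encoding with unencodable characters replaced. That mechanism is an assumption, not something confirmed in this thread; `to_local` and the sample description are hypothetical.

```python
# -*- coding: utf-8 -*-

# Hypothetical changeset description containing accented characters.
description = u"Correction des caractères accentués"


def to_local(text, encoding):
    """Hypothetical helper: encode text, replacing what the encoding cannot represent."""
    return text.encode(encoding, "replace")


# A C locale implies an ASCII encoding, so each accented character is replaced.
as_ascii = to_local(description, "ascii")

# With UTF-8 every character is representable and nothing is lost.
as_utf8 = to_local(description, "utf-8")
```

With the `lang` and `locale` options added to WSGIDaemonProcess, the daemon process would take the second path rather than the first.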

  3. Mads Kiilerich

    The patch looks good - thanks.

    But I wonder how well it works on Windows?

    And would it perhaps be better to set environment variables?

    Or should we perhaps have a .ini setting for setting the locale?

  4. Thomas De Schampheleire

    @Mads Kiilerich I have no experience deploying Kallithea on Windows, and I don't know whether Unicode issues can occur there.

    If it is possible to set the right settings from within Kallithea, perhaps based on an ini setting, I think that would be preferable to deployment-specific settings that differ between uwsgi, mod_wsgi, etc., or that rely on admin settings like environment variables.

    If there are things dependent on the user environment, then we may want to add a 'test' page in the admin interface to verify that everything is fine, i.e. some text with various unicode characters and a description of what it should look like, or an image.

  5. Mads Kiilerich

    @Romain DEP.

    Can you confirm that you see the same problem when running a development server with gearbox serve -c my.ini?

    Also, can you try to replace your wsgi lang configuration with

    --- a/kallithea/config/app_cfg.py
    +++ b/kallithea/config/app_cfg.py
    @@ -119,6 +119,9 @@ else:
     def setup_configuration(app):
         config = app.config
    
    +    os.environ['LANG'] = 'en_US.UTF-8'
    +    os.environ['LANGUAGE'] = 'en_US.UTF-8'
    +
    

    and see if that does the job ... also when running with gearbox?

  6. Romain DEP. reporter

    Hi @Mads Kiilerich ,

    • Serving through gearbox doesn't expose the issue at all (i.e. accented chars DO render properly).
    • Unapplying the WSGI lang configuration AND applying the patch doesn't solve the original issue (i.e. despite os.environ being set, accented chars DO NOT render properly).

    so it looks pretty much like an apache-specific issue?

  7. Mads Kiilerich

    Can you try:

    --- a/kallithea/config/app_cfg.py
    +++ b/kallithea/config/app_cfg.py
    @@ -115,10 +115,13 @@ else:
         base_config['renderers'].append('kajiki')
         enable_debugbar(base_config)
    
    +import mercurial
    
     def setup_configuration(app):
         config = app.config
    
    +    mercurial.encoding.encoding = config.get('hgencoding', 'UTF-8')
    +
         if config.get('ignore_alembic_revision', False):
             log.warn('database alembic revision checking is disabled')
         else:
    

    It seems the problem is caused by mercurial.encoding setting the default encoding at import time and being imported very early, before we get around to setting environment variables. One way around it is thus to just patch it later. To avoid hardcoding it completely, give it the only meaningful default, and make it configurable but undocumented until we see an actual use for it.

    The direct mocking of mercurial should perhaps be encapsulated somewhere ... but I don't know where ...
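
The import-time behaviour described in this comment can be sketched without Mercurial itself. The sketch uses a hypothetical stand-in, `load_encoding_module`, for a module that reads `HGENCODING` once when it is loaded; the `ascii` fallback and the plain dict standing in for `os.environ` are assumptions made for illustration. It shows that an environment change made afterwards has no effect, while assigning the attribute directly, as the patch does, still works.

```python
def load_encoding_module(environ):
    """Hypothetical stand-in for a module that fixes its default encoding when loaded."""

    class EncodingModule(object):
        # Evaluated once, at "import" time; later changes to environ are not seen.
        encoding = environ.get("HGENCODING", "ascii")

    return EncodingModule


# A plain dict stands in for os.environ in a process started without HGENCODING.
environ = {}
encoding_module = load_encoding_module(environ)  # the early import

environ["HGENCODING"] = "UTF-8"                  # set too late to matter
after_env_change = encoding_module.encoding      # still the import-time value

encoding_module.encoding = "UTF-8"               # patch the attribute afterwards
after_patch = encoding_module.encoding
```

This would also be consistent with the earlier observation that setting os.environ inside setup_configuration made no difference under Apache.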

  8. Mads Kiilerich

    Hmm. If doing something like this, I guess it should use default_encoding which already is mentioned in setup.rst.

    Also, should such a change be accompanied by documentation changes?

    But looking closer, I see that setup.rst already mentions setting HGENCODING in the dispatch script. It should perhaps be done more consistently (and in kallithea/lib/paster_commands/install_iis.py). That would be a more generic solution than tweaking the mod_wsgi configuration.

    @Romain DEP. what do you suggest? Could you provide a follow-up PR with the perfect solution?

  9. Romain DEP. reporter

    Yeah, you are right: only the first of the two WSGI examples in setup.rst sets os.environ["HGENCODING"] = "UTF-8", and unfortunately I had based my conf on the second example, hence the trouble.

    As it is enough to do the trick, I updated the PR accordingly.

    Not sure about install_iis.py, though, that's uncharted territory for me :)
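
A rough sketch of the dispatch-script approach settled on here: set `HGENCODING` at the very top of the script, before anything that might import Mercurial. Only the `os.environ["HGENCODING"] = "UTF-8"` assignment comes from the thread; the `configure_encoding` helper and the use of `setdefault`, so that a value already chosen by the deployment is respected, are illustrative assumptions.

```python
import os


def configure_encoding(environ, default="UTF-8"):
    """Hypothetical helper: set HGENCODING unless the deployment already chose one."""
    environ.setdefault("HGENCODING", default)
    return environ["HGENCODING"]


# First statement of the dispatch script, ahead of any application imports.
configure_encoding(os.environ)

# The imports and application loading of the real dispatch script would follow
# here; they are omitted because they depend on the deployment.
```

Ordering is the whole point of this variant: the assignment only helps if it runs before the first import that reads the variable.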
