patch: Auto detect/convert encoding of file contents

Issue #49 new
Ba Manzi created an issue

As encodings other than UTF-8 (mostly GB2312/GBK) are widely used in the code in our repositories, I added some code to auto-detect and convert the encoding of files. Now the file view displays contents correctly (with or without annotations).

I don't have much expertise in Python programming, but I hope this might be helpful for some users.

Note: the chardet package is required.

diff -r d17e88a1a88a kallithea/lib/vcs/                       
--- a/kallithea/lib/vcs/        Thu Aug 21 23:48:50 2014 +0200
+++ b/kallithea/lib/vcs/        Mon Oct 13 09:59:38 2014 +0800
@@ -290,6 +290,13 @@

         if bool(content and '\0' in content):
             return content
+        if type(content)=='str':
+            import chardet
+            ret = chardet.detect(content)
+            if ret['confidence'] > 0.7:
+                return safe_unicode(content, ret['encoding'])
         return safe_unicode(content)


Comments (1)

  1. Mads Kiilerich

    Thanks for sharing.

    I guess you did some last minute editing and actually meant type(content) == str without quotes.

    Instead, I suggest using isinstance(content, str).

    I am not fond of having guessing involved in the default configuration - it often breaks down in unexpected ways. But it could perhaps fit everybody if the confidence threshold was configurable.

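For illustration, the two review suggestions (an isinstance check and a configurable confidence threshold) could be combined roughly as in the sketch below. It is not Kallithea code and makes several assumptions: it targets Python 3, where raw file contents are bytes (the role str plays in the Python 2 patch above); the names decode_content, min_confidence, detect and fallback are hypothetical; and plain bytes.decode stands in for Kallithea's safe_unicode helper. The only third-party call used is chardet.detect, which returns a dict with 'encoding' and 'confidence' keys.

```python
def decode_content(content, min_confidence=0.7, detect=None, fallback="utf-8"):
    """Decode raw file bytes, trusting an encoding guess only above a threshold.

    Hypothetical helper for illustration; all parameter names are assumptions.
    """
    if not isinstance(content, bytes):
        return content                  # already text, nothing to decode
    if b"\0" in content:
        return content                  # looks binary, leave it untouched
    if detect is None:
        import chardet                  # third-party; imported only when needed
        detect = chardet.detect
    guess = detect(content)
    encoding = guess.get("encoding")    # may be None if detection failed
    confidence = guess.get("confidence") or 0.0
    if encoding and confidence > min_confidence:
        try:
            return content.decode(encoding)
        except (UnicodeDecodeError, LookupError):
            pass                        # guess was wrong or codec unknown
    # Below the threshold (or after a bad guess): decode without guessing.
    return content.decode(fallback, errors="replace")
```

The threshold would presumably be read from a configuration setting rather than hard-coded, so a site that prefers no guessing at all could set it to 1.0 or higher. Passing the detector in as a parameter is only a convenience for testing without chardet installed.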