Pygments 06.dev and Unicode

Issue #133 resolved
Former user created an issue

When creating self._tokens in class RegexLexerMeta, you should also decode the tokens, as is done with source files. Something like this:

{{{
#!python
t = tdef[0]
if not isinstance(t, unicode):
    try:
        import chardet
    except ImportError:
        raise ImportError('To enable chardet encoding guessing, please '
                          'install the chardet library from '
                          'http://chardet.feedparser.org/')
    enc = chardet.detect(t)
    t = t.decode(enc['encoding'])

rex = re.compile(t, rflags)
}}}

Otherwise, tokens written in a language other than English will not be recognized. You should also do the same thing (detect the character set encoding before encoding) when formatting source code, because for now it tries to encode in 'latin1' and returns errors.

{{{
Error while higlighting: 'latin1' codec can't decode characters in position 0-15: ordinal not in range (256)
}}}

Here are some files with a lexer for my language "1S" (or 1C in Russian :)). Please include it in the Pygments package (you can also include them in version 5.1; if the lexer file and the sources are in cp-1251 encoding, everything works fine). The source files are included as well and will be helpful for testing.
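To illustrate the point about decoding token patterns, here is a minimal sketch in Python 3 syntax. The helper `to_text` and its list of candidate encodings are hypothetical and are not part of Pygments; the sketch also tries a fixed list of encodings instead of guessing with chardet, and the keyword is only an example. It shows that a pattern stored as cp1251 bytes has to be decoded to text before the compiled regex can match Unicode source.

```python
import re


def to_text(pattern, encodings=("utf-8", "cp1251")):
    """Hypothetical helper: return `pattern` as text.

    Text is returned unchanged; bytes are decoded with the first
    candidate encoding that succeeds.
    """
    if isinstance(pattern, str):
        return pattern
    for enc in encodings:
        try:
            return pattern.decode(enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("could not decode pattern with any of %r" % (encodings,))


# An example Cyrillic keyword as it would be stored in a lexer file saved as cp1251.
raw_keyword = "Процедура".encode("cp1251")

# Decode first, then compile, so the regex operates on text.
keyword_rex = re.compile(to_text(raw_keyword))

source = "Процедура Тест()"
match = keyword_rex.match(source)
```

With the pattern decoded, `match` is a match object covering the keyword; compiling the raw bytes instead would give a bytes pattern that cannot be applied to a text string at all.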

Reported by mikmiksuny4@tut.by

Comments (6)

  1. Former user Account Deleted

    I think it would be right to give the user the option to pass a parameter to the formatter that specifies which character encoding is wanted for the result. And maybe to the lexer, too.

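The suggestion in the first comment could look roughly like the sketch below. `PlainFormatter` and its `encoding` parameter are hypothetical names used only for illustration and do not reproduce the Pygments formatter API; the token list is made up as well. The idea is simply that the caller chooses the output encoding instead of the formatter assuming latin1.

```python
import io


class PlainFormatter:
    """Hypothetical formatter: writes token text as bytes in a caller-chosen encoding."""

    def __init__(self, encoding="utf-8"):
        # The output encoding is a parameter, not a hard-coded default such as latin1.
        self.encoding = encoding

    def format(self, tokens, outfile):
        # Each token is a (type, text) pair; only the text is written out.
        for _ttype, value in tokens:
            outfile.write(value.encode(self.encoding))


tokens = [("Keyword", "Процедура"), ("Text", " "), ("Name", "Тест")]

out = io.BytesIO()
PlainFormatter(encoding="cp1251").format(tokens, out)
result = out.getvalue()
```

Passing `encoding="latin-1"` with the same tokens raises `UnicodeEncodeError`, because Cyrillic characters have no latin-1 representation.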