gbrandl  committed 0fd2710

[svn] Support for encoding guessing.

  • Parent commits 0db6b22
  • Branches trunk

Files changed (5)

 -----------
 (released Nov XX, 2006)
 
+- Support for guessing input encoding added.
+
 - Encoding support added: all processing is now done with Unicode
-  strings, input and output are converted from and to byte strings.
+  strings, input and output are converted from and to byte strings
+  (see the ``encoding`` option of lexers and formatters).
 
 - Some improvements in the C(++) lexers handling comments and line
   continuations.
 for 0.6
 -------
 
-- document encodings
+- more setuptools entrypoints (html formatter etc.)
+  see paste script's Commands
 
-- guess encoding support?
+- pygmentize presets?
+
+- short cmdline options for common -O options
 
 - html formatter: full document, external css file?
 

File docs/src/formatters.txt

 Common options
 ==============
 
+All formatters support this option:
+
+`encoding`
+    If given, must be an encoding name (such as ``"utf-8"``). This will
+    be used to convert the token strings (which are Unicode strings)
+    to byte strings in the output (default: ``"latin1"``).
+    It will also be written in an encoding declaration suitable for the
+    document format if the `full` option is given (e.g. a ``meta
+    content-type`` directive in HTML or an invocation of the `inputenc`
+    package in LaTeX).
+
 The `HtmlFormatter` and `LatexFormatter` classes support these options:
 
 `style`

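To illustrate the formatter-side `encoding` option documented in this hunk, here is a minimal Python 3 sketch that does not use Pygments. The helper name `encode_output` and the sample token strings are hypothetical. It only shows Unicode token strings being converted to a byte string on output, with latin1 as the default.

```python
def encode_output(token_strings, encoding="latin1"):
    """Join Unicode token strings and encode them to a byte string.

    Hypothetical helper, not part of Pygments: it mirrors the documented
    behaviour that output is converted from Unicode to bytes using the
    requested encoding name.
    """
    return "".join(token_strings).encode(encoding)


tokens = ["print", " ", "'caf\u00e9'"]
latin1_bytes = encode_output(tokens)            # default encoding: latin1
utf8_bytes = encode_output(tokens, "utf-8")     # explicit encoding name
```

The same text produces different byte strings depending on the option, which is why a full document also needs a matching encoding declaration.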
File docs/src/lexers.txt

 `tabsize`
     If given and greater than 0, expand tabs in the input (default: ``0``).
 
+`encoding`
+    If given, must be an encoding name (such as ``"utf-8"``). This encoding
+    will be used to convert the input string to Unicode (if it is not already
+    a Unicode string). The default is ``"latin1"``.
+
+    If this option is set to ``"guess"``, a simple UTF-8 vs. Latin-1
+    detection is used; if it is set to ``"chardet"``, the
+    `chardet library <http://chardet.feedparser.org/>`__ is used to
+    guess the encoding of the input.
+
 
 These lexers are builtin and can be imported from
 `pygments.lexers`:

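The ``"guess"`` behaviour described in this hunk can be sketched in a few lines. This is a Python 3 approximation (the commit itself targets Python 2), and the function name `guess_decode` is an assumption made for the example. It tries UTF-8 first and falls back to Latin-1 on failure.

```python
def guess_decode(data: bytes) -> str:
    """Decode bytes as UTF-8 if they are valid UTF-8, otherwise as Latin-1.

    Hypothetical helper for illustration. The 'utf-8-sig' codec also
    strips a leading byte order mark when one is present.
    """
    try:
        return data.decode("utf-8-sig")
    except UnicodeDecodeError:
        # Latin-1 maps every byte value to a code point, so this
        # fallback cannot raise a decoding error.
        return data.decode("latin1")
```

Because Latin-1 accepts any byte sequence, the fallback always yields some string, but it may be the wrong one for input in other 8-bit encodings. That limitation is what the ``"chardet"`` setting is meant to address.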
File pygments/lexer.py

         If given, must be an encoding name. This encoding will be used to
         convert the input string to Unicode, if it is not already a Unicode
         string (default: 'latin1').
+        Can also be 'guess' to use a simple UTF-8 / Latin1 detection, or
+        'chardet' to use the chardet library, if it is installed.
     """
 
     #: Name of the lexer
         if isinstance(text, unicode):
             text = u'\n'.join(text.splitlines())
         else:
-            text = '\n'.join(text.splitlines()).decode(self.encoding)
+            text = '\n'.join(text.splitlines())
+            if self.encoding == 'guess':
+                try:
+                    text = text.decode('utf-8-sig')
+                except UnicodeDecodeError:
+                    text = text.decode('latin1')
+            elif self.encoding == 'chardet':
+                try:
+                    import chardet
+                except ImportError:
+                    raise ImportError('To enable chardet encoding guessing, please '
+                                      'install the chardet library from '
+                                      'http://chardet.feedparser.org/')
+                enc = chardet.detect(text)
+                # chardet reports None when it cannot decide; fall back to latin1
+                text = text.decode(enc['encoding'] or 'latin1')
+            else:
+                text = text.decode(self.encoding)
         if self.stripall:
             text = text.strip()
         elif self.stripnl:
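
The chardet branch of the hunk above can be sketched on its own as follows. This is a Python 3 approximation, not the code from the commit. The function name `decode_with_chardet` and its `fallback` parameter are assumptions for the example. `chardet.detect` returns a dict whose `'encoding'` entry may be `None` when no encoding is recognised, so the sketch falls back to a fixed encoding in that case.

```python
def decode_with_chardet(data: bytes, fallback: str = "latin1") -> str:
    """Decode bytes using the encoding that chardet detects.

    Hypothetical helper for illustration; chardet is an optional
    third-party library and is imported only when this is called.
    """
    try:
        import chardet
    except ImportError:
        raise ImportError("chardet encoding guessing requires the "
                          "chardet library to be installed")
    detected = chardet.detect(data)
    # 'encoding' is None when detection fails; using `fallback` here
    # is an assumption of this sketch.
    return data.decode(detected["encoding"] or fallback)
```

Importing chardet lazily keeps it an optional dependency: callers who never request detection do not need it installed.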