Commits

Georg Brandl committed 6747cfb

Closes #692: when using chardet mode, look for a BOM first and default to "replace" error handling.

  • Parent commits c728eb7

Files changed (1)

File pygments/lexer.py

            'LexerContext', 'include', 'bygroups', 'using', 'this']
 
 
+_encoding_map = [('\xef\xbb\xbf', 'utf-8'),
+                 ('\xff\xfe\0\0', 'utf-32'),
+                 ('\0\0\xfe\xff', 'utf-32be'),
+                 ('\xff\xfe', 'utf-16'),
+                 ('\xfe\xff', 'utf-16be')]
+
 _default_analyse = staticmethod(lambda x: 0.0)
 
 
                     raise ImportError('To enable chardet encoding guessing, '
                                       'please install the chardet library '
                                       'from http://chardet.feedparser.org/')
-                enc = chardet.detect(text)
-                text = text.decode(enc['encoding'])
+                # check for BOM first
+                decoded = None
+                for bom, encoding in _encoding_map:
+                    if text.startswith(bom):
+                        decoded = unicode(text[len(bom):], encoding,
+                                          errors='replace')
+                        break
+                # no BOM found, so use chardet
+                if decoded is None:
+                    enc = chardet.detect(text[:1024]) # Guess using first 1KB
+                    decoded = unicode(text, enc.get('encoding') or 'utf-8',
+                                      errors='replace')
+                text = decoded
             else:
                 text = text.decode(self.encoding)
         # text now *is* a unicode string
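
The BOM-first strategy in this change can be sketched as a standalone function. This is a minimal Python 3 sketch, not the code in pygments/lexer.py: the names `_BOM_MAP`, `decode_with_bom`, `guess`, and `fallback` are hypothetical, and `guess` stands in for a wrapper around `chardet.detect` so the example runs without chardet installed. Unlike the diff, it uses explicit little-endian and big-endian codec names after stripping the BOM, so the result should not depend on the platform's native byte order.

```python
import codecs

# (BOM, codec) pairs. The UTF-32 entries come before the UTF-16 ones because
# the UTF-32-LE BOM starts with the same two bytes as the UTF-16-LE BOM.
_BOM_MAP = [
    (codecs.BOM_UTF8, 'utf-8'),
    (codecs.BOM_UTF32_LE, 'utf-32-le'),
    (codecs.BOM_UTF32_BE, 'utf-32-be'),
    (codecs.BOM_UTF16_LE, 'utf-16-le'),
    (codecs.BOM_UTF16_BE, 'utf-16-be'),
]


def decode_with_bom(data, guess=None, fallback='utf-8'):
    """Decode bytes, preferring a BOM over a guessed encoding.

    `guess` is an optional callable (hypothetical) that takes a bytes
    sample and returns an encoding name or None.
    """
    # Check for a BOM first; strip it and decode with the matching codec.
    for bom, encoding in _BOM_MAP:
        if data.startswith(bom):
            return data[len(bom):].decode(encoding, errors='replace')
    # No BOM found: ask the guesser about the first 1 KB, as the diff does.
    encoding = guess(data[:1024]) if guess is not None else None
    # Undecodable bytes become U+FFFD instead of raising UnicodeDecodeError.
    return data.decode(encoding or fallback, errors='replace')
```

A caller that has chardet available could pass something like `lambda sample: chardet.detect(sample).get('encoding')` as `guess`; with no guesser the function falls back to UTF-8 with replacement.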