Issue #822 resolved

Problem with BOM when file is encoded in UTF-8

Andy Li
created an issue

The following demo shows the BOM byte, and it is highlighted as an error: http://pygments.org/demo/59241/

The file is also attached here.

Comments (6)

  1. Eric Knibbe

    In the Lasso lexer, I've gotten around this by having get_tokens_unprocessed strip the BOM out beforehand (though that change hasn't been pulled in yet).

    text = text.lstrip(u'\xef\xbb\xbf\ufeff')
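
    In context, a minimal sketch of that override (using a hypothetical lexer with a placeholder rule, not the actual Lasso rules):

    from pygments.lexer import RegexLexer
    from pygments.token import Text

    class BOMStrippingLexer(RegexLexer):
        name = 'BOMStripping'
        tokens = {
            'root': [(r'.+\n?', Text)],  # placeholder rule
        }

        def get_tokens_unprocessed(self, text):
            # Drop a leading UTF-8 BOM, whether it survived decoding as
            # the raw bytes \xef\xbb\xbf or as a single U+FEFF character.
            text = text.lstrip(u'\xef\xbb\xbf\ufeff')
            return RegexLexer.get_tokens_unprocessed(self, text)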

  2. Andy Li reporter

    Thanks for the info!

    However, that doesn't sound like the "right" place for the fix. The BOM should be removed before the text is passed to the lexer, but get_tokens_unprocessed is called after the lexer has tokenized the input. It is also a bad idea to duplicate the fix in every lexer class.
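
    For example (just a sketch, with a made-up file name), the caller could decode with the 'utf-8-sig' codec, which drops a leading BOM automatically, before handing the text to any lexer:

    from pygments import highlight
    from pygments.lexers import PythonLexer
    from pygments.formatters import HtmlFormatter

    # 'utf-8-sig' decodes UTF-8 and silently strips a leading BOM.
    with open('example.py', 'rb') as f:
        text = f.read().decode('utf-8-sig')

    print(highlight(text, PythonLexer(), HtmlFormatter()))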

  3. Eric Knibbe

    I know it's a hack; ordinarily I'd have ignored the issue, except that BOM-prefixed Lasso files are common, since earlier versions of the language required the BOM to read a file as UTF-8. Hopefully this gets fixed upstream so I can take that line out.
