Issue #964 resolved

Cyrillic + utf8 + rtf = bug

Anonymous created an issue

Steps to reproduce:

  1. Create a utf8 file named 'test.php' and put this inside:
<?php
// определить сумму для каждой строки
  2. Run 'pygmentize -l php -f rtf -O style=encoding=utf-8 test.php > result.rtf'
  3. Open the file in an RTF viewer (I've tested in TextEdit under MacOS)

Expected result:

You see the text you've provided.

Actual result:

<?php
// о¾п¿р€еµд´еµл»и¸т‚ьŒ суƒм¼м¼уƒ д´л»я кºа°ж¶д´о¾й¹ ст‚р€о¾кºи¸

As you can see, the result contains one extra garbage character for each correct one.

Comments (7)

  1. Andrew Pinkham

    I can replicate the error described here with:

    pygmentize -l text -f rtf -o result.rtf -O encoding=utf-8 test.php

    There are two problems with the RTF formatter. Pull request #321 fixes the first.

    Quick review: Starting in RTF version 1.5, there are three ways to display a character glyph:

    1. in ASCII (7 bits)
    2. using code points for a code page, of the form \'xx, where xx is hex (this is an 8 bit extension of ASCII)
    3. Unicode, via the escape sequence \ud{\uNA}, where N is an integer and A is an ASCII character or a code point that gives legacy programs a fallback

    The first problem stems from a misuse of Item 3. The A in the \ud{\uNA} being output is neither an ASCII character nor a code point. For example, given €оп, the formatter outputs:

    \ud{\u8364\'a4}\ud{\u1086о}\ud{\u1087п}

    Observe how non-ASCII characters are used as fallbacks for the second two glyphs. The patch corrects the output to the following, using '?' as a fallback.

    \ud{\u8364\'a4}\ud{\u1086?}\ud{\u1087?}

    Problem 1 solved.
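
For readers following along, the fallback rule can be sketched in a few lines of Python. This is only an illustration of the idea, not the code from the pull request: the function name `escape_with_fallback` and its `fallback` parameter are made up here, every non-ASCII character gets '?' rather than a code page fallback such as \'a4, and code points above U+7FFF are rejected because they need the 16-bit handling that comes up later in the thread.

```python
def escape_with_fallback(text, fallback="?"):
    """Escape text for RTF, pairing every \\uN with an ASCII fallback.

    Hypothetical helper for illustration; not part of Pygments.
    """
    if len(fallback) != 1 or ord(fallback) > 127:
        raise ValueError("fallback must be a single ASCII character")
    pieces = []
    for ch in text:
        code_point = ord(ch)
        if ch in "\\{}":
            # Backslash and braces are RTF syntax and must be escaped.
            pieces.append("\\" + ch)
        elif code_point < 128:
            # Plain 7-bit ASCII passes through unchanged.
            pieces.append(ch)
        elif code_point < 0x8000:
            # Unicode escape followed by the ASCII fallback character.
            pieces.append("\\ud{\\u%d%s}" % (code_point, fallback))
        else:
            raise ValueError("code points above U+7FFF are not handled here")
    return "".join(pieces)


print(escape_with_fallback("оп"))  # prints \ud{\u1086?}\ud{\u1087?}
```

The point of the check on `fallback` is exactly the bug described above: whatever follows \uN has to be something a legacy reader can display on its own.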

    Problem 2 has to do with encoding and code pages. Because of the definition of RTF, the encoding and outencoding options are somewhat meaningless. However, an RTF file should specify a code page, and files output by Pygments do not. Editors that understand Unicode in RTF (as of 1.5, ratified in 1997) will not need this, but at that point, \ud could be removed in favor of just \u.

    I will attach two patches, at the discretion of the core maintainers.

    The first patch maintains legacy compatibility by using the code page iso-8859-15, used because of its similarity to windows-1252. It should be very easy to modify this patch to use any other code page, and I am happy to do so upon request.

    The second patch does not maintain legacy compatibility, and switches to using only Unicode characters without fallback.

    Both patches work in Python 2 and 3, and effectively support all the unicode characters I tested, including surrogate pairs for glyphs outside of the 16-bit limit imposed by the format. (鼖, or u'\U0002fa1b', is correctly printed, for example).

    In both cases, encoding and outencoding are ignored by the RTF formatter.

    Note that it is possible to have the encoding and outencoding be used to specify the code page of the file. The problem with this approach stems from the RTF header, because each encoding is mapped to a magic number. For example, using iso-8859-15 results in the header \rtf1\ansi\ansicpg28605. The 28605 number refers to the code page. Using encoding and outencoding would thus necessitate building a table that maps encoding names to RTF code page numbers. Given the central use of UTF-8, and the fact that unicode support has existed since 1997, I do not see the benefit of doing this.
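
To make the cost of that approach concrete, here is a rough sketch of what such a lookup table might look like. Everything in it is hypothetical: the names `RTF_CODE_PAGES` and `rtf_header_for` are invented for illustration, and the table has only two entries (28605 is the number quoted above, 1252 is the Windows-1252 code page). A real table would need a row for every encoding the formatter is expected to accept.

```python
import codecs

# Hypothetical, deliberately tiny table: encoding name -> RTF code page number.
_KNOWN_CODE_PAGES = {
    "iso-8859-15": 28605,   # the number quoted in the comment above
    "windows-1252": 1252,
}

# Key the table by Python's canonical codec name so aliases resolve the same way.
RTF_CODE_PAGES = {
    codecs.lookup(name).name: number
    for name, number in _KNOWN_CODE_PAGES.items()
}


def rtf_header_for(encoding):
    """Return an RTF header prefix announcing the code page for `encoding`."""
    canonical = codecs.lookup(encoding).name  # LookupError if Python lacks the codec
    try:
        code_page = RTF_CODE_PAGES[canonical]
    except KeyError:
        raise ValueError("no RTF code page known for %r" % encoding)
    return "\\rtf1\\ansi\\ansicpg%d" % code_page


print(rtf_header_for("iso-8859-15"))  # prints \rtf1\ansi\ansicpg28605
```

Any encoding missing from the table fails with an error, which is the maintenance burden being argued against here.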

    Aside: Hysterically, RTF 1.5 defines a code page for UTF-8 with surrogates. It is possible to thus use \ud with the intention of supporting legacy RTF standards, without actually doing so.

  2. Andrew Pinkham

    I am still unable to attach the patches to this issue, and I do not expect BitBucket to fix the problem anytime soon, as the two issues about the problem have been marked invalid (see link above).

    I've uploaded the files to my own server. If someone would download them and attach them to the issue, that would be very helpful.

    (links redacted - see pull request issued below)

  3. Tim Hatch

    Thanks! I prefer the utf-8 version; from looking at the date on the specs from Wikipedia, UTF-8 support has been in there for a long time. Andrew, I assume most of your testing has been on Mac's Preview?

    Please propose a pull request containing utf8.patch along with some sort of test... how expansive you want that to be is up to you, but I'd suggest that one that simply generates a .rtf file in a temp dir is enough; if it fails to generate, we've learned something, and it's an artifact that a human can examine in viewers to verify.

  4. Andrew Pinkham

    Really glad you asked for tests. Turns out the patch above isn't quite as correct as I'd thought. I'd forgotten to account for the fact that RTF limits Unicode escape sequences to 16 bits, meaning that the 32-bit numbers output in Python 3 were being truncated by TextEdit and LibreOffice, leading to incorrect results.
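
As a rough illustration of the 16-bit limit (not the code in the pull request), one way to stay within it is to split each character into UTF-16 code units and emit one \uN per unit, writing N as a signed 16-bit integer as RTF expects. The helper name `rtf_unicode_escape` is hypothetical, and the sketch assumes each \uN is followed by one ASCII fallback character. It escapes every character, ASCII included, to keep the example short.

```python
import struct


def rtf_unicode_escape(text, fallback="?"):
    """Emit one \\uN per UTF-16 code unit, with N as a signed 16-bit integer.

    Hypothetical helper for illustration; not part of Pygments.
    """
    # UTF-16 turns characters above U+FFFF into a surrogate pair (two units).
    data = text.encode("utf-16-le")
    # "h" unpacks each 2-byte unit as a signed short, so units above 32767
    # come out negative, which is how RTF writes them.
    units = struct.unpack("<%dh" % (len(data) // 2), data)
    return "".join("\\u%d%s" % (unit, fallback) for unit in units)


print(rtf_unicode_escape("о"))           # one escape: \u1086?
print(rtf_unicode_escape("\U0002fa1b"))  # two escapes, one per surrogate
```

Letting the UTF-16 codec do the splitting avoids hand-written surrogate arithmetic and behaves the same in Python 2 and 3 for text input.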

    Please see pull request #338 for the update to the patch, as well as a slew of tests for the issue.
