String containing unicode-like data gets encoded

Issue #726 new
jgbishop created an issue

I have the following literal string I'm trying to write to a cell in a spreadsheet:

SW_x3850_CPU

The _x3850_ portion of the string gets misinterpreted as a Unicode character, specifically what looks to me like a Chinese character. I tried using the set_explicit_value call, but the same issue occurs.
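
To illustrate what the reading application appears to be doing, here is a minimal sketch that uses only the Python standard library. The name `naive_unescape` is hypothetical and is not part of openpyxl or Excel; the sketch only assumes that a reader treats every `_xHHHH_` sequence as an escaped code point.

```python
import re

# Matches the escape form _xHHHH_, where HHHH is four hexadecimal digits.
ESCAPE_RE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def naive_unescape(text):
    """Replace every _xHHHH_ sequence with the character it names (hypothetical helper)."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


decoded = naive_unescape("SW_x3850_CPU")

# U+3850 lies in the CJK Unified Ideographs Extension A block,
# which is why the cell shows what looks like a Chinese character.
print(decoded == "SW\u3850CPU")  # True
```

Under that assumption the seven characters `_x3850_` collapse into a single character, so the literal text cannot survive unless the writer escapes it.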

Comments (3)

  1. CharlieC

    I think this is actually a bug in Excel, which works around it by encoding the first underscore so that the sequence isn't treated as an escaped value. OpenOffice and LibreOffice certainly don't have that problem; note that they also don't strip the escaping when reading the file.

    We do have some code that strips this unnecessary encoding when reading files, so I guess we could look at adding the escaping when saving, though I think I'll check with the OOXML WG on this.

    XML in Excel: <t>SW_x005F_x3850_CPU</t>
    XML in openpyxl: <t>SW_x3850_CPU</t>

  2. CharlieC

    Good news: this was passed on to the OOXML Working Group and the specification will be revised to cover this kind of case.

    Would be nice to get a PR (based on 2.4) that can correctly encode and decode this.

    Part 1: §22.9.2.19, “ST_Xstring (Escaped String)”: String of characters with support for escaped invalid-XML characters. For all characters which cannot be represented in XML as defined by the XML 1.0 specification, the characters are escaped using the Unicode numerical character representation escape character format _xHHHH_, where H represents a hexadecimal character in the character's value. [Example: The Unicode character 8 is not permitted in an XML 1.0 document, so it must be escaped as _x0008_. end example]

    For each string matching the escape character format _xHHHH_, the first underscore character shall itself be escaped. [Example: In order for the string “SW_x3850_CPU” to be interpreted literally, it would be expressed as “SW_x005f_x3850_CPU” or “SW_x005F_x3850_CPU”. end example]
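
The escaping rule quoted above could be sketched roughly as follows. This is not openpyxl's implementation: `escape_literal` and `unescape` are hypothetical names, only the standard library is used, and the sketch covers only the underscore rule (escaping of characters that are invalid in XML 1.0 is left out).

```python
import re

# An underscore that starts something shaped like _xHHHH_.
# The lookahead also catches overlapping cases such as "_x1234_x5678_".
LITERAL_RE = re.compile(r"_(?=x[0-9A-Fa-f]{4}_)")

# A complete _xHHHH_ escape, capturing the four hex digits.
ESCAPE_RE = re.compile(r"_x([0-9A-Fa-f]{4})_")


def escape_literal(text):
    """Escape the leading underscore of each literal _xHHHH_ as _x005F_ (hypothetical helper)."""
    return LITERAL_RE.sub("_x005F_", text)


def unescape(text):
    """Decode _xHHHH_ escapes in one left-to-right pass (hypothetical helper)."""
    return ESCAPE_RE.sub(lambda m: chr(int(m.group(1), 16)), text)


written = escape_literal("SW_x3850_CPU")
print(written)                              # SW_x005F_x3850_CPU
print(unescape(written) == "SW_x3850_CPU")  # True
```

Because `re.sub` never rescans text it has already replaced, the underscore produced by decoding `_x005F_` is not combined with the following `x3850_` into a new escape, so the literal string round-trips.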
