OpenXML: Word glossary entries not processed

Issue #514 new
Sebastian Ebert created an issue

Word glossaries/indexes are not processed correctly. Rainbow extracts the glossary text for translation, but does not update the "XE" entries in the target file afterwards. As soon as one presses F9 to update the glossary/index, it is rebuilt from the original (untranslated) words/phrases. Please see the sample file and screenshots for further explanation.

Comments (2)

  1. Chase Tingley

    This behavior has changed in m29-snapshot, but is still not correct.

    The issue is how indexes are constructed in Word. In the source docx, we have stuff like this:

        <w:r w:rsidRPr="00E77FAC">
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:t>This phrase</w:t>
        </w:r>
        <w:r w:rsidRPr="00E77FAC">
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:fldChar w:fldCharType="begin"/>
        </w:r>
        <w:r w:rsidRPr="00E77FAC">
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:instrText xml:space="preserve"> XE "phrase" </w:instrText>
        </w:r>
        <w:r w:rsidRPr="00E77FAC">
          <w:rPr>
            <w:noProof/>
          </w:rPr>
          <w:fldChar w:fldCharType="end"/>
        </w:r>
    

    The fldChar elements delimit field codes, which are basically macros. Each index entry has one (delivering the "XE" instruction), and then at the end of the document there's another big one that contains the field that defines the index and caches its current value.
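
To make the begin/end structure concrete, here is a minimal Python sketch (not the filter's actual code) that walks one paragraph and collects the instruction text of each field. The function name `field_instructions` and the `SAMPLE` paragraph are made up for illustration; the sample is a trimmed copy of the snippet above.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def field_instructions(paragraph):
    """Return the instruction text of each complex field in a w:p element.

    Hypothetical helper for illustration only.
    """
    instructions = []
    stack = []  # one [parts, in_instruction] entry per currently open field
    for node in paragraph.iter():
        if node.tag == f"{{{W}}}fldChar":
            kind = node.get(f"{{{W}}}fldCharType")
            if kind == "begin":
                stack.append([[], True])
            elif kind == "separate" and stack:
                # Everything after "separate" is the cached result, not code.
                stack[-1][1] = False
            elif kind == "end" and stack:
                parts, _ = stack.pop()
                instructions.append("".join(parts).strip())
        elif node.tag == f"{{{W}}}instrText" and stack and stack[-1][1]:
            stack[-1][0].append(node.text or "")
    return instructions


SAMPLE = f"""<w:p xmlns:w="{W}">
  <w:r><w:t>This phrase</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> XE "phrase" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

print(field_instructions(ET.fromstring(SAMPLE)))
```

The XE field has no "separate" marker, so everything between "begin" and "end" is instruction. For the index field at the end of the document, the runs after "separate" hold the cached index text.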

    In M28, our parsing of these is pretty bad. We extract text from inside field codes in some cases, and that text is just cached values that are overwritten when you refresh the document, as you mentioned. That stuff shouldn't be translatable since it's not real data.

    In M29-snapshot, we've fixed the field code parsing, so the values in the cached index aren't exposed for translation. However, that's only half the fix. The XE instruction itself is still not exposed for translation, so even if you translate "phrase" to "phrasezzzz", the XE code is unchanged:

          <w:r>
            <w:t xml:space="preserve">This phrasezzzz</w:t>
          </w:r>
          <w:r>
            <w:rPr/>
            <w:fldChar w:fldCharType="begin"/>
          </w:r>
          <w:r w:rsidRPr="00E77FAC">
            <w:rPr>
              <w:noProof/>
            </w:rPr>
            <w:instrText xml:space="preserve"> XE "phrase" </w:instrText>
          </w:r>
    

    This means that the index won't rebuild correctly to pick up the change.

    We probably need to extract the field argument for XE codes for translation. This is ugly, because the full field code value is a nasty macro syntax (see section 17.16.5.72 in the ECMA reference). For example:

        XE "behavior:implementation-defined" \b
    

    This is actually an index entry "behavior" with a sub-entry "implementation-defined", as well as a formatting code ("\b"). Correct handling of this would expose two separate text units, "behavior" and "implementation-defined", hide the "XE" and "\b" portions, and then reassemble everything on merge.
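
The split-and-reassemble idea described above could look roughly like the following Python sketch. It is only an illustration under simplifying assumptions: `split_xe` and `join_xe` are hypothetical names, the entry is assumed to be a single double-quoted argument right after "XE", and escaped colons, unquoted arguments, and switches that carry their own text are not handled.

```python
import re

# prefix (XE plus opening quote), quoted entry text, rest (closing quote + switches)
XE_PATTERN = re.compile(r'^(\s*XE\s+")((?:[^"\\]|\\.)*)(".*)$', re.DOTALL)


def split_xe(instruction):
    """Split an XE instruction into (prefix, entry levels, suffix).

    Returns None if the instruction is not a quoted XE field.
    The levels are the pieces that would become separate text units.
    """
    match = XE_PATTERN.match(instruction)
    if match is None:
        return None
    prefix, entry, suffix = match.groups()
    return prefix, entry.split(":"), suffix


def join_xe(prefix, levels, suffix):
    """Reassemble the instruction from (possibly translated) levels."""
    return prefix + ":".join(levels) + suffix


prefix, levels, suffix = split_xe(r'XE "behavior:implementation-defined" \b')
print(levels)

# Stand-in for translation: append "zzzz" to every level, then merge.
translated = [level + "zzzz" for level in levels]
print(join_xe(prefix, translated, suffix))
```

Keeping the prefix and suffix as opaque strings means the "XE" keyword and the "\b" switch are written back exactly as they were read, and only the entry levels pass through translation.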
