OpenXML: Word glossary entries not processed

Chase Tingley

This behavior has changed in m29-snapshot, but is still not correct.

The issue is how indexes are constructed in Word. In the source docx, we have stuff like this:

      <w:r w:rsidRPr="00E77FAC">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:t>This phrase</w:t>
      </w:r>
      <w:r w:rsidRPr="00E77FAC">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r w:rsidRPr="00E77FAC">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:instrText xml:space="preserve"> XE "phrase" </w:instrText>
      </w:r>
      <w:r w:rsidRPr="00E77FAC">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:fldChar w:fldCharType="end"/>
      </w:r>

The fldChar runs delimit field codes, which are basically macros. Each index entry has one (delivering the "XE" instruction), and then at the end of the document there's another big one that contains the field that defines the index and caches its current value.
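
To make that structure concrete, here's a rough Python sketch (not our filter code, just an illustration) that walks the runs and pairs each field's instruction with whatever cached result it carries. It assumes the standard WordprocessingML namespace and that a cached result, when present, sits between a `separate` and an `end` fldChar; the `collect_fields` and `w` helpers and the sample paragraph are made up for the example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

def w(name):
    """Qualify a local name with the WordprocessingML namespace."""
    return f"{{{W}}}{name}"

def collect_fields(xml_text):
    """Return (instruction, cached_result) for each field, in the order the fields close."""
    fields = []
    open_fields = []  # innermost field last, so nested fields are tolerated
    for run in ET.fromstring(xml_text).iter(w("r")):
        for child in run:
            if child.tag == w("fldChar"):
                kind = child.get(w("fldCharType"))
                if kind == "begin":
                    open_fields.append({"instr": [], "result": [], "in_result": False})
                elif kind == "separate" and open_fields:
                    # everything up to "end" is the cached (regenerated) value
                    open_fields[-1]["in_result"] = True
                elif kind == "end" and open_fields:
                    done = open_fields.pop()
                    fields.append(("".join(done["instr"]).strip(),
                                   "".join(done["result"])))
            elif child.tag == w("instrText") and open_fields:
                open_fields[-1]["instr"].append(child.text or "")
            elif child.tag == w("t") and open_fields and open_fields[-1]["in_result"]:
                open_fields[-1]["result"].append(child.text or "")
    return fields

# Hypothetical sample: one XE entry followed by a tiny INDEX field with a cached value.
SAMPLE = """<w:p xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main">
  <w:r><w:t>This phrase</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> XE "phrase" </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
  <w:r><w:fldChar w:fldCharType="begin"/></w:r>
  <w:r><w:instrText xml:space="preserve"> INDEX </w:instrText></w:r>
  <w:r><w:fldChar w:fldCharType="separate"/></w:r>
  <w:r><w:t>phrase, 1</w:t></w:r>
  <w:r><w:fldChar w:fldCharType="end"/></w:r>
</w:p>"""

for instruction, cached in collect_fields(SAMPLE):
    print(repr(instruction), repr(cached))
```

The XE field comes back with an empty cached result, while the INDEX field carries the cached text that Word regenerates on refresh.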

In M28, our parsing of these is pretty bad. We extract text from inside field codes in some cases, and that text is just cached values that are overwritten when you refresh the document, as you mentioned. That stuff shouldn't be translatable since it's not real data.

In M29-snapshot, we've fixed the field code parsing, so the values in the cached index aren't exposed for translation. However, that's only half the fix. The XE instruction is still not exposed for translation, so even if you translate "phrase" to "phrasezzzz", the XE code is unchanged:

      <w:r>
        <w:t xml:space="preserve">This phrasezzzz</w:t>
      </w:r>
      <w:r>
        <w:rPr/>
        <w:fldChar w:fldCharType="begin"/>
      </w:r>
      <w:r w:rsidRPr="00E77FAC">
        <w:rPr>
          <w:noProof/>
        </w:rPr>
        <w:instrText xml:space="preserve"> XE "phrase" </w:instrText>
      </w:r>

This means that the index won't rebuild correctly to pick up the change.

We probably need to extract the field argument of XE codes for translation. This is ugly, because the full field code value uses a nasty macro syntax (see section 17.16.5.72 in the ECMA-376 reference). For example:

    XE "behavior:implementation-defined" \b

This is actually an index entry "behavior" with a sub-entry "implementation-defined", as well as a formatting code ("\b"). Correct handling of this would expose two separate text units, "behavior" and "implementation-defined", hide the "XE" and "\b" portions, and then reassemble everything on merge.
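
As a rough idea of what the split-and-reassemble step could look like, here's a minimal Python sketch. It is only an illustration under simplifying assumptions: the instruction has a single quoted argument, entry levels are separated by plain colons, and there are no escaped quotes or colons inside the entry text. The `parse_xe` and `build_xe` names are hypothetical, not existing filter methods.

```python
import re

# XE, one quoted argument, then whatever switches follow (e.g. \b).
XE_PATTERN = re.compile(r'^\s*XE\s+"([^"]*)"(.*)$', re.DOTALL)

def parse_xe(instruction):
    """Split an XE instruction into (entry levels, trailing switches)."""
    match = XE_PATTERN.match(instruction)
    if match is None:
        raise ValueError(f"not a simple XE instruction: {instruction!r}")
    levels = match.group(1).split(":")   # each level would become a text unit
    switches = match.group(2).strip()    # kept hidden, restored on merge
    return levels, switches

def build_xe(levels, switches=""):
    """Reassemble an instrText value from (translated) levels and the saved switches."""
    instruction = ' XE "' + ":".join(levels) + '" '
    if switches:
        instruction += switches + " "
    return instruction

levels, switches = parse_xe(r'XE "behavior:implementation-defined" \b')
print(levels, switches)

# Pretend translation: append "zzzz" to each exposed level, then merge.
translated = [level + "zzzz" for level in levels]
print(repr(build_xe(translated, switches)))
```

A real implementation would also have to cope with escaped characters and with translations that themselves contain a colon or a quote, which this sketch deliberately ignores.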

2015-12-11T22:25:41+00:00

Comments (2)

Jim Hargrave (OLD)
@Chase Tingley Has this been fixed? I added the attached file to the integration tests.
- 2021-03-10T22:29:08+00:00