DOCX file with AlternateContent in document.xml doesn't merge back properly

Former user Account Deleted

Comment 1. originally posted by karlis.ged... on 2013-04-26T14:42:13.000Z:

This bug is the reason why the Okapi Framework cannot be used in the project I am working on. I would appreciate it if this were fixed.
These "mc:AlternateContent" tags are created when adding Text Boxes, Comments, and Equations, which are very common.

Comment 2. originally posted by @ysavourel on 2013-04-27T21:26:19.000Z:

Taking this.

Comment 3. originally posted by @ysavourel on 2013-05-06T04:29:46.000Z:

Yikes, the filter makes a mess of word/document.xml. I started comparing the original and roundtrip forms, but it's not immediately useful - like looking at a shuffled deck of cards. I will need to step through to see what happens when we try to parse this content.

changed status to open

Comment 4. originally posted by @ysavourel on 2013-05-06T04:30:06.000Z:

Comment 5. originally posted by @ysavourel on 2013-05-06T06:00:59.000Z:

From the spec:

17.17.3 Roundtripping Alternate Content
Office Open XML defines a mechanism for the storage of content which is not defined by ECMA-376, for
example extensions developed by future software applications which leverage the Office Open XML formats.
This mechanism allows for the storage of a series of alternative representations of content, of which the
consuming application should use the first alternative whose requirements are met.
<<<<

I had never encountered this before. Basically, the content (which may include translatable text) is encoded in two (or more?) different ways, one of which (the "fallback" method) follows standard OpenXML, and the others of which rely on some sort of extension. (In the attached example, the extension method is called "wps", which is some sort of art format, possibly related to MS Works.)
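The selection rule from the spec excerpt (use the first alternative whose requirements are met, otherwise the fallback) can be sketched roughly as follows. This is not Okapi code: the sample markup, the `select_branch` helper, and the idea of passing a plain set of understood prefixes are assumptions made for illustration. A real consumer would resolve each prefix in `Requires` to its namespace URI before deciding.

```python
import xml.etree.ElementTree as ET

MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Hypothetical sample, not taken from the attached file.
SAMPLE = (
    '<mc:AlternateContent xmlns:mc="%s">'
    '<mc:Choice Requires="wps"><extended/></mc:Choice>'
    '<mc:Fallback><standard/></mc:Fallback>'
    '</mc:AlternateContent>' % MC_NS
)


def select_branch(alternate_content, understood):
    """Return the first mc:Choice whose Requires prefixes are all in
    `understood`, or the mc:Fallback element (None if there is none)."""
    for child in alternate_content:
        if child.tag == "{%s}Choice" % MC_NS:
            # Requires holds a whitespace-separated list of prefixes.
            required = child.get("Requires", "").split()
            if required and all(prefix in understood for prefix in required):
                return child
    return alternate_content.find("{%s}Fallback" % MC_NS)


root = ET.fromstring(SAMPLE)
print(select_branch(root, {"wps"})[0].tag)  # extended
print(select_branch(root, set())[0].tag)    # standard
```

A consumer that understands "wps" ends up in the extended branch, and one that does not ends up in the fallback, which is why the same text can live in both places.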

The filter can't expect to understand all the extensions. However, where possible it should still be able to extract translatable text from the extended markup if the text is encoded in a recognizable way. In the attached example, the text is still encoded as a regular <w:p> tag, so we could conceivably parse it out.

What is a little strange is that for localization I think the correct behavior is to extract the translatable text from all the possibilities in the <AlternateContent> section. This means that we may expose the same text multiple times redundantly, but if we don't do it that way, the target file may be inconsistently translated depending on how it is processed.
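To make the redundancy concrete, here is a minimal sketch of pulling paragraph text out of every branch of each <AlternateContent> block. It is not the filter's implementation: the sample markup and the `paragraphs_per_branch` helper are hypothetical, and nested <AlternateContent> blocks are not handled specially (their paragraphs would be reported more than once).

```python
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC_NS = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS_NS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"

# Hypothetical, heavily trimmed text box markup (not the attached file).
SAMPLE = """<w:body xmlns:w="{w}" xmlns:mc="{mc}" xmlns:wps="{wps}">
  <mc:AlternateContent>
    <mc:Choice Requires="wps">
      <wps:txbx><w:txbxContent>
        <w:p><w:r><w:t>Hello </w:t></w:r><w:r><w:t>box</w:t></w:r></w:p>
      </w:txbxContent></wps:txbx>
    </mc:Choice>
    <mc:Fallback>
      <w:pict><w:txbxContent>
        <w:p><w:r><w:t>Hello box</w:t></w:r></w:p>
      </w:txbxContent></w:pict>
    </mc:Fallback>
  </mc:AlternateContent>
</w:body>""".format(w=W_NS, mc=MC_NS, wps=WPS_NS)


def paragraphs_per_branch(xml_text):
    """List (branch name, Requires value, paragraph text) for every w:p
    found inside any branch of any mc:AlternateContent element."""
    root = ET.fromstring(xml_text)
    found = []
    for alt in root.iter("{%s}AlternateContent" % MC_NS):
        for branch in alt:  # mc:Choice elements, then mc:Fallback
            name = branch.tag.split("}")[1]
            requires = branch.get("Requires")  # None on mc:Fallback
            for para in branch.iter("{%s}p" % W_NS):
                # Join the text of all w:t runs in this paragraph.
                text = "".join(t.text or "" for t in para.iter("{%s}t" % W_NS))
                found.append((name, requires, text))
    return found


for entry in paragraphs_per_branch(SAMPLE):
    print(entry)
```

With this sample the same sentence is reported once for the Choice branch and once for the Fallback branch, which is the duplicate exposure described above.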

Also, since the <AlternateContent> options may contain markup that the filter doesn't always completely handle (e.g., WordArt), the first step here is just to stop the filter from mangling the file.

changed status to resolved

Comment 6. originally posted by @ysavourel on 2013-06-28T19:59:08.000Z:

The filter now extracts text from mc:Fallback and mc:Choice Requires="wps". It also handles WordArt, TextArt, and Watermarks.

Comment 7. originally posted by @ysavourel on 2013-06-28T20:11:38.000Z:

Thanks Dan!

Comment 8. originally posted by @ysavourel on 2013-06-28T20:41:40.000Z:

It looks like some of the integration tests fail:
Failed tests:
TikalTest.testExtractMergeDOCX:201 File different from gold
TikalTest.testExtractSegmentMergeDOCX:214 File different from gold
Maybe some gold files are not up to date with the new extraction.
These tests are in the okapi-applications-integration-tests project.
