DOCX file with AlternateContent in document.xml doesn't merge back properly
Original issue 166 created by @ysavourel on 2011-02-28T12:13:48.000Z:
The attached example file contains an mc:AlternateContent element that seems to cause a processing issue. The output file does not open in Word.
(This bug is forwarded from http://sourceforge.net/tracker/?func=detail&atid=434659&aid=3194947&group_id=42949)
The reported issue occurs when trying to do a simple round-trip.
Comments (8)
-
Account Deleted Comment 2. originally posted by @ysavourel on 2013-04-27T21:26:19.000Z:
Taking this.
-
Account Deleted Comment 3. originally posted by @ysavourel on 2013-05-06T04:29:46.000Z:
Yikes, the filter makes a mess of word/document.xml. I started comparing the original and round-trip forms, but it's not immediately useful - like looking at a shuffled deck of cards. I will need to step through to see what happens when we try to parse this content.
-
Account Deleted - changed status to open
Comment 4. originally posted by @ysavourel on 2013-05-06T04:30:06.000Z:
-
Account Deleted Comment 5. originally posted by @ysavourel on 2013-05-06T06:00:59.000Z:
From the spec:
17.17.3 Roundtripping Alternate Content
Office Open XML defines a mechanism for the storage of content which is not defined by ECMA-376, for
example extensions developed by future software applications which leverage the Office Open XML formats.
This mechanism allows for the storage of a series of alternative representations of content, of which the
consuming application should use the first alternative whose requirements are met.
I had never encountered this before. Basically, the content (which may include translatable text) is encoded in two (or more?) different ways, one of which (the "fallback" method) follows standard OpenXML, and the others of which rely on some sort of extension. (In the attached example, the extension method is called "wps", which is some sort of art format, possibly related to MS Works.)
The filter can't expect to understand all the extensions. However, where possible it should still be able to extract translatable text from the extended markup if the text is encoded in a recognizable way. In the attached example, the text is still encoded as a regular <w:p> tag, so we could conceivably parse it out.
What is a little strange is that for localization I think the correct behavior is to extract the translatable text from all the possibilities in the <AlternateContent> section. This means that we may expose the same text multiple times redundantly, but if we don't do it that way, the target file may be inconsistently translated depending on how it is processed.
Also, since the <AlternateContent> options may contain markup that the filter doesn't always completely handle (eg, WordArt), the first step here is just getting it so that the filter doesn't mangle the file.
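To make the "extract from every alternative" idea concrete, here is a minimal sketch in Python (not the Okapi filter's actual Java code). It assumes a heavily simplified document.xml fragment: the real markup nests the text box much deeper inside drawing elements, and the `SAMPLE` string and the `extract_alternate_text` helper are made up for illustration. It only shows that the same `<w:p>` text can be pulled from both the `mc:Choice` and the `mc:Fallback` branch.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"

# Hypothetical, simplified fragment; a real file has more wrapper elements.
SAMPLE = """<w:document
    xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"
    xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006"
    xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape">
  <w:body>
    <w:p><w:r>
      <mc:AlternateContent>
        <mc:Choice Requires="wps">
          <wps:txbx><w:txbxContent>
            <w:p><w:r><w:t>Hello box</w:t></w:r></w:p>
          </w:txbxContent></wps:txbx>
        </mc:Choice>
        <mc:Fallback>
          <w:pict><w:txbxContent>
            <w:p><w:r><w:t>Hello box</w:t></w:r></w:p>
          </w:txbxContent></w:pict>
        </mc:Fallback>
      </mc:AlternateContent>
    </w:r></w:p>
  </w:body>
</w:document>"""


def extract_alternate_text(xml_text):
    """Return (branch label, paragraph text) for every alternative branch.

    Nested AlternateContent blocks are not handled specially here, so
    their text would be reported more than once.
    """
    root = ET.fromstring(xml_text)
    results = []
    for alt in root.iter("{%s}AlternateContent" % MC):
        for branch in alt:
            if branch.tag == "{%s}Choice" % MC:
                label = "Choice[%s]" % branch.get("Requires", "")
            elif branch.tag == "{%s}Fallback" % MC:
                label = "Fallback"
            else:
                continue  # ignore anything that is not a Choice or Fallback
            # Collect the text runs of each paragraph inside this branch.
            for para in branch.iter("{%s}p" % W):
                text = "".join(t.text or "" for t in para.iter("{%s}t" % W))
                if text:
                    results.append((label, text))
    return results


for label, text in extract_alternate_text(SAMPLE):
    print(label, "->", text)
```

The same string shows up once per branch, which is the redundancy mentioned above: both copies have to be translated (or kept in sync) for the merged file to be consistent regardless of which branch the consuming application picks.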
-
Account Deleted - changed status to resolved
Comment 6. originally posted by @ysavourel on 2013-06-28T19:59:08.000Z:
Filter extracts text from mc:Fallback and mc:Choice Requires="wps". It also handles WordArt, TextArt, and Watermarks.
-
Account Deleted Comment 7. originally posted by @ysavourel on 2013-06-28T20:11:38.000Z:
Thanks Dan!
-
Account Deleted Comment 8. originally posted by @ysavourel on 2013-06-28T20:41:40.000Z:
It looks like some of the integration tests fail:
Failed tests:
TikalTest.testExtractMergeDOCX:201 File different from gold
TikalTest.testExtractSegmentMergeDOCX:214 File different from gold
Maybe some gold files are not up to date with the new extraction.
In the okapi-applications-integration-tests project.
Comment 1. originally posted by karlis.ged... on 2013-04-26T14:42:13.000Z:
This bug is the reason why the Okapi Framework cannot be used in the project I am working on. I would appreciate it if this were fixed.
These "mc:AlternateContent" tags are created when adding Text Boxes, Comments, and Equations, which are very common.