Merging error with nested <w:p> in OpenXML filter
Original issue 324 created by @ysavourel on 2013-04-05T12:57:45.000Z:
--- Report from Dmytro:
After further analysis of the problem, we found that the cause was in AlternateContent blocks that contain wps:txbx tags.
We have actually already solved this issue for our file; fixes are attached. But it could indicate a more general problem with alternate content, so it may deserve your attention.
You can reproduce the problem with a mock DOCX whose document.xml has a structure something like:
<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document>
  <w:body>
    <w:p>
      <mc:AlternateContent>
        <wps:txbx>
          <w:p>
          </w:p>
        </wps:txbx>
      </mc:AlternateContent>
    </w:p>
  </w:body>
</w:document>
Then see what happens to such a document.xml after a split and merge in Tikal.
See also attached files.
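To check whether a given document.xml contains this kind of nesting before running it through Tikal, a small scan like the following might help. It is only a sketch and not part of Okapi: the class and method names are made up for illustration, and it relies on the JDK SAX parser not being namespace-aware by default, so the undeclared prefixes in the mock document are accepted and elements are matched by their raw name "w:p".

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not an Okapi class.
class NestedParagraphCheck {

    /** Returns the deepest nesting level of w:p elements (0 if there are none). */
    static int maxParagraphDepth(String xml) throws Exception {
        final int[] depth = {0, 0}; // depth[0] = current depth, depth[1] = maximum seen
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if ("w:p".equals(qName)) {
                    depth[0]++;
                    depth[1] = Math.max(depth[1], depth[0]);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                if ("w:p".equals(qName)) {
                    depth[0]--;
                }
            }
        };
        // The factory is not namespace-aware unless asked, so qName is the raw tag name.
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
        return depth[1];
    }

    public static void main(String[] args) throws Exception {
        String mock = "<w:document><w:body><w:p><mc:AlternateContent><wps:txbx>"
                + "<w:p></w:p></wps:txbx></mc:AlternateContent></w:p></w:body></w:document>";
        // A result greater than 1 means at least one paragraph is nested in another.
        System.out.println(maxParagraphDepth(mock));
    }
}
```

A depth greater than 1 indicates a file that would exercise the problem described in this report.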
--- Input from Chase:
I am pretty sure I've run into DOCX files like this before, although I don't remember how to produce them. Basically, <w:p> is the OOXML paragraph container element. Very rarely, one paragraph can be nested within another, which means that the paragraph context in the filter needs to be pushed onto a stack while the inner paragraph is handled, then restored afterwards. It sounds like the Okapi filter doesn't expect this and just has a flag to tell whether we are "in a paragraph". So the filter's document state will break as soon as we leave the inner paragraph, since there's no depth count or stack to tell it that the outer paragraph is still open.
I don't have time to look at the filter right now - the simplest fix might be to try to just replace the boolean with an int that counts the depth of <w:p> tags we have seen. However, if there's any other paragraph-specific state (like paragraph style data), this will not be enough.
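The stack idea could look roughly like the sketch below. This is not the Okapi filter's actual code: `ParagraphTracker`, `ParagraphState`, and the `styleId` field are hypothetical names, and the real filter may keep different or additional per-paragraph state. The point is only that leaving the inner paragraph restores the outer paragraph's state instead of clearing a single flag.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical sketch, not an Okapi class.
class ParagraphTracker {

    /** Illustrative per-paragraph state; the real filter may track more than this. */
    static final class ParagraphState {
        String styleId;
        final StringBuilder text = new StringBuilder();
    }

    private final Deque<ParagraphState> stack = new ArrayDeque<>();

    /** Call on every opening w:p tag, nested or not. */
    void startParagraph() {
        stack.push(new ParagraphState());
    }

    /** Call on every closing w:p tag; returns the state of the paragraph that just ended. */
    ParagraphState endParagraph() {
        return stack.pop();
    }

    /** Replaces the single boolean flag: true while any paragraph is still open. */
    boolean inParagraph() {
        return !stack.isEmpty();
    }

    /** Equivalent to the simple depth counter suggested above. */
    int depth() {
        return stack.size();
    }

    /** State of the innermost open paragraph, or null when outside any paragraph. */
    ParagraphState current() {
        return stack.peek();
    }
}
```

With a plain int counter only `inParagraph()` and `depth()` would be available; the stack additionally keeps the outer paragraph's state intact while the inner one is processed.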
Comments (3)
Comment 3. originally posted by @ysavourel on 2013-05-04T19:33:27.000Z:
May be related to issue #323 (http://code.google.com/p/okapi/issues/detail?id=323&colspec=ID%20Type%20Status%20Priority%20Owner%20Summary%20Component&start=100). Point 3 there also mentions <wps:txbx/>.
Changed status to resolved
Comment 4. originally posted by @ysavourel on 2013-07-01T16:51:55.000Z:
Comment 1. originally posted by @ysavourel on 2013-04-05T12:57:57.000Z: