Merging error with nested <w:p> in OpenXML filter

Issue #324 resolved
Former user created an issue

Original issue 324 created by @ysavourel on 2013-04-05T12:57:45.000Z:

--- Report from Dmytro:

After further analysis of the problem, we found that the cause was AlternateContent blocks that contain wps:txbx tags.
We have already solved this issue for our file; the fixes are attached. However, it could indicate a more general problem with alternate content, so it may be worth a closer look.
You can reproduce the problem with a mock DOCX whose document.xml has a structure something like this:

<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<w:document>
  <w:body>
    <w:p>
      <mc:AlternateContent>
        <wps:txbx>
          <w:p>
          </w:p>
        </wps:txbx>
      </mc:AlternateContent>
    </w:p>
  </w:body>
</w:document>

Then see what happens to this document.xml after a split and merge in Tikal.
See also attached files.

--- Input from Chase:

I am pretty sure I've run into DOCX files like this before, although I don't remember how to produce them. Basically, <w:p> is the OOXML paragraph container element. Very rarely, one paragraph can be nested within another, which means that the paragraph context in the filter needs to be pushed onto a stack while the inner paragraph is handled. It sounds like the Okapi filter doesn't expect this and just has a flag to tell whether we are "in a paragraph". So the filter's document state will break as soon as we leave the inner paragraph, since there's no depth count, stack, etc.
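
As a rough illustration of the point above (this is not the Okapi filter's code, which is not shown in this thread), the Python sketch below walks a simplified document with a nested <w:p>. The sample text runs and the function names `dropped_by_flag` and `collect_with_stack` are assumptions made for the example. The first function keeps a single boolean and loses track of the outer paragraph once the inner one closes; the second pushes a new context per <w:p> and restores the outer one on close.

```python
import io
import xml.etree.ElementTree as ET

# Namespace URIs for the prefixes used in the report's sample.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
MC = "http://schemas.openxmlformats.org/markup-compatibility/2006"
WPS = "http://schemas.microsoft.com/office/word/2010/wordprocessingShape"
P, T = f"{{{W}}}p", f"{{{W}}}t"

# Simplified mock of the reported structure, with illustrative text runs added.
SAMPLE = f"""<w:document xmlns:w="{W}" xmlns:mc="{MC}" xmlns:wps="{WPS}">
<w:body><w:p><w:r><w:t>before </w:t></w:r>
<mc:AlternateContent><wps:txbx><w:p><w:r><w:t>inner</w:t></w:r></w:p></wps:txbx></mc:AlternateContent>
<w:r><w:t>after</w:t></w:r></w:p></w:body></w:document>"""


def _events(xml_text):
    # Stream start/end events, similar in spirit to a pull-parser based filter.
    return ET.iterparse(io.BytesIO(xml_text.encode("utf-8")), events=("start", "end"))


def dropped_by_flag(xml_text):
    """Single boolean: text after the inner </w:p> looks like it is outside any paragraph."""
    in_paragraph = False
    dropped = []
    for event, elem in _events(xml_text):
        if elem.tag == P:
            in_paragraph = (event == "start")  # inner </w:p> clears the flag too early
        elif event == "end" and elem.tag == T and not in_paragraph:
            dropped.append(elem.text or "")
    return dropped


def collect_with_stack(xml_text):
    """One context per open <w:p>; closing the inner one restores the outer one."""
    stack, finished = [], []
    for event, elem in _events(xml_text):
        if elem.tag == P:
            if event == "start":
                stack.append([])
            else:
                finished.append("".join(stack.pop()))
        elif event == "end" and elem.tag == T and stack:
            stack[-1].append(elem.text or "")
    return finished
```

With this sample, the flag-based version treats the trailing run as text outside a paragraph, while the stack-based version returns the inner paragraph first and then the complete outer one.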

I don't have time to look at the filter right now - the simplest fix might be to try to just replace the boolean with an int that counts the depth of <w:p> tags we have seen. However, if there's any other paragraph-specific state (like paragraph style data), this will not be enough.
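
To make the caveat concrete, here is a minimal sketch under stated assumptions: the class names, the `style` attribute, and the style values "Heading1" and "Normal" are all hypothetical and are not taken from the Okapi filter. A depth counter keeps "in a paragraph" correct across nesting, but any per-paragraph field stored alongside it is overwritten by the inner paragraph and never restored. A stack of per-paragraph state avoids that.

```python
class DepthCounterState:
    """Boolean replaced by an int; per-paragraph data is still a single slot."""

    def __init__(self):
        self.depth = 0
        self.style = None

    def start_paragraph(self, style):
        self.depth += 1
        self.style = style  # overwrites the outer paragraph's style

    def end_paragraph(self):
        self.depth -= 1  # nothing remembers what the outer style was

    @property
    def in_paragraph(self):
        return self.depth > 0


class StackState:
    """One entry per open paragraph; the outer entry survives the inner one."""

    def __init__(self):
        self._styles = []

    def start_paragraph(self, style):
        self._styles.append(style)

    def end_paragraph(self):
        self._styles.pop()

    @property
    def in_paragraph(self):
        return bool(self._styles)

    @property
    def style(self):
        return self._styles[-1] if self._styles else None


def state_after_inner_paragraph(state):
    # Outer paragraph opens, inner paragraph opens and closes, outer is still open.
    state.start_paragraph("Heading1")
    state.start_paragraph("Normal")
    state.end_paragraph()
    return state.in_paragraph, state.style
```

Both variants report that the outer paragraph is still open after the inner one closes, but only the stack variant still reports the outer paragraph's style at that point.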
