Docx <w:r><w:t> ignored while needed

Issue #333 resolved
Former user created an issue

Original issue 333 created by aurelien.tomass... on 2013-05-07T13:14:17.000Z:

I am trying to extract text from a simple docx file.
The file contains only plain text, with no formatting.

Its XML data is:

<w:document ...>
<w:body>
<w:p w:rsidRDefault="00A163E0" w:rsidR="007514B6">
<w:r><w:t>Je suis un document simple</w:t></w:r>
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="007514B6">
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

When extracting the text unit, I expect it to contain:
<w:r><w:t>Je suis un document simple</w:t></w:r>
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>

But it contains only:
Je suis un document simple
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>
because the filter assumes that <w:r><w:t> carries no formatting information (see line 1681 of http://code.google.com/p/okapi/source/browse/okapi/filters/openxml/src/main/java/net/sf/okapi/filters/openxml/OpenXMLContentFilter.java?name=html5&r=b61c220fb4a61a054bb85e26009a26d1c8053673).

When the filter writer is then used, the document it writes is:

<w:document ...>
<w:body>
<w:p w:rsidRDefault="00A163E0" w:rsidR="007514B6">
Je suis un document simple
<w:bookmarkStart w:name="_GoBack" w:id="0"/>
<w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="007514B6">
<w:pgSz w:w="11906" w:h="16838"/>
<w:pgMar w:gutter="0" w:footer="708" w:header="708" w:left="1440" w:bottom="1440" w:right="1440" w:top="1440"/>
<w:cols w:space="708"/>
<w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

=> The <w:r><w:t> and </w:t></w:r> tags are missing, so the result is not a valid docx file.
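
One way to detect this symptom in written output is to look for text sitting directly under a <w:p> element instead of inside <w:r><w:t>. The following is a minimal sketch using only the JDK's DOM parser. It does not use any Okapi API; the class name RunWrapperCheck and the method hasBareParagraphText are hypothetical names chosen for illustration, and the namespace declaration is written out because the dumps above elide it with "...".

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper, not part of Okapi: flags paragraphs whose text
// is not wrapped in a run (<w:r>) and text (<w:t>) element.
public class RunWrapperCheck {
    // WordprocessingML main namespace, normally bound to the "w" prefix.
    static final String W_NS =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns true if any w:p element has a non-blank text node as a direct child. */
    static boolean hasBareParagraphText(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true); // required for getElementsByTagNameNS
        Document doc = factory.newDocumentBuilder()
            .parse(new InputSource(new StringReader(xml)));

        NodeList paragraphs = doc.getElementsByTagNameNS(W_NS, "p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            NodeList children = paragraphs.item(i).getChildNodes();
            for (int j = 0; j < children.getLength(); j++) {
                Node child = children.item(j);
                // Whitespace between elements is harmless; real text is not.
                if (child.getNodeType() == Node.TEXT_NODE
                        && !child.getNodeValue().trim().isEmpty()) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) throws Exception {
        String open = "<w:document xmlns:w=\"" + W_NS + "\"><w:body><w:p>";
        String close = "<w:bookmarkStart w:name=\"_GoBack\" w:id=\"0\"/>"
            + "<w:bookmarkEnd w:id=\"0\"/></w:p></w:body></w:document>";

        // Original document: text wrapped in <w:r><w:t>.
        System.out.println(hasBareParagraphText(
            open + "<w:r><w:t>Je suis un document simple</w:t></w:r>" + close));
        // Written document from this report: wrappers missing.
        System.out.println(hasBareParagraphText(
            open + "Je suis un document simple" + close));
    }
}
```

Run against the two paragraphs from this report, the check returns false for the original document and true for the written one, which matches the missing-wrapper problem described above.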

Comments (4)

  1. Chase Tingley

    This was fixed in M28. Incidentally, the test file is labeled as a .png file, but it's actually a valid .docx.
