Open XML filter for Word docs generates text units prefixed with [#$dp] markers.

Issue #419 resolved
Former user created an issue

Original issue 419 created by 143.ravik... on 2014-10-07T02:30:27.000Z:

The two attached Word files have exactly the same content, but they produce two different kinds of text unit content -

Example3.docx generates -

[#$dp2]<w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial"/></w:rPr><w:t xml:space="preserve">I am a simple text. What do you </w:t></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/><w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial"/></w:rPr><w:t>think?</w:t></w:r>

This text unit is prefixed with a metadata marker related to the "Document Part" event.

The Open XML filter fails to merge it back correctly, and the translated document fails to open.

By contrast, the text unit generated for Example 2 does not have the "[#$dp2]" prefix and works as expected.

Is there a particular order in which the filter parses the XML files inside a given .docx file, and a reason for it?

Thanks

Comments (9)

  1. Former user Account Deleted

    Comment 1. originally posted by 143.ravik... on 2014-10-08T02:23:08.000Z:

    The above example in comment 1 was on Okapi version 0.24.

    I updated Okapi to version 0.26 and parsed the attached file, input.docx.
    Its text content is simple, with one embedded image.

    The text units generated were -

    [#$dp2][#$dp3]Some results may have been blocked under EU data protection law. </w:t></w:r>[#$dp4]Learn more</w:t></w:r>

    Picture 1

    <w:r><w:rPr><w:rFonts w:ascii="Arial" w:hAnsi="Arial" w:cs="Arial"/><w:b/><w:i/><w:sz w:val="22"/><w:szCs w:val="22"/></w:rPr><w:t>There is also a text after image</w:t></w:r>

    Opening the generated output file produces an invalid-content error, and the file fails to open.

    The platform is Microsoft Word for Mac 2011.

  2. Former user Account Deleted

    Comment 2. originally posted by @ysavourel on 2014-10-08T13:12:43.000Z:

    I've tried input.docx with both Tikal and Rainbow, extracting and merging back: I got a docx file with no error, including the image of the bunch of flowers.
    The [dp...] markers are there, but as inline codes (as expected), and I don't see them in the text of the merged document.
    I've also tried the two other examples without issues.

    I'm not ruling out a bug: I just can't reproduce it for now.

    Question:
    - What tool are you using for the extraction/merging, and with what options (if applicable)?

    Thanks,
    -ys

  3. Former user Account Deleted

    Comment 3. originally posted by 143.ravik... on 2014-10-09T04:05:55.000Z:

    I am using an Okapi pipeline to generate the text units for extraction and merging.

    The extraction part works fine: it generates the required text units.
    (I am assuming no placeholder is required in a text unit's source to mark an image location.)

    The issue seems to occur while merging back using the same RawDocument.

    The overridden OpenXMLZipFilterWriter and OpenXMLFilter files are attached.

    The only option modified here is - setBPreferenceTranslateDocProperties(false);

    While merging, the filter processes the files inside the docx (Word_Image2.docx) in this order -
    1. [Content_Types].xml
    2. word/styles.xml
    3. word/document.xml
    4. word/settings.xml
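
The question of processing order can be checked independently of Okapi. The following is a minimal sketch, assuming only the JDK's java.util.zip classes: it lists the entries of a zip archive in the order they are stored, so that order can be compared with the one observed during merging. The class name DocxEntryOrder and the helper zipOf are hypothetical names introduced for illustration, and the sketch says nothing about how the Open XML filter itself chooses its order.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, not part of Okapi.
class DocxEntryOrder {

    /** Returns the entry names in the order they are stored in the archive. */
    static List<String> entryNames(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream in = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = in.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a tiny in-memory archive with the given entry names (demo only). */
    static byte[] zipOf(String... names) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (ZipOutputStream out = new ZipOutputStream(buffer)) {
            for (String name : names) {
                out.putNextEntry(new ZipEntry(name));
                out.write("<x/>".getBytes(StandardCharsets.UTF_8));
                out.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    public static void main(String[] args) {
        // Stand-in for a real .docx; replace with the bytes of an actual file.
        byte[] sample = zipOf("[Content_Types].xml", "word/styles.xml",
                "word/document.xml", "word/settings.xml");
        entryNames(sample).forEach(System.out::println);
    }
}
```

For a real document, the bytes could be read with java.nio.file.Files.readAllBytes and passed to entryNames in place of the in-memory sample.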

    For the attached docx file, document.xml produces the following text units.

    1. This is a simple text
    2. Picture 1

    An untranslatable text unit is also generated, with textUnit.getSource().hasText() == false : -

    textUnit = (net.sf.okapi.common.resource.TextUnit) <w:r><w:rPr><w:noProof/><w:lang w:eastAsia="zh-TW"/></w:rPr><w:drawing><wp:inline distT="0" distB="0" distL="0" distR="0" wp14:anchorId="3F1AD5DF" wp14:editId="038BA628"><wp:extent cx="3492500" cy="2324100"/><wp:effectExtent l="0" t="0" r="12700" b="12700"/><wp:docPr id="2" [#$tu3]/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/></wp:cNvGraphicFramePr>[#$sg1]</wp:inline></w:drawing></w:r><w:bookmarkStart w:id="0" w:name="_GoBack"/><w:bookmarkEnd w:id="0"/>

    The entries of the source and target docx files are exactly the same, except for the following difference in "word/document.xml" -

    Source -

    <wp:docPr id="2" name="Picture 1"/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:nvPicPr><pic:cNvPr id="0" name="Picture 1"/>

    Target -

    <wp:docPr id="2" -ERR:REF-NOT-FOUND-/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main" noChangeAspect="1"/></wp:cNvGraphicFramePr><a:graphic xmlns:a="http://schemas.openxmlformats.org/drawingml/2006/main"><a:graphicData uri="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:pic xmlns:pic="http://schemas.openxmlformats.org/drawingml/2006/picture"><pic:nvPicPr><pic:cNvPr id="0" -ERR:REF-NOT-FOUND-/>

  4. Former user Account Deleted
    • changed status to open

    Comment 4. originally posted by @ysavourel on 2014-10-09T04:32:15.000Z:

    "-ERR:REF-NOT-FOUND-"

    This is a merge error string. I'll see if I can debug this using our tkit integration test. Possibly the latest changes in m27 will give different results.
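
Since "-ERR:REF-NOT-FOUND-" marks a failed merge, a merged document.xml can be checked for it before the file is handed to Word. The following is a minimal sketch, assuming the XML has already been read into a string; the class MergeErrorScan and its method are hypothetical names for illustration and are not part of Okapi.

```java
// Hypothetical helper, not part of Okapi.
class MergeErrorScan {

    /** The merge error string quoted in this thread. */
    static final String MERGE_ERROR = "-ERR:REF-NOT-FOUND-";

    /** Counts non-overlapping occurrences of the merge error string in the XML. */
    static int countMergeErrors(String xml) {
        int count = 0;
        int from = 0;
        while ((from = xml.indexOf(MERGE_ERROR, from)) >= 0) {
            count++;
            from += MERGE_ERROR.length();
        }
        return count;
    }

    public static void main(String[] args) {
        // Fragment shaped like the target output quoted in comment 3.
        String target = "<wp:docPr id=\"2\" -ERR:REF-NOT-FOUND-/>"
                + "<pic:cNvPr id=\"0\" -ERR:REF-NOT-FOUND-/>";
        System.out.println("Unresolved references: " + countMergeErrors(target));
    }
}
```

A count above zero means at least one reference was left unresolved during the merge, which would explain a document that fails to open.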

  5. Former user Account Deleted

    Comment 5. originally posted by @ysavourel on 2014-10-09T17:25:45.000Z:

    I have confirmed that with the latest m27, using the default configuration, all attached files extract and merge without problems. I also tried with "openXmlFilter.setBPreferenceTranslateDocProperties(false);" and got the same results.

    This is with and without segmentation.

    I'm leaning toward either (1) this bug has been fixed in m27 or (2) there is something else in the pipeline or custom derived filter and writer causing the problem.

    Can you retry your tests with the latest m27-SNAPSHOT? If that doesn't work please tell us the exact steps in your pipeline.

  6. Former user Account Deleted

    Comment 6. originally posted by 143.ravik... on 2014-10-10T03:54:32.000Z:

    I tried with the m27-SNAPSHOT version but got the same results on my side. I am still seeing the "-ERR:REF-NOT-FOUND-" string in the derived document.xml.

    Attaching my pipeline steps -
    For Extraction -

    1. ExtractionStep.java - It has the details of the pipeline I am using. It is composed of the DocTubStep, which is used to store the extracted text units in the DB. I tried with the segmentation and metrics steps both on and off.

    For Merging -

    1. MergeService.java - It has the details of the pipeline used for merging. It is composed of the TranslateStep.java (to fetch the translations from the DB for the extracted text units of the RawDocument) and a FilterEventsStreamWriterStep, which is decorated with the MSFilterWriter to write the translated contents back into an OutputStream.

    Both the extraction and merging steps use the same WordFileFormat, MSFilter and MSFilterWriter.

  7. Former user Account Deleted

    Comment 7. originally posted by @ysavourel on 2014-10-10T10:47:23.000Z:

    The symptoms of the issue look like the ReferenceFlag info of the inline codes that have references is not set properly.

    If the getData() of a Code has one or more markers starting with "[#$", that code must have the reference flag set to true (code.setReferenceFlag(true)).

    It looks like the events are saved in some kind of DB store in this pipeline. Maybe that information is not saved properly and is missing when merging back?
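
The reference-flag rule in comment 7 can be illustrated without the Okapi library. The following is a minimal sketch in which InlineCode is a hypothetical stand-in for an inline code object (it is not Okapi's Code class, which exposes getData() and setReferenceFlag(true) as mentioned above). It sets the flag on every code whose data contains a marker starting with "[#$" and reports how many codes had the flag missing.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not part of Okapi.
class ReferenceFlagCheck {

    /** Start of a reference marker, as described in comment 7. */
    static final String REF_MARKER_START = "[#$";

    /** Stand-in for an inline code: raw data plus a reference flag. */
    static class InlineCode {
        final String data;
        boolean referenceFlag;

        InlineCode(String data, boolean referenceFlag) {
            this.data = data;
            this.referenceFlag = referenceFlag;
        }
    }

    /** True when the data contains at least one reference marker. */
    static boolean needsReferenceFlag(String data) {
        return data != null && data.contains(REF_MARKER_START);
    }

    /** Sets the flag where it is missing; returns the number of codes repaired. */
    static int repair(List<InlineCode> codes) {
        int repaired = 0;
        for (InlineCode code : codes) {
            if (needsReferenceFlag(code.data) && !code.referenceFlag) {
                code.referenceFlag = true;
                repaired++;
            }
        }
        return repaired;
    }

    public static void main(String[] args) {
        List<InlineCode> codes = new ArrayList<>();
        // Data shaped like the codes quoted in comment 3.
        codes.add(new InlineCode("<wp:docPr id=\"2\" [#$tu3]/>", false));
        codes.add(new InlineCode("<w:b/>", false));
        System.out.println("Codes repaired: " + repair(codes));
    }
}
```

If a custom step rebuilds or copies codes between extraction and merging, a check of this kind on the rebuilt codes would show whether the flag was lost along the way.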

  8. Former user Account Deleted

    Comment 8. originally posted by 143.ravik... on 2014-10-11T02:06:12.000Z:

    I tried a Word doc whose text units do not have "[#$" markers.
    They are simple ones, like -

    "This is a simple text" and one more text unit for the name of the picture
    "Picture 1"

    I see a similar output with "-ERR:REF-NOT-FOUND-". I don't store the events in any DB store; in fact, the events are generated fresh by each pipeline for extraction and merge.

    I looked at a few test cases at http://code.google.com/p/okapi/source/browse/okapi/filters/openxml/src/test/java/net/sf/okapi/filters/openxml

    Not sure what else could be missing or corrupting the merge back.
