OpenXML Filter: DOCX: Missing inline code for fields at the beginning of text units

Issue #1322 open
Kai Wang created an issue

OpenXML Filter fails to extract the inline code for a field located at the beginning of a text unit in a DOCX file. By contrast, if any text (e.g. “abc” in the sample below) is prepended to the text unit so that the field is no longer at the beginning, the inline code is extracted normally.

The sample code and DOCX files are included in the attachment file “sample-okapi.zip”.

The output of the code:

> mvn clean package exec:java -Dexec.mainClass="com.mycompany.Main"
[INFO] Scanning for projects...
...
========== Parsing missing_inlinecode.docx ==========
[com.mycompany.Main.main()] INFO net.sf.okapi.common.pipelinedriver.PipelineDriver - Input (No path available)
Text unit coded text: " shows how the data is organized in the PAT Module Dashboard."
Text unit coded text: "\uE101\uE110T\uE102\uE111able \uE103\uE112\uE103\uE113. PAT Module Dashboard"
Text unit coded text: "Kai Wang"

========== Parsing normal.docx ==========
[com.mycompany.Main.main()] INFO net.sf.okapi.common.pipelinedriver.PipelineDriver - Input (No path available)
Text unit coded text: "abc\uE103\uE110 shows how the data is organized in the PAT Module Dashboard."
Text unit coded text: "\uE101\uE110T\uE102\uE111able \uE103\uE112\uE103\uE113. PAT Module Dashboard"
Text unit coded text: "Kai Wang"

Note that, for the file “missing_inlinecode.docx”, the first text unit is missing the inline code before “ shows how the data is …”.
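
To make the difference between the two outputs easier to see, here is a minimal sketch that lists the inline-code markers found in a coded-text string. It does not depend on Okapi. The class and method names (`CodedTextMarkers`, `describeCodes`) are made up for illustration, and the marker values are assumed to match the `MARKER_OPENING`, `MARKER_CLOSING`, `MARKER_ISOLATED` and `CHARBASE` constants of Okapi's `TextFragment`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Okapi: scans a coded-text string for marker pairs.
class CodedTextMarkers {
    // Assumed to mirror the TextFragment constants (U+E101, U+E102, U+E103, base U+E110).
    static final char MARKER_OPENING = '\uE101';
    static final char MARKER_CLOSING = '\uE102';
    static final char MARKER_ISOLATED = '\uE103';
    static final int CHARBASE = 0xE110;

    static boolean isMarker(char c) {
        return c == MARKER_OPENING || c == MARKER_CLOSING || c == MARKER_ISOLATED;
    }

    // Returns one entry per inline code, formatted as TYPE#codeIndex@offset.
    static List<String> describeCodes(String codedText) {
        List<String> codes = new ArrayList<>();
        for (int i = 0; i + 1 < codedText.length(); i++) {
            char c = codedText.charAt(i);
            if (!isMarker(c)) {
                continue;
            }
            // The character after the marker encodes the index of the code.
            int codeIndex = codedText.charAt(i + 1) - CHARBASE;
            String type = (c == MARKER_OPENING) ? "OPENING"
                    : (c == MARKER_CLOSING) ? "CLOSING" : "ISOLATED";
            codes.add(type + "#" + codeIndex + "@" + i);
            i++; // skip the index character
        }
        return codes;
    }

    public static void main(String[] args) {
        // First text unit of missing_inlinecode.docx (shortened): no marker at all.
        System.out.println(describeCodes(" shows how")); // prints []
        // First text unit of normal.docx (shortened): one isolated code after "abc".
        System.out.println(describeCodes("abc\uE103\uE110 shows how")); // prints [ISOLATED#0@3]
    }
}
```

Applied to the outputs above, the first text unit of “missing_inlinecode.docx” yields no codes, while the same text unit in “normal.docx” yields one isolated code at offset 3.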

Comments (6)

  1. Denis Konovalyenko

    @Kai Wang @jhargrave-straker as far as I remember, that was an intentional change: if a run does not contain visible text and the current fragment is empty, the code is marked as hidden, goes to the skeleton, and is then written back on merge. This improves the segmentation quality. The related pull request is #297.

    As usual, we can introduce a conditional parameter to support both ways of extraction. What do you think?

  2. Kai Wang reporter

    @Denis Konovalyenko, thank you for the update! I personally agree with introducing a conditional parameter to support both ways of extraction.

  3. jhargrave-straker

    There are cases where you want to preserve the inline formatting in the segment so it can be altered in the translation (I assume this is the case here). For example, CodeSimplifier has an option to trim inline codes or preserve them as-is. I think an option would be a good idea in this case.
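
The behaviour described in this thread, together with the proposed option, can be summarised as a small decision rule. The sketch below is not the filter's actual code: the class `LeadingCodePolicy`, the enum `Placement` and the flag `keepLeadingCodes` are hypothetical names chosen for illustration, and the rule is simplified to the single case discussed here (a run without visible text at the start of a fragment).

```java
// Hypothetical illustration of the rule discussed above, not Okapi code.
class LeadingCodePolicy {
    enum Placement { SKELETON, INLINE }

    // keepLeadingCodes stands in for the proposed conditional parameter (assumed name).
    static Placement place(boolean runHasVisibleText,
                           boolean fragmentIsEmpty,
                           boolean keepLeadingCodes) {
        // Current behaviour as described: a run without visible text at the start
        // of an empty fragment is hidden and moved to the skeleton.
        if (!runHasVisibleText && fragmentIsEmpty && !keepLeadingCodes) {
            return Placement.SKELETON;
        }
        // Otherwise the code stays in the text unit as an inline code.
        return Placement.INLINE;
    }

    public static void main(String[] args) {
        // Field at the beginning of the text unit, option off.
        System.out.println(place(false, true, false)); // prints SKELETON
        // Same field with the hypothetical option on.
        System.out.println(place(false, true, true)); // prints INLINE
        // Field preceded by "abc": the fragment is no longer empty.
        System.out.println(place(false, false, false)); // prints INLINE
    }
}
```

With the flag off, the rule reproduces the two outputs reported in the issue; with it on, the leading field would stay in the segment so it can be moved or altered in the translation.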
