OpenXML: Text runs containing multiple text fragments + tabs lose content on merge

Issue #458 resolved
Chase Tingley created an issue

Testcase attached. This Word file contains a slightly odd structure that more recent versions of Office may no longer produce (or it may be produced by LibreOffice or some other tool, I'm not sure). It has a run that contains multiple <w:t> fragments interspersed with <w:tab>. This is legal, but it breaks our filter. When you roundtrip this, the first "TEST" word is lost.

tikal.sh -fc okf_openxml -x tabs.docx
tikal.sh -fc okf_openxml -m tabs.docx.xlf
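
For anyone who wants to see the shape of the run without opening the attachment, here is a minimal sketch in Python. The XML below is a hypothetical run written for illustration, not copied from tabs.docx; only the element names (<w:r>, <w:t>, <w:tab>) and the WordprocessingML namespace are taken as given.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace.
W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Hypothetical run (an assumption, not the content of tabs.docx):
# two <w:t> fragments separated by a <w:tab/> inside a single <w:r>.
RUN_XML = (
    f'<w:r xmlns:w="{W}">'
    "<w:t>TEST</w:t><w:tab/><w:t>TEST</w:t>"
    "</w:r>"
)


def run_children(run_xml):
    """Return the local names of the run's children in document order."""
    run = ET.fromstring(run_xml)
    # ElementTree tags look like "{namespace}local"; keep the local part.
    return [child.tag.split("}", 1)[1] for child in run]


print(run_children(RUN_XML))  # ['t', 'tab', 't']
```

A run shaped like this has text on both sides of the tab, so any code that treats a run as "one piece of text" has to decide what to do with the second fragment.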

Comments (8)

  1. Chase Tingley reporter

    As a side note, I built this testcase by hand, based on an example seen in the wild, and if you save it out again in Office, the structure that causes problems goes away. So be careful not to do that if testing this bug.

  2. Luciano Coccia

    I faced the same problem, and I think the issue is caused by multiple <w:t> elements being present in a single <w:r> tag, so the <w:tab> has no effect on it.

    I attached a simple docx example and the xlf output retrieved from tikal using this command:

    ./tikal.sh -x /home/ubuntu/Desktop/germanDocs/okapi_bug.docx -sl de -tl en
    

    Thanks!

  3. Chase Tingley reporter

    Ah, your theory makes sense Luciano. I suspect that the OpenXMLContentFilter code is only expecting there to be one <w:t> in a <w:r>, since it is uncommon for there to be more than one.
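
To illustrate the theory (this is only a sketch in Python, not the actual OpenXMLContentFilter logic, and both function names are hypothetical): a reader that keeps a single text value per run loses every fragment but one, while a reader that walks all children of the run in order keeps both fragments and the tab.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical run used only for this illustration.
RUN_XML = (
    f'<w:r xmlns:w="{W}">'
    "<w:t>TEST</w:t><w:tab/><w:t>SECOND</w:t>"
    "</w:r>"
)


def text_assuming_single_t(run):
    """Keep one text value per run; each <w:t> overwrites the previous one."""
    text = ""
    for t in run.findall("w:t", NS):
        text = t.text or ""
    return text


def text_all_children(run):
    """Walk every child in order, keeping each <w:t> and mapping <w:tab> to a tab."""
    parts = []
    for child in run:
        if child.tag == f"{{{W}}}t":
            parts.append(child.text or "")
        elif child.tag == f"{{{W}}}tab":
            parts.append("\t")
    return "".join(parts)


run = ET.fromstring(RUN_XML)
print(repr(text_assuming_single_t(run)))  # 'SECOND'
print(repr(text_all_children(run)))       # 'TEST\tSECOND'
```

Whether the real filter drops the fragment by overwriting or by some other path is an assumption here; the point is only that a one-<w:t>-per-run expectation cannot represent this input.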

  4. Luciano Coccia

    Yes Chase, I think that's the point.

    As you said, it's not common to have more than one <w:t>; indeed, our bad docx came from an OCR service.

  5. Chase Tingley reporter

    I've gone deep enough into this that I can confirm it's the same issue. The culprit is OpenXMLContentFilter.combineRepeatedFormat(), which is a nasty piece of work. I am working to figure out the best way to deal with this.

  6. Chase Tingley reporter

    Fix Issue 458, Fix Issue 467, and Fix Issue 473 in the openxml filter

    This rewrites OpenXMLContentFilter.combineRepeatedFormat() and
    splits out the markup simplification content to a new class called
    ParagraphSimplifier.  This fixes many issues in the old code with
    multiple <t> elements in a single run, as well as issues with tabs
    and linebreaks that were being lost when interspersed with text.
    This covers Issue 458 and re-fixes Issue 467 in a better way.
    
    Additional fixes were to Issue 473 and an unfiled problem with
    entities in deleted text that weren't being re-escaped in target
    output.
    
    This has caused some changes to placeholder creation in segments.
    

    → <<cset 2445b887857a>>
