OpenXML: Text runs containing multiple text fragments + tabs lose content on merge
Testcase attached. This Word file contains a slightly odd structure that more recent versions of Office may no longer produce (it may also have been produced by LibreOffice or some other tool; I'm not sure). It has a run that contains multiple <w:t> fragments interspersed with <w:tab>. This is legal, but it breaks our filter. When you round-trip the file with the two commands below, the first "TEST" word is lost.
tikal.sh -fc okf_openxml -x tabs.docx
tikal.sh -fc okf_openxml -m tabs.docx.xlf
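For anyone who wants to check whether a given document has the run structure described in the report, here is a minimal Python sketch. The sample XML is written by hand for illustration and is not the content of tabs.docx, and the helper name runs_with_multiple_text is made up for this example. It only inspects the markup; it says nothing about how the filter itself processes it.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" tag form.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"

# Hand-written sample (an assumption, not taken from tabs.docx): the first
# run holds several <w:t> fragments separated by <w:tab/>.
SAMPLE = (
    '<w:document xmlns:w="' + W_NS + '"><w:body><w:p>'
    '<w:r><w:t>TEST</w:t><w:tab/><w:t>one</w:t><w:tab/><w:t>two</w:t></w:r>'
    '<w:r><w:t>plain run</w:t></w:r>'
    '</w:p></w:body></w:document>'
)


def runs_with_multiple_text(document_xml):
    """Return the child tag sequence of every run with more than one <w:t>."""
    root = ET.fromstring(document_xml)
    found = []
    for run in root.iter(W + "r"):
        if len(run.findall(W + "t")) > 1:
            # Shorten "{uri}t" to "w:t" so the result is readable.
            found.append([child.tag.replace(W, "w:") for child in run])
    return found


print(runs_with_multiple_text(SAMPLE))
# prints [['w:t', 'w:tab', 'w:t', 'w:tab', 'w:t']]
```

To run this against a real file, read word/document.xml out of the .docx (which is a zip archive) with the standard zipfile module and pass its contents to the function.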
Comments (8)
-
reporter -
- attached okapi_bug.docx
Bad docx
-
- attached okapi_bug.docx.xlf
-
I ran into the same problem, and I think it is caused by multiple <w:t> elements being present in a single <w:r> element, so the <w:tab> has no effect.
I attached a simple docx example and the xlf output retrieved from tikal using this command:
./tikal.sh -x /home/ubuntu/Desktop/germanDocs/okapi_bug.docx -sl de -tl en
Thanks!
-
reporter Ah, your theory makes sense, Luciano. I suspect that the OpenXMLContentFilter code expects only one <w:t> per <w:r>, since it is uncommon for there to be more than one.
-
Yes Chase, I think that's the point.
As you said, it's not common to have more than one <w:t>; indeed, our bad docx came from an OCR service.
-
reporter I've gone deep enough into this that I can confirm it's the same issue. The culprit is OpenXMLContentFilter.combineRepeatedFormat(), which is a nasty piece of work. I am working to figure out the best way to deal with this.
-
reporter - changed status to resolved
Fix Issue 458, Fix Issue 467, and Fix Issue 473 in the openxml filter
This rewrites OpenXMLContentFilter.combineRepeatedFormat() and splits out the markup simplification content to a new class called ParagraphSimplifier. This fixes many issues in the old code with multiple <t> elements in a single run, as well as issues with tabs and linebreaks that were being lost when interspersed with text. This covers Issue 458 and re-fixes Issue 467 in a better way. Additional fixes were to Issue 473 and an unfiled problem with entities in deleted text that weren't being re-escaped in target output. This has caused some changes to placeholder creation in segments.
→ <<cset 2445b887857a>>
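As a rough illustration of the property the fix has to preserve (text fragments, tabs, and line breaks kept in document order within a run), here is a short Python sketch. It is not the ParagraphSimplifier code and makes no claim about how that class works; the function name run_text and the sample run are made up for this example.

```python
import xml.etree.ElementTree as ET

# WordprocessingML main namespace, in ElementTree's "{uri}" tag form.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
W = "{" + W_NS + "}"


def run_text(run):
    """Flatten one <w:r> element, walking its children in document order."""
    parts = []
    for child in run:
        if child.tag == W + "t":
            parts.append(child.text or "")
        elif child.tag == W + "tab":
            parts.append("\t")   # keep the tab between fragments
        elif child.tag == W + "br":
            parts.append("\n")   # keep the line break between fragments
        # Other children (for example <w:rPr>) carry no text and are skipped.
    return "".join(parts)


# Hand-written run (an assumption, not taken from the attached files).
RUN = ET.fromstring(
    '<w:r xmlns:w="' + W_NS + '">'
    '<w:t>TEST</w:t><w:tab/><w:t>one</w:t><w:br/><w:t>two</w:t>'
    '</w:r>'
)

print(repr(run_text(RUN)))
# prints 'TEST\tone\ntwo'
```

A check like this can serve as a quick sanity test when comparing the text of the source document with the merged output: every fragment, including the leading "TEST", should survive the round trip.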
As a side note, I built this testcase by hand, based on an example seen in the wild, and if you save it out again in Office, the structure that causes problems goes away. So be careful not to do that if testing this bug.