Invalid DOCX files created with Moses InlineText tag rearranging and round-trip
Original issue 176 created by Achi... on 2011-07-04T20:35:41.000Z:
Steps to reproduce:
1. Create a new Word document in Microsoft Word 2007 with the text
"This is page ."
2. Position the cursor before the period and choose Insert/Page Number/Current Position/Plain Number: the number 1 is inserted
3. Save the document as test.docx
4. tikal.bat -xm test.docx -sl en
The first line of the resulting test.docx.en is:
This is page <x id="1"/><g id="2">1</g><x id="3"/>.
5. Edit the first line to read:
This is page <x id="3"/><x id="1"/><g id="2">1</g>.
6. Save as test.docx.fr
7. tikal.bat -lm test.docx -totrg -from test.docx.fr
8. Open the resulting test.out.docx in Microsoft Word
Result:
Word cannot open the file: "The file test.out.docx cannot be opened because there are problems with the contents."
Details:
"The name in the end tag of the element must match the element type in the start tag." Location: Part: /word/document.xml [...]
Remark:
This kind of tag rearranging, while a bit nonsensical in this example, happens often in longer segments during translation or machine translation.
Analysis of DOCX XML:
9. Extract the contents of test.out.docx with an extraction program (e.g. 7-Zip)
10. View file test.out.docx/word/document.xml
Invalid XML: Closing tag </w:fldSimple> appears before opening tag <w:fldSimple ...>
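The manual check above can also be scripted. The following is a minimal sketch, not part of Okapi or Tikal: it reads word/document.xml from the archive and reports whether it parses as XML. The names check_docx_part and make_archive are hypothetical, and the two sample documents are reduced stand-ins for the real file, not its actual content.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def check_docx_part(docx_file, part="word/document.xml"):
    """Return None if the part is well-formed XML, else the parser's message."""
    with zipfile.ZipFile(docx_file) as archive:
        data = archive.read(part)
    try:
        ET.fromstring(data)
    except ET.ParseError as err:
        return str(err)
    return None


def make_archive(document_xml):
    """Hypothetical helper: build an in-memory zip holding only word/document.xml."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("word/document.xml", document_xml)
    buffer.seek(0)
    return buffer


# Reduced samples (assumption): start tag before end tag, and the reverse.
good = f'<w:document xmlns:w="{W_NS}"><w:fldSimple w:instr=" PAGE "></w:fldSimple></w:document>'
bad = f'<w:document xmlns:w="{W_NS}"></w:fldSimple><w:fldSimple w:instr=" PAGE "></w:document>'

print(check_docx_part(make_archive(good)))  # None
print(check_docx_part(make_archive(bad)))   # a parser error message
```

On a real file the call would be check_docx_part("test.out.docx"). This only tests well-formedness; it does not validate the document against the WordprocessingML schema.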
Comments (5)
Comment [2.](https://code.google.com/p/okapi/issues/detail?id=176#c2) originally posted by Achi... on 2011-07-05T13:44:36.000Z:
In this case my preferred behavior would be: 1. the filter warns that invalid XML is being output, and 2. the filter escapes or deletes the invalid XML (this could be a filter option).
- attached file.docx
- attached bug_the_open_in_office2007.JPG
- attached aftertest
- attached file.out.docx
Comment 4. originally posted by bailo... on 2013-11-22T10:06:32.000Z:
I'm a user of Okapi and I had trouble opening a document, file.docx, in Microsoft Office 2007. The error message is: /Word/document.xml, line 6294, column 6293. The problem doesn't exist in OpenOffice: there are no problems with file.docx there (I can open the file without any error message).
tikal.sh -lm file.docx -totrg -from aftertest
- edited description
- removed responsible
- changed status to wontfix
Low priority
Comment [1.](https://code.google.com/p/okapi/issues/detail?id=176#c1) originally posted by @ysavourel on 2011-07-05T03:16:57.000Z:
I can reproduce the issue. Extracting to XLIFF with <bpt>/<ept> shows the DOCX codes:
<ph id="1"><w:fldSimple w:instr=" PAGE \* MERGEFORMAT "></ph> ... <ph id="3"></w:fldSimple></ph>
Ideally those two placeholders would be paired tags. But that is difficult to achieve with Word.
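Until the codes are paired, the rearrangement could at least be detected before merging. This is a rough sketch under stated assumptions, not an existing Okapi feature: misordered_pairs and the pairs list are hypothetical, and the information that placeholder 1 opens the element closed by placeholder 3 would have to come from the filter's stored code data.

```python
import re

# Matches the isolated placeholders of the Moses InlineText format, e.g. <x id="1"/>.
PLACEHOLDER = re.compile(r'<x id="(\d+)"/>')


def misordered_pairs(segment, pairs):
    """Return the (open_id, close_id) pairs whose closing placeholder
    appears before the opening one in the segment.

    `pairs` is a hypothetical list of (open_id, close_id) tuples saying which
    placeholders hold the start and end tag of the same element.
    """
    position = {m.group(1): m.start() for m in PLACEHOLDER.finditer(segment)}
    return [
        (open_id, close_id)
        for open_id, close_id in pairs
        if open_id in position
        and close_id in position
        and position[close_id] < position[open_id]
    ]


original = 'This is page <x id="1"/><g id="2">1</g><x id="3"/>.'
edited = 'This is page <x id="3"/><x id="1"/><g id="2">1</g>.'
pairs = [("1", "3")]  # assumption: 1 holds <w:fldSimple ...>, 3 holds </w:fldSimple>

print(misordered_pairs(original, pairs))  # []
print(misordered_pairs(edited, pairs))    # [('1', '3')]
```

A non-empty result could trigger the warning requested in comment 2. What to do next (escape, delete, or restore the original order) is left open here.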