XLIFF Splitter Step: normalise line endings on writing
This is a follow-up for pull request #712.
Jim Hargrave Work
[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - Differences between C:\Users\fooba\git\okapi\okapi\steps\xliffsplitter\target\test-classes\net\sf\okapi\steps\xliffsplitter\multiple_files.xlf and C:\Users\fooba\git\okapi\okapi\steps\xliffsplitter\target\test-classes\out\net\sf\okapi\steps\xliffsplitter\to_join_multiple_files\multiple_files_CONCAT.xlf:
[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - - Expected text value '
' but was '
' - comparing <body ...>
</body> at /xliff[1]/file[1]/body[1]/text()[1] to <body ...>
</body> at /xliff[1]/file[1]/body[1]/text()[1] (DIFFERENT)
[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - - Expected text value '
' but was '
' - comparing <body ...>
</body> at /xliff[1]/file[2]/body[1]/text()[5] to <body ...>
</body> at /xliff[1]/file[2]/body[1]/text()[5] (DIFFERENT)
java.lang.AssertionError
at org.junit.Assert.fail(Assert.java:87)
at org.junit.Assert.fail(Assert.java:96)
at net.sf.okapi.steps.xliffsplitter.XmlDocumentsComparison.compareXML(XmlDocumentsComparison.java:70)
at net.sf.okapi.steps.xliffsplitter.XmlDocumentsComparison.compareWithGold(XmlDocumentsComparison.java:56)
at net.sf.okapi.steps.xliffsplitter.XliffJoinerTest.joinXliffContainingMultipleFileElementsSplitIntoMultipleParts(XliffJoinerTest.java:83)
Maybe whitespace normalization?
Denis Konovalyenko
@jhargrave-straker thank you for the error log!
The splitter and joiner seem to use net.sf.okapi.common.BOMNewlineEncodingDetector
to detect the line ending. So, the line ending is written as CRLF (\r\n)
on the Windows platform or when a document has CRLF line endings.
If we take a look at the XML spec (section 2.11, End-of-Line Handling), we can find that
the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
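The quoted rule can be observed with the JDK's built-in DOM parser. This is only a minimal sketch: the class and method names (LineBreakNormalizationDemo, parseBodyText) are made up for illustration and are not part of Okapi.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical demo class, not Okapi code.
public class LineBreakNormalizationDemo {

    // Parses the given XML string and returns the text content of its root element.
    public static String parseBodyText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        // Literal CRLF and a lone CR in the input are both reported as LF by the parser.
        String fromCrLf = parseBodyText("<body>a\r\nb</body>");
        String fromCr = parseBodyText("<body>a\rb</body>");
        System.out.println(fromCrLf.equals("a\nb"));
        System.out.println(fromCr.equals("a\nb"));
    }
}
```

Note that this normalisation applies only to literal line breaks in the input. A character reference such as &#xD; is resolved after normalisation, so it still reaches the application as a CR.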
Also, Woodstox and the JDK's internal StAX implementation behave a bit differently when writing character events (which is acceptable, as far as I understand).
So, a solution would be to fix the line ending on writing and always use LF (the #xA character), regardless of what was detected.
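A rough sketch of that idea follows. It is not the change made in the pull request; the class and method names (LfNormalizer, toLf, writeWithLf) are hypothetical and only illustrate converting every line break to LF before the text reaches the output.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.Writer;

// Hypothetical helper, not Okapi code.
public class LfNormalizer {

    // Replaces CRLF and any lone CR with a single LF.
    public static String toLf(String text) {
        return text.replaceAll("\\r\\n?", "\n");
    }

    // Writes the text with LF line endings, ignoring the detected or platform line break.
    public static void writeWithLf(Writer out, String text) throws IOException {
        out.write(toLf(text));
    }

    public static void main(String[] args) throws IOException {
        StringWriter out = new StringWriter();
        writeWithLf(out, "<body>\r\n</body>\r");
        // The written text contains no CR characters any more.
        System.out.println(out.toString().indexOf('\r') < 0);
    }
}
```

With the output fixed to LF, the written file no longer depends on the platform or on the line endings of the input document.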
Comments (2)
reporter - changed status to resolved
Pull request #714 was merged.
A related pull request #714 was opened.