XLIFF Splitter Step: normalise line endings on writing

Issue #1302 resolved
Denis Konovalyenko created an issue

This is a follow-up for pull request #712.

Jim Hargrave Work

[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - Differences between C:\Users\fooba\git\okapi\okapi\steps\xliffsplitter\target\test-classes\net\sf\okapi\steps\xliffsplitter\multiple_files.xlf and C:\Users\fooba\git\okapi\okapi\steps\xliffsplitter\target\test-classes\out\net\sf\okapi\steps\xliffsplitter\to_join_multiple_files\multiple_files_CONCAT.xlf: 
[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - - Expected text value '


' but was '


' - comparing <body ...>


</body> at /xliff[1]/file[1]/body[1]/text()[1] to <body ...>


</body> at /xliff[1]/file[1]/body[1]/text()[1] (DIFFERENT) 
[main] WARN net.sf.okapi.steps.xliffsplitter.XliffJoinerTest - - Expected text value '


' but was '


' - comparing <body ...>


</body> at /xliff[1]/file[2]/body[1]/text()[5] to <body ...>


</body> at /xliff[1]/file[2]/body[1]/text()[5] (DIFFERENT) 

java.lang.AssertionError
    at org.junit.Assert.fail(Assert.java:87)
    at org.junit.Assert.fail(Assert.java:96)
    at net.sf.okapi.steps.xliffsplitter.XmlDocumentsComparison.compareXML(XmlDocumentsComparison.java:70)
    at net.sf.okapi.steps.xliffsplitter.XmlDocumentsComparison.compareWithGold(XmlDocumentsComparison.java:56)
    at net.sf.okapi.steps.xliffsplitter.XliffJoinerTest.joinXliffContainingMultipleFileElementsSplitIntoMultipleParts(XliffJoinerTest.java:83)

Maybe whitespace normalization?

Denis Konovalyenko

@jhargrave-straker thank you for the error log!

The splitter and joiner seem to be using net.sf.okapi.common.BOMNewlineEncodingDetector for line-ending detection. So, the carriage return is written as

&#xd;

on the Windows platform or when a document uses CRLF line endings.

If we take a look at the XML spec (the section on end-of-line handling), we can find that

the XML processor MUST behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating both the two-character sequence #xD #xA and any #xD that is not followed by #xA to a single #xA character.
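
To illustrate the rule quoted above, here is a minimal sketch using the JDK's built-in DOM parser. It is not taken from the Okapi code base; the class name LineBreakDemo and the helper bodyText are hypothetical. It compares a literal CRLF with a carriage return written as the character reference &#xd;. The literal one is normalised to LF on input, while the character reference is expanded after normalisation and so survives as a real #xD in the text node. That would be consistent with the text-value differences in the log, although the log itself does not show which characters differ.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical demo class, not part of Okapi.
class LineBreakDemo {

    // Parses a small XML string and returns the text content of its root element.
    static String bodyText(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement()
                    .getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Literal CRLF: the parser normalises it to a single LF.
        String literal = bodyText("<body>a\r\nb</body>");
        // CR written as a character reference: it is kept, so the text holds CR + LF.
        String reference = bodyText("<body>a&#xd;\nb</body>");

        System.out.println("literal CRLF becomes LF: " + literal.equals("a\nb"));
        System.out.println("&#xd; keeps the CR: " + reference.equals("a\r\nb"));
    }
}
```

So a document that carries &#xd; is not equivalent, after parsing, to one that carries plain line breaks, even though both look the same in an editor.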

Also, Woodstox and the JDK's built-in StAX implementation behave a bit differently when writing character events (and this is acceptable as far as I can understand).

So, a solution would be to fix the line ending on writing and make it implicit: always LF (the #xA character), regardless of the platform or the detected line ending of the input.
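
A minimal sketch of that idea, assuming the text is converted to LF before it reaches the StAX writer. This is not the actual change made in the splitter or joiner; the class LfOnlyWriting and the helpers toLf and writeBody are hypothetical names used only for illustration.

```java
import java.io.StringWriter;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamWriter;

// Hypothetical demo class, not part of Okapi.
class LfOnlyWriting {

    // Converts CRLF and lone CR to LF, mirroring what an XML parser does on input.
    static String toLf(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    // Writes <body>text</body> with whatever StAX implementation is on the classpath.
    static String writeBody(String text) {
        StringWriter out = new StringWriter();
        try {
            XMLStreamWriter writer =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            writer.writeStartElement("body");
            // No CR reaches the writer, so there is nothing to emit as &#xd;.
            writer.writeCharacters(toLf(text));
            writer.writeEndElement();
            writer.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(writeBody("line one\r\nline two"));
    }
}
```

Because no carriage return is handed to the writer, the output should not depend on how a particular StAX implementation chooses to serialise #xD.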
