RawDocument from CharSequence gets the wrong endianness on LE platforms (and corrupts Strings)
Issue #862
new
static void badEncodingOnLittleEndianPlatforms() {
    String source = "foo = First message.\r\nbar = Second message.";
    try (RawDocument rawDoc = new RawDocument(source, LocaleId.US_ENGLISH);
            PropertiesFilter filter = new PropertiesFilter()) {
        filter.open(rawDoc, true);
        Log.w("Okapi", rawDoc.getEncoding()); // UTF-16
        while (filter.hasNext()) {
            Event event = filter.next();
            if (event.isTextUnit()) {
                Log.w("Okapi", event.getTextUnit().toString());
            }
        }
    }
}
The output is:
㴀 䘀椀爀猀琀 洀攀猀猀愀最攀⸀ഀ戀愀爀 㴀 匀攀挀漀渀搀 洀攀猀猀愀最攀⸀
Escaped:
\u3D00\u2000\u4600\u6900\u7200\u7300\u7400\u2000\u6D00\u6500
\u7300\u7300\u6100\u6700\u6500\u2E00\u0D00\u0A00\u6200\u6100
\u7200\u2000\u3D00\u2000\u5300\u6500\u6300\u6F00\u6E00\u6400
\u2000\u6D00\u6500\u7300\u7300\u6100\u6700\u6500\u2E00
It is visibly an endianness problem. If we swap the bytes:
\u003D\u0020\u0046\u0069\u0072\u0073\u0074\u0020\u006D\u0065
\u0073\u0073\u0061\u0067\u0065\u002E\u000D\u000A\u0062\u0061
\u0072\u0020\u003D\u0020\u0053\u0065\u0063\u006F\u006E\u0064
\u0020\u006D\u0065\u0073\u0073\u0061\u0067\u0065\u002E
and then unescape, we get:
= First message.\r\nbar = Second message.
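The byte-swapped output matches what Java's generic UTF-16 charset does when no BOM is present: the decoder falls back to big-endian. A minimal standalone sketch (no Okapi involved) reproducing the same corruption:

```java
import java.nio.charset.StandardCharsets;

public class Utf16SwapDemo {
    public static void main(String[] args) {
        // "bar" encoded as UTF-16LE with no byte-order mark: 62 00 61 00 72 00
        byte[] le = "bar".getBytes(StandardCharsets.UTF_16LE);

        // The generic UTF-16 decoder sees no BOM and assumes big-endian,
        // so every code unit comes out byte-swapped: U+6200 U+6100 U+7200.
        String wrong = new String(le, StandardCharsets.UTF_16);
        String right = new String(le, StandardCharsets.UTF_16LE);

        System.out.println(wrong); // 戀愀爀 (same kind of corruption as above)
        System.out.println(right); // bar
    }
}
```

Note that 戀愀爀 (U+6200 U+6100 U+7200) is exactly the byte-swapped "bar" visible in the corrupted output above.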
Comments (2)
reporter - marked as minor
Taking Properties out of the equation, but still doing a bit of what it does:
- The detector works correctly: detector.getEncoding() in line 6 returns UTF-16LE on Android and UTF-16BE on a PC.
- The input.setEncoding in line 6 has no effect (it only logs "Cannot reset an encoding on a CharSequence input in RawDocument").
- Changing the encoding in line 9 to UTF-16LE solves the problem.
- Changing detector.detectAndRemoveBom(); in line 5 to detector.detectBom(); also solves the problem. This seems to be because, per StandardCharsets.UTF_16, the endianness in this case is "byte order identified by an optional byte-order mark". If we don't detectAndRemoveBom, then the BOM in the string helps the InputStream get the correct UTF-16 "flavor".
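The BOM-dependent behavior is easy to confirm outside Okapi: with the FF FE little-endian BOM left in place, the generic UTF-16 decoder picks the LE flavor (and consumes the BOM); strip the BOM first and the same decoder falls back to big-endian. A standalone sketch:

```java
import java.nio.charset.StandardCharsets;

public class Utf16BomDemo {
    public static void main(String[] args) {
        // "bar" in UTF-16LE, with the little-endian BOM FF FE kept.
        byte[] withBom = { (byte) 0xFF, (byte) 0xFE,
                           0x62, 0x00, 0x61, 0x00, 0x72, 0x00 };

        // The generic UTF-16 decoder consumes the BOM and switches to
        // little-endian, so the text decodes correctly.
        System.out.println(new String(withBom, StandardCharsets.UTF_16)); // bar

        // Remove the BOM first (roughly what detectAndRemoveBom does) and
        // the decoder assumes big-endian, byte-swapping every code unit.
        byte[] noBom = new byte[6];
        System.arraycopy(withBom, 2, noBom, 0, 6);
        System.out.println(new String(noBom, StandardCharsets.UTF_16)); // 戀愀爀
    }
}
```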