RawDocument from CharSequence gets the wrong endianness on LE platforms (and corrupts Strings)

Issue #862 new
Mihai Nita created an issue
    static void badEncodingOnLittleEndianPlatforms() {
        String source = "foo = First message.\r\nbar = Second message.";
        try (RawDocument rawDoc = new RawDocument(source, LocaleId.US_ENGLISH);
             PropertiesFilter filter = new PropertiesFilter()) {

            filter.open(rawDoc, true);

            Log.w("Okapi", rawDoc.getEncoding()); // UTF-16
            while (filter.hasNext()) {
                Event event = filter.next();
                if (event.isTextUnit()) {
                    Log.w("Okapi", event.getTextUnit().toString());
                }
            }
        }
    }

The output is:
㴀 䘀椀爀猀琀 洀攀猀猀愀最攀⸀ഀ਀戀愀爀 㴀 匀攀挀漀渀搀 洀攀猀猀愀最攀⸀

Escaped:

\u3D00\u2000\u4600\u6900\u7200\u7300\u7400\u2000\u6D00\u6500
\u7300\u7300\u6100\u6700\u6500\u2E00\u0D00\u0A00\u6200\u6100
\u7200\u2000\u3D00\u2000\u5300\u6500\u6300\u6F00\u6E00\u6400
\u2000\u6D00\u6500\u7300\u7300\u6100\u6700\u6500\u2E00

It is visibly an endianness problem. If we swap the bytes:

\u003D\u0020\u0046\u0069\u0072\u0073\u0074\u0020\u006D\u0065
\u0073\u0073\u0061\u0067\u0065\u002E\u000D\u000A\u0062\u0061
\u0072\u0020\u003D\u0020\u0053\u0065\u0063\u006F\u006E\u0064
\u0020\u006D\u0065\u0073\u0073\u0061\u0067\u0065\u002E

And after unescaping we get:
= First message.\r\nbar = Second message.
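The byte swap above can be reproduced with the plain JDK, without Okapi. This is a minimal standalone sketch (the class name is hypothetical): serializing a string as UTF-16LE without a BOM and decoding it with the generic "UTF-16" charset, which falls back to big-endian when no BOM is present, yields exactly the swapped code points shown above.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical standalone demo, not Okapi code: decoding BOM-less UTF-16LE
// bytes with the generic "UTF-16" charset assumes big-endian and swaps bytes.
public class EndianSwapDemo {
    public static void main(String[] args) {
        String source = "= First message.";
        // Serialize as little-endian UTF-16, with no byte-order mark.
        byte[] leBytes = source.getBytes(StandardCharsets.UTF_16LE);
        // Decode as plain "UTF-16": without a BOM, Java defaults to big-endian.
        String garbled = new String(leBytes, StandardCharsets.UTF_16);
        // '=' (U+003D) comes back as U+3D00, as in the escaped dump above.
        System.out.println(garbled.charAt(0) == '\u3D00'); // prints "true"
    }
}
```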

Comments (2)

  1. Mihai Nita reporter

    Taking Properties out of the equation, but still doing a bit of what it does:

    static void badEncodingOnLittleEndianPlatforms() {
        String source = "foo = First message.\r\nbar = Second message.";
        try (RawDocument input = new RawDocument(source, LocaleId.US_ENGLISH)) {
            BOMNewlineEncodingDetector detector =
                    new BOMNewlineEncodingDetector(input.getStream(), input.getEncoding());
            detector.detectAndRemoveBom();
            input.setEncoding(detector.getEncoding()); // UTF-16LE
            String encoding = input.getEncoding(); // UTF-16

            BufferedReader reader = new BufferedReader(
                    new InputStreamReader(detector.getInputStream(), encoding));
            String line = reader.readLine();
            Log.e("Okapi1", line);
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
    

    The detector works correctly: detector.getEncoding() returns UTF-16LE on Android and UTF-16BE on a PC.

    The input.setEncoding(detector.getEncoding()) call has no effect; it only logs "Cannot reset an encoding on a CharSequence input in RawDocument".

    Hard-coding the encoding passed to the InputStreamReader to UTF-16LE solves the problem.

    Changing detector.detectAndRemoveBom() to detector.detectBom() also solves the problem.

    This seems to be because, according to the StandardCharsets.UTF_16 documentation, the byte order in this case is "identified by an optional byte-order mark".

    If we don't detectAndRemoveBom, the BOM left in the stream helps the InputStreamReader pick the correct UTF-16 "flavor".
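    The BOM's role can be shown with a short JDK-only sketch (the class name is hypothetical): the "UTF-16" decoder honors a leading byte-order mark when one is present, and only falls back to big-endian once the BOM has been stripped, which is exactly what detectAndRemoveBom does.

    ```java
    import java.nio.charset.StandardCharsets;

    // Hypothetical demo of the point above: the "UTF-16" charset reads the
    // BOM if present; with the BOM removed it assumes big-endian.
    public class BomFlavorDemo {
        public static void main(String[] args) {
            byte[] withBom = {(byte) 0xFF, (byte) 0xFE, 0x41, 0x00}; // LE BOM + 'A'
            byte[] noBom   = {0x41, 0x00};                           // 'A' in LE, BOM stripped
            System.out.println(new String(withBom, StandardCharsets.UTF_16)); // "A"
            System.out.println(new String(noBom, StandardCharsets.UTF_16));   // U+4100, not 'A'
        }
    }
    ```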
