Some Unicode characters trigger "InlineText file is de-synchronized" error while others don't

ysavourel

Now that is an interesting one.

The issue is caused by the code that tries to auto-detect the encoding of the file.

The file is in UTF-8 without a BOM and starts with an 'L'. After checking for the various BOM sequences, the code starts trying to guess the encoding. It happens that the byte for 'L' corresponds to a '<' in EBCDIC, and that sends the logic down the EBCDIC path. In the end the code has seen nothing to contradict that assumption (because there is an extended character in UTF-8 right after the 'L', the bytes look like an EBCDIC pattern), so the encoding detector concludes that the file is encoded in CP037.

CP037 is not ASCII-based, so the line breaks are interpreted differently. You end up with the first line of the file being read all the way to the end of the file, which causes the desynchronization error.
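To make the effect concrete, here is a minimal sketch in Python rather than in Okapi's own code. It assumes made-up sample text (an 'L' followed by a curly apostrophe) and uses Python's `cp037` codec, whose mapping of control characters may differ in details from the decoder Okapi actually uses. It only shows that the same bytes start with '<' and contain no line feed once they are read as CP037.

```python
# Hypothetical sample: UTF-8 without BOM, 'L' followed by a curly apostrophe (U+2019).
data = "L\u2019exemple\nSecond line\n".encode("utf-8")

as_utf8 = data.decode("utf-8")    # what the file really contains
as_cp037 = data.decode("cp037")   # what a wrong EBCDIC guess produces

# Under CP037 the byte 0x4C ('L' in ASCII/UTF-8) decodes to '<'.
print(repr(as_cp037[0]))

# The ASCII line feed byte 0x0A is not a line feed in CP037,
# so the mis-decoded text has no '\n' and reads as one long line.
print("line feeds as UTF-8:", as_utf8.count("\n"))
print("line feeds as CP037:", as_cp037.count("\n"))
```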

@oraveczcsaba: There are a few choices for the workaround:

Add a BOM to the .mos file so that it's detected as UTF-8,
Or convert the .mos file to UTF-16 (maybe easier on Linux),
Or replace the curly apostrophe with an ASCII one.
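For illustration, the three workarounds above could be sketched in Python as below. The function names are hypothetical helpers, not part of Okapi, and the sketch assumes the .mos content has already been read as UTF-8 bytes.

```python
import codecs

def add_utf8_bom(data: bytes) -> bytes:
    """Prefix the UTF-8 BOM unless it is already there."""
    if data.startswith(codecs.BOM_UTF8):
        return data
    return codecs.BOM_UTF8 + data

def to_utf16(data: bytes) -> bytes:
    """Re-encode UTF-8 bytes as UTF-16; Python's "utf-16" codec writes a BOM."""
    # "utf-8-sig" drops a leading UTF-8 BOM if one is present.
    return data.decode("utf-8-sig").encode("utf-16")

def ascii_apostrophe(data: bytes) -> bytes:
    """Replace the curly apostrophe (U+2019) with a plain ASCII one."""
    return data.decode("utf-8").replace("\u2019", "'").encode("utf-8")

# Made-up sample content, only for demonstration.
sample = "L\u2019exemple\n".encode("utf-8")
print(add_utf8_bom(sample)[:3])
print(to_utf16(sample)[:2])
print(ascii_apostrophe(sample))
```

Any one of the three should be enough on its own, since each removes the byte pattern or the ambiguity that lets the detector settle on CP037.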

@jhargrave, @tingley, @mnita_google: The long-term (and simplest) solution is probably for us to get rid of the code that tries to detect EBCDIC encodings. I doubt we process many files in EBCDIC.

2017-12-08T16:05:39+00:00

ysavourel

Actually, another workaround is to move the <g id="1"> code to the front (before the 'L'), as this is where it should be: the whole sentence is bold in English.

2017-12-08T16:27:53+00:00

Mihai Nita

I agree, I don't think anybody uses EBCDIC these days :-)

Also, looking at the command line, I see that the encoding is specified for both input and output (-ie utf8 -oe utf8), so I would really expect no detection at all... But that does not apply to the .mos file... Hmmm...