Some unicode characters trigger InlineText file is de-synchronized error while others don't

Issue #660 resolved
Csaba Oravecz created an issue

We get a weird error when trying to merge source formatting into a translated document using Tikal with the following command:

tikal.sh -lm -fc okf_openxml -sl en  -seg ./tikal/config/okapi_default_icu4j.srx -ie utf8 -oe utf8 -overtrg -from fr.mos data.docx

A character sequence like in the enclosed fr.mos file:

L’<g id="1"> exploration ...

triggers the error if the <g> tag is preceded by certain characters (such as U+2019 in second position here), whereas characters in lower ranges (below U+0800) seem to be safe. There must be some reason behind this, and any help would be greatly appreciated.

Comments (7)

  1. YvesS

    Now that one is an interesting one.

    The issue is caused by the code that tries to auto-detect the encoding of the file.

    It happens that the file is in UTF-8 without a BOM and starts with an 'L'. After checking for the various BOM sequences, the code starts trying to guess the encoding. It happens that the byte for 'L' corresponds to a '<' in EBCDIC, and that triggers the logic to go down that path. In the end the code sees nothing to contradict that assumption (because an extended UTF-8 character follows the 'L', the bytes look like an EBCDIC pattern), and the encoding detector concludes that the file is encoded in CP037.

    CP037 is not ASCII-based, so the line breaks are interpreted differently: the first line of the file is read all the way to the end of the file, which causes the desynchronization error.
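The effect described above can be reproduced outside Okapi with a few lines of Python. This is only an illustration of the byte-level behaviour, not Okapi's detection code: the sample string is modelled on the sequence in the report, and it relies on Python's built-in `cp037` codec as a stand-in for the CP037 decoding.

```python
# Sample content modelled on the report: 'L', U+2019, an inline code, two lines.
sample = 'L\u2019<g id="1"> exploration\nsecond line\n'
utf8_bytes = sample.encode("utf-8")

# 'L' is the single byte 0x4C; U+2019 becomes the three bytes E2 80 99.
first_bytes = utf8_bytes[:4].hex(" ")
print(first_bytes)            # 4c e2 80 99

# Decode the very same bytes as CP037 (an EBCDIC code page).
as_cp037 = utf8_bytes.decode("cp037")

# Byte 0x4C is '<' in CP037, which is what makes the file look like markup.
print(repr(as_cp037[0]))      # '<'

# The ASCII line feed (0x0A) is not a line feed in CP037,
# so the decoded text contains no '\n' at all: it is one long "line".
print("\n" in as_cp037)       # False
```

Every byte value decodes to something in CP037, so the wrong guess never raises an error on its own; the problem only surfaces later, when the reader expects one entry per line.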

    @oraveczcsaba: There are a few choices for the workaround:

    • Add a BOM to the .mos file, so it's detected as UTF-8,
    • Or convert the .mos file to UTF-16 (maybe easier on Linux),
    • Or replace the curly apostrophe by an ASCII one.
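For the first workaround, a minimal sketch of adding the BOM with Python is shown here. The helper name `add_utf8_bom` is made up for this example, and the sketch assumes the file is already valid UTF-8; it only prepends the three BOM bytes and leaves the rest of the content untouched.

```python
import codecs
from pathlib import Path


def add_utf8_bom(path):
    """Prepend a UTF-8 BOM to the file at `path` unless it already has one.

    Hypothetical helper for illustration. Returns True if the file was
    modified, False if a BOM was already present.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        return False  # nothing to do
    p.write_bytes(codecs.BOM_UTF8 + data)
    return True
```

Calling `add_utf8_bom("fr.mos")` before running the merge would then let the BOM check identify the file as UTF-8 before any guessing takes place.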

    @jhargrave, @tingley, @mnita_google: The long term (and simplest) solution is probably for us to get rid of the code trying to detect EBCDIC encoding. I doubt we process many files in EBCDIC.

  2. YvesS

    Actually, another workaround is to move the <g id="1"> code at the front (before the 'L') as this is where it should be: the whole sentence is bold in English.

  3. Mihai Nita

    I agree, I don't think anybody uses EBCDIC these days :-)

    Also, looking at the command line, I see that the encoding is specified for both input and output (-ie utf8 -oe utf8), so I would really expect no detection at all... But that does not apply to the .mos file... Hmmm...

  4. Mihai Nita

    Maybe we can add a UTF-8 detection before trying EBCDIC at all...

    Although, after reading the code, I admit this is a rare one... :-) Start with an L, followed by 3 bytes > 80h. How often does that happen...
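The idea of checking for UTF-8 before considering EBCDIC could look roughly like the sketch below. It is written in Python purely for illustration and is not Okapi's detector: the function name `guess_encoding`, the `cp1252` fallback, and the one-byte EBCDIC test are all assumptions made for this example.

```python
import codecs

# BOMs to test, UTF-32 before UTF-16 because the UTF-32 LE BOM
# begins with the same two bytes as the UTF-16 LE BOM.
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
)


def guess_encoding(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical detector: BOM first, then strict UTF-8, then EBCDIC."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name

    # Strict UTF-8 validation. Arbitrary non-UTF-8 bytes rarely pass it,
    # so a successful decode is strong evidence for UTF-8.
    try:
        data.decode("utf-8")
        return "utf-8"
    except UnicodeDecodeError:
        pass

    # Only now consider EBCDIC. This is a crude stand-in for a real
    # heuristic: 0x4C is '<' in CP037, the usual first character of markup.
    if data[:1] == b"\x4c":
        return "cp037"
    return fallback


# The sequence from the report is valid UTF-8, so EBCDIC is never reached.
print(guess_encoding('L\u2019<g id="1">'.encode("utf-8")))   # utf-8
# Genuine CP037 markup is not valid UTF-8 and still reaches the EBCDIC test.
print(guess_encoding("<?xml".encode("cp037")))               # cp037
```

With this ordering, a file has to fail UTF-8 validation before the EBCDIC path is even considered, which removes the false positive without dropping EBCDIC support entirely.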

  5. Chase Tingley

    I am fine with either using Mihai's suggestion of UTF-8 detection first, or with removing EBCDIC support.

  6. Oytun Tez

    So funny, we keep encountering this with our French plain text documents! :) We will try adding a BOM to the documents that don't have one.

    Thanks for the insights!
