Emoji disappears when converted from XLIFF to DOCX

Issue #680 resolved
Igor Popyk created an issue

I have a document (.docx) that contains emojis. After converting this document to the XLIFF format, the emojis are still present, but after converting that XLIFF back to the DOCX format, the emojis disappear. In this version: https://bitbucket.org/okapiframework/okapi/branch/xliff2-improvment everything works correctly. Is it possible to merge the changes from this branch into master? The file with emojis is attached. I look forward to your reply, and thank you for the help!

Comments (12)

  1. Chase Tingley

    Hi Igor, thanks for the report and the easy testcase. I can confirm this on 0.36-SNAPSHOT. The extracted XLIFF looks fine but we lose the emoji during merge.

    The tikal output during merge is suspicious:

    Filter configuration: okf_openxml
    XLIFF: test-file-3.docx.xlf
    Output: test-file-3.out.docx
    Input: /home/tingley/Downloads/test-file-3.docx.xlf
    Error: Text Unit source mismatch during merge: Original id="NFDBB2FA9-tu1" target id="NFDBB2FA9-tu1"
    Original Source="Gifts🎁,diamonds💎andcash💸"
    Translated Source="Gifts,diamondsandcash"
    

    I have a suspicion we're not reading the emoji correctly when we parse the XLIFF back in, but that needs to be confirmed.

    I'm clearing the milestone field (that's to indicate what version the fix occurs in).

  2. Jim Hargrave (OLD)

    I suspect XmlInputStreamReader may be the problem; this was working up until M33. This is one we would like to fix before the M36 release.
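
For readers following the suspicion above, here is a minimal sketch of one generic way a reader can silently drop emoji. It is not confirmed to be what XmlInputStreamReader does, and the class and method names (`SurrogateFilterDemo`, `filterByChar`, `filterByCodePoint`) are hypothetical, not Okapi APIs. Emoji such as U+1F381 are stored in Java strings as surrogate pairs, and a filter that applies the XML 1.0 `Char` ranges to individual UTF-16 code units rejects both halves of the pair, which would produce exactly the "Gifts,diamondsandcash" symptom seen in the tikal output.

```java
final class SurrogateFilterDemo {
    // XML 1.0 "Char" production: the set of code points allowed in a document.
    static boolean isXmlChar(int c) {
        return c == 0x9 || c == 0xA || c == 0xD
            || (c >= 0x20 && c <= 0xD7FF)
            || (c >= 0xE000 && c <= 0xFFFD)
            || (c >= 0x10000 && c <= 0x10FFFF);
    }

    // Buggy variant: tests each UTF-16 code unit on its own, so the two
    // surrogates that make up an emoji (0xD800-0xDFFF) are both rejected.
    static String filterByChar(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (isXmlChar(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    // Correct variant: tests whole code points, so U+1F381 is accepted.
    static String filterByCodePoint(String s) {
        StringBuilder sb = new StringBuilder();
        s.codePoints()
         .filter(SurrogateFilterDemo::isXmlChar)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    public static void main(String[] args) {
        // "Gifts" + U+1F381 + ",diamonds" + U+1F48E, written with escapes so the
        // example does not depend on the source file encoding.
        String input = "Gifts\uD83C\uDF81,diamonds\uD83D\uDC8E";
        System.out.println(filterByChar(input));                    // Gifts,diamonds
        System.out.println(filterByCodePoint(input).equals(input)); // true
    }
}
```

Comparing the two variants on the same input is a quick way to check whether a given reading step works on code units or on code points.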

  3. Jim Hargrave (OLD)

    I have a feature branch that may fix this and would appreciate a code review (branch feature/Issue_#680). Note that on that same branch I had to disable a markdown test because it was failing.

  4. Chase Tingley

    @jimhargrave That code looks good to me and I can see it fixes Igor's original case. Do you want to open a formal PR, or can I just merge it? I'm going to add the markdown test back in; I think it's a Windows-only failure. I will talk to Kuro about it.

  5. Jim Hargrave (OLD)

    Yves would like to change the raw emoji characters to a Unicode escape sequence to protect against encoding differences across platforms.
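
As an illustration of the idea in the comment above, the sketch below replaces every supplementary code point (which includes these emoji) with an XML numeric character reference, so the written file is plain ASCII at those positions regardless of platform encoding. The thread does not say which escape format Yves has in mind, so the `&#x...;` form is an assumption, and `EmojiEscaper` and `escapeSupplementary` are hypothetical names, not Okapi APIs.

```java
final class EmojiEscaper {
    // Replace each code point above U+FFFF with an XML numeric character
    // reference such as &#x1F381; and copy everything else unchanged.
    static String escapeSupplementary(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (Character.isSupplementaryCodePoint(cp)) {
                sb.append("&#x")
                  .append(Integer.toHexString(cp).toUpperCase())
                  .append(';');
            } else {
                sb.appendCodePoint(cp);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // The source text from the tikal error, with emoji written as escapes.
        String source = "Gifts\uD83C\uDF81,diamonds\uD83D\uDC8Eandcash\uD83D\uDCB8";
        System.out.println(escapeSupplementary(source));
        // Gifts&#x1F381;,diamonds&#x1F48E;andcash&#x1F4B8;
    }
}
```

A conforming XML parser resolves these references back to the original characters on read, so the escaping only changes the bytes on disk and not the text a consumer sees.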
