HTML Filter: RTL (Arabic) texts are not encoded to UTF-8 in the merged file

Issue #1075 resolved
Handika Dwi created an issue

Is this expected behavior in Okapi, or in HTML in general?
I don’t have deep knowledge of RTL handling in HTML.

Comments (6)

  1. Chase Tingley

    First of all, if you open the document in a browser, you’ll see that the Arabic renders fine. Numeric entity escaping is a valid notation.

    The escaping is happening because somehow the meta charset tag in your file switched to US-ASCII. I am not sure what did this, but it is not the default behavior of the HTML filter.

    Using the attached XLIFF with dummy (machine translation) Arabic targets, I can use tikal with the default HTML config to merge a target file that is in UTF-8.
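To illustrate why the escaped output still renders correctly, here is a minimal Python sketch (not Okapi code; the sample string is made up for illustration). It shows that writing Arabic text to a US-ASCII target forces numeric character references, and that an HTML parser resolves those references back to the same characters.

```python
import html

# Hypothetical sample text; any Arabic string behaves the same way.
arabic = "مرحبا بالعالم"

# US-ASCII cannot represent Arabic letters, so each one is written
# as a decimal numeric character reference such as &#1605;.
escaped = arabic.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# html.unescape stands in for what a browser does when it parses the page:
# it resolves each reference back to the original character.
restored = html.unescape(escaped)

print(escaped)
print(restored == arabic)
```

The escaped form is larger and harder to read in a text editor, but it carries the same content as the UTF-8 form.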

  2. Chase Tingley

    I don’t think so. I merged with tikal using your FPRM and still got the same result as before. However, I can reproduce the result you are seeing if I run tikal with -oe US-ASCII to force the output encoding. How are you calling Okapi?
