XML filter: wrong/unwanted encoding/decoding of entities

Issue #705 new
Former user created an issue

I am using Okapi Rainbow and its XML filter.

My XML source file looks like this:

<L0x0x19_ToolTip Text="User defined name for the unit. The value&#xD;&#xA;is active after a restart of the unit.&#xD;&#xA;&#xD;&#xA;Factory setting: exact  type description of the current unit" LastChanged="2012.12.13" />

The XLIFF output file created by Rainbow (OmegaT Project) using the XML filter looks like this:

<source xml:lang="en-us">User defined name for the unit. The value&#13; is active after a restart of the unit.&#13; &#13; Factory setting: exact type description of the current unit</source>

As you can see, the notations &#xD; and &#xA; are not kept: &#xD; is rewritten as &#13; and &#xA; no longer appears as a character reference.

But I definitely need the notations to stay as they are. I tried several filter options, and I also tried using a code finder rule to convert those notations into tags, with no success.

It seems as if the filter does not apply the code finder rules at all, or applies them only after the entities have already been replaced.

Also the code finder rule syntax does not seem to be the problem, since other test replacements for simple words work fine.

I ran an XML validator on the file to eliminate other possible causes.

I also tried to use the Encoding Conversion Step in a custom pipeline, but this does not help me on this issue.

Here is my filter configuration:

<?xml version="1.0" encoding="UTF-8" standalone="no"?><its:rules xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options" xmlns:xlink="http://www.w3.org/1999/xlink" version="1.0">
<!-- See ITS specification at: http://www.w3.org/TR/its/ -->
<its:translateRule selector="*//@Text" translate="yes"/>
<its:translateRule selector="*//@LbaText" translate="yes"/>
<its:translateRule selector="//*" itsx:whiteSpaces="preserve" translate="yes"/>
<okp:codeFinder useCodeFinder="yes">#v1
count.i=1
rule0=&amp;#?.+?;
 </okp:codeFinder>
</its:rules>

I also checked https://okapiframework.org/wiki/index.php?title=XML_Filter#lineBreakAsCode for help or options, but with no success.

It looks like a bug. Or am I doing something wrong?
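
One possible explanation for a code finder rule that never matches is that a conforming XML parser resolves character references before any application code sees the text. The sketch below uses Python's standard xml.etree.ElementTree, not Okapi, so it only illustrates general parser behaviour. Whether the XML filter applies its code finder at the same stage is an assumption, and the attribute text is shortened from the report.

```python
import re
import xml.etree.ElementTree as ET

# Shortened version of the attribute from the report.
RAW = '<L0x0x19_ToolTip Text="The value&#xD;&#xA;is active after a restart." />'

# rule0 from the filter configuration, with &amp; unescaped to &.
ENTITY_RULE = re.compile(r"&#?.+?;")

# On the raw file, the pattern matches both character references.
raw_matches = ENTITY_RULE.findall(RAW)

# After parsing, the references have become real CR and LF characters,
# so the same pattern has nothing left to match.
parsed_text = ET.fromstring(RAW).get("Text")
parsed_matches = ENTITY_RULE.findall(parsed_text)

print(raw_matches)
print(repr(parsed_text))
print(parsed_matches)
```

If the filter behaves the same way, a rule written against the escaped notation can only work on text that has not been parsed yet.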

Comments (3)

  1. YvesS

    I can reproduce the behavior.

    First, regardless of the filter's behavior, you should be aware (or make the people who designed this XML file aware) that putting translatable text in attributes is strongly discouraged, for many reasons. See https://www.w3.org/TR/xml-i18n-bp/#DevAttributes for more details.

    One of the issues is that whitespace (including line breaks) in attribute values has complicated normalization rules (see https://www.w3.org/TR/REC-xml/#AVNormalize), and there is no certainty that it will survive some parsers. I would recommend changing the format to store the translatable text in elements, if you can (but obviously that may not be a choice you have).

    Now, for the XML filter. If you set the whitespace to be preserved, the line-feed (\n) is actually in the XLIFF file (as a line break) and gets converted back to a non-escaped line-feed. So you could do a search and replace of &#13;\n by &#x0D;&#x0A; to get back the original notation values.

    Note that <its:translateRule selector="//*" itsx:whiteSpaces="preserve" translate="yes"/> does not preserve the whitespace in attributes (the selector here is for all elements, and that does not include their attributes).

    Instead, put the flag in the attribute rule: <its:translateRule selector="*//@Text" itsx:whiteSpaces="preserve" translate="yes"/>

    As for a better solution than the search and replace:

    There is an escapeLinebreak option in the XML encoder that the filter uses, and it seems to be turned on for some other XML-based formats. But it is not turned on for the XML Filter itself and, as far as I know, there is no option to turn that flag on in the rules. One solution would be to add that option to the filter.

    -ys
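
The attribute-value normalization mentioned above can be observed with any conforming parser. This is a minimal sketch using Python's standard xml.etree.ElementTree rather than Okapi; the element name tip is made up for the illustration. It shows that a literal line break in an attribute is replaced by a space, while line breaks written as character references survive parsing as real CR and LF characters.

```python
import xml.etree.ElementTree as ET

# A literal line break inside an attribute value is normalized to a space.
literal = ET.fromstring('<tip Text="line one\nline two" />').get("Text")

# Line breaks written as character references survive parsing as CR and LF.
referenced = ET.fromstring('<tip Text="line one&#xD;&#xA;line two" />').get("Text")

print(repr(literal))
print(repr(referenced))
```

This is why the references have to be written out again as &#xD;&#xA; (or an equivalent notation) on output: a serializer that writes plain line breaks into the attribute would lose them on the next parse.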

  2. Sebastian Ebert

    Thanks for your help and workaround suggestions!

    Indeed, I am not able to move the translatable text from the attributes into elements. That's a restriction I will have to live with. But I corrected the whitespace rule so that it is applied to the attributes as well.

    Unfortunately, when doing the post-processing in Rainbow, it does not output &#13;\n, but real line feeds instead. That's why I can't do the substitution afterwards.

    However your suggestion brought me to a different solution: In a first step I use a search & replace on the raw document (without using the filter). I replace &#x0D;&#x0A; by [CRLF]. In a second step, I process the new files with the XML filter and a code finder rule that replaces [CRLF] with a tag. After the translation process and post processing in Rainbow, I do another search and replace the other way around. This works fine, as long as &#x0D;&#x0A; is only used within the translatable text in the source XML file.

    Does it make sense to create a feature request for enabling escapeLinebreak on the XML filter?
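
The pre- and post-processing steps of the workaround above could be scripted roughly as follows. This is a sketch under stated assumptions: the function names are hypothetical, the placeholder [CRLF] is taken from the comment, and the reference is spelled &#xD;&#xA; as in the source file of the original report (adjust it if your files use &#x0D;&#x0A;). The XML filter run in between is not shown.

```python
# Character-reference spelling used in the source file from this issue.
CRLF_REFERENCE = "&#xD;&#xA;"
# Placeholder from the workaround; it must not occur anywhere else in the file.
PLACEHOLDER = "[CRLF]"


def protect_line_breaks(raw_xml: str) -> str:
    """Step 1: swap the character references for a plain-text placeholder."""
    if PLACEHOLDER in raw_xml:
        raise ValueError("placeholder already occurs in the input")
    return raw_xml.replace(CRLF_REFERENCE, PLACEHOLDER)


def restore_line_breaks(processed_xml: str) -> str:
    """Step 3: put the character references back after post-processing."""
    return processed_xml.replace(PLACEHOLDER, CRLF_REFERENCE)


original = '<L0x0x19_ToolTip Text="The value&#xD;&#xA;is active." />'
protected = protect_line_breaks(original)
restored = restore_line_breaks(protected)

print(protected)
print(restored == original)
```

Both functions work on the raw file text, so they carry the same limitation as the manual search and replace: every occurrence of the reference is replaced, whether or not it sits in translatable text.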
