OpenXml filter adds tags around nnbsp by default

Issue #1200 new
Manuel Souto Pico created an issue

Request

Add an option to the Okapi OpenXML filter to put tags around narrow no-break spaces (U+202F).

The default value of that option should be FALSE (no tags), that is, the behaviour of the OmegaT plugin before version 1.13-1.45.0.

Background

After the release of the new version of the Okapi plugin for OmegaT (i.e. okapiFiltersForOmegaT-1.13-1.45.0.jar), I have noticed behaviour that is inconsistent with the previous version and that breaks translations: with the new version, narrow no-break spaces seem to be embedded in paired tags.

So, for example, in the attached project, the translators (who had okapiFiltersForOmegaT-1.12-1.44.0.jar installed) produced the following two translations:

  1. Likes chicken, hates water 
    Mahilig sa manok, ayaw sa tubig
  2. If you see her, PLEASE call Diep at <g1>092 432 1234</g1>
    Kung makita nyo siya, PAKIUSAP tumawag sa Jeff at 0912 345 6789<g1></g1>

After updating the plugin to version 1.13-1.45, if I open the project, those two translations are lost, because new tags (in red below) appear in the source text:

  1. Likes chicken, hates water<g1></g1>
  2. If you see her, PLEASE call Diep at <g1></g1><g2>092 432 1234</g2>

These are Word files, so when I hover over the tags to see what they stand for, all I see is <run1> </run1>

This is just a sample, but this problem seems to affect more than 30% of the segments in the project (252 out of 800).

I can see the new tags appear around a narrow no-break space, e.g. "Likes chicken, hates water "

I cannot start using the new version of the plugin if this issue happens, as many segments will become untranslated.
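To illustrate why the existing translations are lost, here is a minimal sketch. It assumes, as a simplification, that a saved translation is found only when the extracted source text (tags included) matches exactly. The `saved_translations` dictionary and the `lookup` helper are hypothetical names for illustration, not OmegaT's actual API, and the position of the space inside the new tag pair is an assumption based on the <run1> </run1> tooltip.

```python
NNBSP = "\u202f"  # narrow no-break space

# Hypothetical store of translations, keyed by the exact source text
# as extracted with the 1.12-1.44.0 plugin.
saved_translations = {
    f"Likes chicken, hates water{NNBSP}": "Mahilig sa manok, ayaw sa tubig",
}


def lookup(source: str):
    """Return the saved translation for an exact source match, else None."""
    return saved_translations.get(source)


# Source text as extracted by the previous plugin version.
old_source = f"Likes chicken, hates water{NNBSP}"

# Source text as extracted by the new version (assumed: the space now
# sits inside a new pair of tags).
new_source = f"Likes chicken, hates water<g1>{NNBSP}</g1>"

print(lookup(old_source))  # found: the key is unchanged
print(lookup(new_source))  # not found: the key now contains extra tags
```

Under that assumption, any segment whose extracted source gains a tag pair no longer matches its saved translation, which is consistent with the segments appearing untranslated.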

Rationale for the request

Narrow no-break spaces are already exposed as part of the text, so I don't think the filter must use tags (or do anything special) to "expose" narrow no-break spaces in this way.

As far as I understand, tags stand for inline codes, not for regular characters (including any kind of whitespace). A narrow no-break space is just a space with some special properties: it does not mark a word boundary and it is narrow. In other words, it is in any case just a character, not an inline code.

It would be the translation editor's job (not the filter's) to let the user highlight different kinds of whitespace, as well as other invisible characters, if that's what the user wants. OmegaT does that with whitespace, non-breaking spaces, bidi markers, etc. We can add it for narrow no-break spaces too.
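As a rough sketch of what editor-side highlighting could look like, the snippet below replaces a few invisible characters with a bracketed label taken from the Unicode character database. The `reveal_invisibles` function and the `HIGHLIGHTED` set are hypothetical and only illustrate the idea; they do not reflect how OmegaT implements its own whitespace display.

```python
import unicodedata

# Assumed set of characters an editor might want to make visible.
HIGHLIGHTED = {
    "\u00a0",  # no-break space
    "\u202f",  # narrow no-break space
    "\u200e",  # left-to-right mark
    "\u200f",  # right-to-left mark
}


def reveal_invisibles(text: str) -> str:
    """Replace each highlighted character with its Unicode name in brackets."""
    parts = []
    for ch in text:
        if ch in HIGHLIGHTED:
            parts.append(f"[{unicodedata.name(ch)}]")
        else:
            parts.append(ch)
    return "".join(parts)


segment = "Likes chicken, hates water\u202f"
print(reveal_invisibles(segment))

# The narrow no-break space is an ordinary character in the
# "space separator" (Zs) category.
print(unicodedata.category("\u202f"))
```

The point of the sketch is that the segment text itself stays untouched: the markers exist only in the display layer, so no inline codes are needed.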

Apart from the fact that this change breaks backwards compatibility (so it's not possible to open existing OmegaT projects without potentially losing the translations of many segments), there are other reasons why inserting tags around characters seems a bad idea, namely:

Whatever characters (including different kinds of whitespace) are used in the source text are there for reasons that make sense only in terms of the source language's spelling rules, editorial policies, typographical conventions, etc., but not necessarily the target language's. Characters found in the source text might not be (and often are not) required in the target language. In other words, unless there are specific requirements, the characters in the source segment are often irrelevant with regard to what characters must be used in the translation.

For example, narrow no-break spaces might be used as thousand separators in figures, to separate double punctuation in French, between multipart abbreviations in German, to separate suffixes in Mongolian, etc. E.g.

  • 100 000 dollars.
  • On utilise des espaces « insécables » pour la double ponctuation.
  • Alle arten von boxen, z. B. geschenk-boxen.

(Using hex entities that would be:

  • 100&#x202F;000 dollars.
  • On utilise des espaces «&#x202F;insécables&#x202F;» pour la double ponctuation.
  • Alle arten von boxen, z.&#x202F;B. geschenk-boxen.)
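For reference, the round trip between the literal character and its hex entity can be sketched as follows. `to_hex_entities` is a hypothetical helper written for this illustration; `html.unescape` comes from the Python standard library. The sample string is the first example above, assuming the narrow no-break space sits between the digit groups.

```python
import html

NNBSP = "\u202f"  # narrow no-break space


def to_hex_entities(text: str) -> str:
    """Replace each narrow no-break space with its hex character reference."""
    return text.replace(NNBSP, "&#x202F;")


sample = f"100{NNBSP}000 dollars."

encoded = to_hex_entities(sample)  # entity form, safe to paste anywhere
decoded = html.unescape(encoded)   # back to the literal character

print(encoded)
print(decoded == sample)
```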

Incidentally, if I use the new filter to extract those three sentences, in none of them are the narrow no-break spaces surrounded by tags. It seems to happen only with trailing narrow no-break spaces or in a very particular context (before a new run).

It may be that none of the narrow no-break spaces found in the source text is required in the target text, either because the target language has different spelling rules or because the wording in the target language does not require no-break spaces at all, e.g. in Spanish:

  • 100.000 dólares
  • Los espacios "no separables" se utilizan para puntuación doble.
  • Todo tipo de cajas, como cajas de regalo.

Also, sometimes these narrow no-break spaces are not intentional: they are there because the authoring editor inserts them without the user knowing about it. When the text is then copied and pasted into another format (Word, in this case) for translation, all those invisible characters are carried over with the text and pollute the document unless there is a clean-up. I think that was the case with the document where I found this problem while testing the new filter.

For all those reasons I believe the correct behavior is what the previous version(s) of the filter did.

Even assuming that the narrow no-break space must be transferred to the translation and must somehow be handled as an inline code, those tags around the space do not help. The user can insert the tags, but that does not carry over the space with them. The narrow no-break space would have to be inserted as a separate action, regardless of whether there were paired tags around it or not. It would be different if there were a standalone tag that stood for the space itself, but that's not the case.

In the unlikely but possible event that the narrow no-break space is expected to be maintained in the target text as well (and should not be "accidentally deleted"), I would say that's bad design. It's the source content developer's job to separate text from layout. In any case, it shouldn't be done with tags, which would confuse the user and hamper the translation process.

Comments (6)

  1. jhargrave-straker

    @Denis Konovalyenko Can you take a look at this one? It causes a lot of problems with translation and TM matching.

  2. Denis Konovalyenko

    @Manuel Souto Pico

    The rules for cleaning up the non-complex script and complex script properties were reconsidered:

    The non-complex (b, i, sz) script and complex script (bCs, iCs, szCs) run properties clarification is based on the presence of detected font categories.

    Setting the bPreferenceAggressiveCleanup parameter to true should improve the merging of consecutive runs: the detected run font categories are used to identify the merge opportunities.

  3. yaowenjun
    Original text: "See 111 1-1". The text has a link, which was replaced with an <x> tag after being converted to XLIFF:
    <body>
    <trans-unit id="NFDBB2FA9-tu1" xml:space="preserve">
    <source xml:lang="en">see <x id="1"/></source>
    <seg-source><mrk mid="0" mtype="seg">see <x id="1"/></mrk></seg-source>
    <target xml:lang="zh"><mrk mid="0" mtype="seg">see <x id="1"/></mrk></target>
    </trans-unit>
    </body>
    
    How can I prevent it from being replaced?
    

    @Denis Konovalyenko @Manuel Souto Pico @Jim Hargrave
