OpenXML filter: formatting issues in Arabic translation (RTL)

Issue #933 open
Manuel Souto Pico created an issue

I use the okf_openxml filter to prepare a Word file. The translated document has formatting issues (e.g. expected formatting is missing and unexpected formatting is added).

Even if you don't know Arabic, you can see the two issues in the following screenshot:

As you can see in the screenshot, part of the translation is bolded whereas the source is not. I can confirm all tags have been handled correctly in the OmegaT project. I have used the option "Remove leading and trailing tags" (in Project settings > File filters) but I doubt that has any relevance, since I have the same issues if I don't.

Sample OmegaT project and Rainbow settings attached. Thanks in advance.
Let me know if you have any questions about Arabic.

Comments (19)

  1. Denis Konovalyenko

    @Manuel Souto Pico , could you please clarify if the issue happens under the latest release (1.39.0) and if you have a sample document to experiment with?

    Thanks!

  2. Manuel Souto Pico reporter

    Dear @Denis Konovalyenko , thanks for your help.

    I can’t really confirm whether the issue is reproducible in version 1.39.0 because I’m unable to produce the final translated documents in that version (due to the bug documented in ticket #932). That’s why I kept using version 38.

    You may check the folder /original for sample source files (and /done for the sample target files that show the issue) in the OmegaT project (see the .omt package, which is just a zip) I attached to this ticket.

  3. Denis Konovalyenko

    @Manuel Souto Pico , here is what I think about all that. I will start with the latter case. It looks like the Arabic font falls into the complex script fonts category, and the complex-script-specific formatting (w:bCs and w:iCs) is what gets applied in the “added bold and italic” case:

                <w:r>
                    <w:rPr>
                        <w:bCs/>
                        <w:iCs/>
                        <w:rtl/>
                    </w:rPr>
                    <w:t xml:space="preserve">[نعم] لأن جنى تستطيع استخدام بعض من الزدز ويبقى لديها 13 زدز.</w:t>
                </w:r>
       ...
                <w:r>
                    <w:rPr>
                        <w:bCs/>
                        <w:iCs/>
                        <w:rtl/>
                    </w:rPr>
                    <w:t xml:space="preserve">[لا]. </w:t>
                </w:r>
    

    Below is the corresponding place in differences:

    A possible solution would be using the bPreferenceAggressiveCleanup OpenXML filter parameter. If this option is set to true, the mentioned w:bCs, w:iCs as well as w:spacing, w:szCs and w:w run properties will be removed on processing the document. Please observe the round-tripped document diff below:
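    To make the effect of that option concrete, here is a rough Python illustration of what removing those run properties does to one of the runs quoted above. It is not Okapi's implementation; the function name strip_cleaned_props is hypothetical, and the list of removed properties is simply the one named in this comment.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Run properties that, according to the comment above, are dropped
# when bPreferenceAggressiveCleanup is true.
CLEANED = {f"{{{W}}}{name}" for name in ("bCs", "iCs", "spacing", "szCs", "w")}


def strip_cleaned_props(xml_text: str) -> str:
    """Remove the listed properties from every w:rPr (illustration only)."""
    ET.register_namespace("w", W)
    root = ET.fromstring(xml_text)
    for rpr in root.iter(f"{{{W}}}rPr"):
        for child in list(rpr):  # copy, because we mutate while iterating
            if child.tag in CLEANED:
                rpr.remove(child)
    return ET.tostring(root, encoding="unicode")


run = (
    f'<w:r xmlns:w="{W}"><w:rPr><w:bCs/><w:iCs/><w:rtl/></w:rPr>'
    "<w:t>[لا]. </w:t></w:r>"
)
print(strip_cleaned_props(run))  # only w:rtl is left inside w:rPr
```

    With w:bCs and w:iCs gone, the Arabic run no longer picks up the unexpected bold and italic, which matches the behaviour described for this option.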

    As for the first case with “italics removed”, below are the differences of the corresponding part:

    Please note that the w:i formatting applies to non-complex script characters only. As we have just seen, the Arabic content is complex script, which is why the italic formatting is not applied.

    I am afraid there is no workaround for this case at the moment (other than fixing the document structure manually).

    A possible programmatic solution that I can imagine now might consist of the following steps:

    1. The aggressive cleanup option must no longer apply to the w:iCs and w:bCs run properties.
    2. The extraction has to be performed with the following clarification of the document markup. The w:iCs or w:bCs run properties have to be removed if the original content does not contain complex script characters ([\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]). Also, the w:i and w:b run properties have to be removed if the original content contains complex script characters only.
    3. The merge has to be performed with the following clarification of the document markup. The w:iCs or w:bCs run properties have to be added if the translated run content contains complex script characters and only the w:i or w:b run properties are specified. Similarly, the w:i or w:b run properties have to be added if the translated content contains non-complex script characters and only the w:iCs or w:bCs run properties are specified.
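
    Steps 2 and 3 could be sketched roughly as follows in Python. This is only an illustration of the proposed rules, not filter code: run properties are modelled as a plain set of names, the function names are hypothetical, and "non-complex script characters" is assumed here to mean letters outside the character class quoted in step 2 (digits and punctuation are ignored).

```python
import re

# Character class quoted in step 2 of the proposal.
COMPLEX_SCRIPT = re.compile(
    r"[\u0590-\u074F\u0780-\u07BF\u0900-\u109F\u1780-\u18AF"
    r"\u200C-\u200F\u202A-\u202F\u2670-\u2671\uFB1D-\uFB4F]"
)
PAIRS = {"b": "bCs", "i": "iCs"}  # regular property -> complex-script twin


def has_complex_script(text: str) -> bool:
    return bool(COMPLEX_SCRIPT.search(text))


def has_non_complex_letters(text: str) -> bool:
    # Assumption: only letters count as non-complex script content.
    return any(ch.isalpha() and not COMPLEX_SCRIPT.match(ch) for ch in text)


def normalize_on_extraction(props: set, source: str) -> set:
    """Step 2: drop properties that cannot affect the source text."""
    result = set(props)
    for simple, cs in PAIRS.items():
        if not has_complex_script(source):
            result.discard(cs)
        elif not has_non_complex_letters(source):
            result.discard(simple)
    return result


def adjust_on_merge(props: set, translated: str) -> set:
    """Step 3: add the twin property the translated text needs."""
    result = set(props)
    for simple, cs in PAIRS.items():
        if has_complex_script(translated) and simple in props and cs not in props:
            result.add(cs)
        if has_non_complex_letters(translated) and cs in props and simple not in props:
            result.add(simple)
    return result
```

    For example, an English source run carrying only bCs, iCs and rtl would lose bCs and iCs on extraction, so the Arabic translation would not turn bold and italic; a run carrying only w:i would gain w:iCs on merge when the translation is Arabic.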

    cc @Chase Tingley

  4. Manuel Souto Pico reporter

    @Denis Konovalyenko Thank you so much for having looked into this.

    I appreciate your possible programmatic solution, which seems to be a generic fix (for any language pair). I'm not sure I understand all of it, but based on that and also based on other observations I explain below, I tend to think that the problem is not in the filter but in the source document (but it could be both).

    I have created a new file from scratch that has the same content and the same formatting as the document where the issue happens, and the formatting looks good in the target document. When I look at the document.xml both in the original and the merged files, it seems that both the regular formatting property w:i and the one for complex script w:iCs are present in both. See the following screenshot:

    Could it be that both must always be present but that w:i is the property used in the absence of w:bidi and w:rtl and, vice versa, that w:iCs is the property used when w:rtl is a sibling property and/or w:bidi is an ancestor? (and the same thing for bold and underlining).

    If that assumption is correct, then I don't understand why the original document has w:i without w:iCs or w:bCs and w:iCs without w:b and w:i respectively, but in any case it seems more a problem in the file than in the filter (at least from the point of view of the code, for the client this would be difficult to explain since the source document looks good before translation).

    If you can confirm that, then for this time I guess it would make sense to manually manipulate the file rather than waiting for a fix to the filter.

    If I understand correctly, in my particular case of an English-Arabic language pair, the (Perl) replacements needed in the source file would be:

    s~(?<!<w:i/>)<w:iCs/>~~g
    s~(?<!<w:u/>)<w:uCs/>~~g
    s~(?<!<w:b/>)<w:bCs/>~~g
    

    to remove the complex-script property if not preceded by the regular formatting, and

    s~<w:([ibu])/>(?!<w:\1Cs/>)~<w:\1/><w:\1Cs/>~g
    

    to add the missing complex-script property if the regular formatting stands alone (not followed by the complex-script equivalent property).

    When I run these replacements in the source file, the target document has the expected formatting.
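
    For anyone who prefers to script this preprocessing, a minimal Python sketch of the same two substitutions is given here. The names fix_run_properties and fix_docx are hypothetical. It assumes document.xml is not pretty-printed (no whitespace between the property elements) and only touches word/document.xml, so headers and footers stored in other parts are left alone. Underline is left out on the assumption that WordprocessingML defines no complex-script counterpart for w:u.

```python
import re
import zipfile


def fix_run_properties(xml: str) -> str:
    """Python port of the two substitutions above, for bold and italic."""
    for p in ("b", "i"):
        # Remove the complex-script property if the regular one does not precede it.
        xml = re.sub(rf"(?<!<w:{p}/>)<w:{p}Cs/>", "", xml)
    for p in ("b", "i"):
        # Add the complex-script property if the regular one stands alone.
        xml = re.sub(rf"<w:{p}/>(?!<w:{p}Cs/>)", f"<w:{p}/><w:{p}Cs/>", xml)
    return xml


def fix_docx(src_path: str, dst_path: str) -> None:
    """Copy a .docx, rewriting only word/document.xml."""
    with zipfile.ZipFile(src_path) as zin, \
            zipfile.ZipFile(dst_path, "w", zipfile.ZIP_DEFLATED) as zout:
        for item in zin.infolist():
            data = zin.read(item.filename)
            if item.filename == "word/document.xml":
                data = fix_run_properties(data.decode("utf-8")).encode("utf-8")
            zout.writestr(item, data)
```

    Called as fix_docx("source.docx", "source_fixed.docx") (placeholder file names), this writes a preprocessed copy and leaves the original file untouched.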

    I’d be grateful for some confirmation that my understanding (and the proposed replacements) makes sense and will not backfire in other scenarios.

  5. Denis Konovalyenko

    @Manuel Souto Pico , correct, the proposed solution should work for all combinations.

    Could it be that both must always be present but that w:i is the property used in the absence of w:bidi and w:rtl and, vice versa, that w:iCs is the property used when w:rtl is a sibling property and/or w:bidi is an ancestor? (and the same thing for bold and underlining).

    If that assumption is correct, then I don't understand why the original document has w:i without w:iCs or w:bCs and w:iCs without w:b and w:i respectively, but in any case it seems more a problem in the file than in the filter (at least from the point of view of the code, for the client this would be difficult to explain since the source document looks good before translation).

    As you have seen, there can be any combination of w:i and w:iCs (w:b and w:bCs) in the wild (it may even vary depending on the software that saved the document), and no restriction is implied on that. It is not about w:rtl or w:bidi being present (they are specified to make the text appear with the RTL alignment) but rather about the characters present and the fonts used for rendering them (you can get more information on this if you check the w:rFonts run property in the spec).

    So, I can only confirm that if the rendering needs to be as accurate as possible, the original document must have both bold (w:b and w:bCs) or italic (w:i and w:iCs) properties specified, and it mustn’t be filtered with the bPreferenceAggressiveCleanup option set to true.
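
    A quick way to see whether a document meets that condition is to list the runs where only one half of a pair is present. The following Python sketch does that under simplifying assumptions: it looks at direct run properties only (no style inheritance), treats any occurrence of a property as "set" even if it carries w:val="false", and the name unpaired_runs is hypothetical.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
PAIRS = (("b", "bCs"), ("i", "iCs"))


def unpaired_runs(xml_text: str) -> list:
    """Return (run text, missing property) for runs with half a pair."""
    root = ET.fromstring(xml_text)
    findings = []
    for run in root.iter(f"{{{W}}}r"):
        rpr = run.find(f"{{{W}}}rPr")
        if rpr is None:
            continue
        present = {child.tag.rpartition("}")[2] for child in rpr}
        text = "".join(t.text or "" for t in run.iter(f"{{{W}}}t"))
        for simple, cs in PAIRS:
            if (simple in present) != (cs in present):
                findings.append((text, cs if simple in present else simple))
    return findings


sample = (
    f'<w:p xmlns:w="{W}">'
    "<w:r><w:rPr><w:i/></w:rPr><w:t>Yes</w:t></w:r>"
    "<w:r><w:rPr><w:bCs/><w:iCs/><w:rtl/></w:rPr><w:t>No</w:t></w:r>"
    "<w:r><w:rPr><w:b/><w:bCs/></w:rPr><w:t>OK</w:t></w:r>"
    "</w:p>"
)
for text, missing in unpaired_runs(sample):
    print(f"run {text!r} lacks w:{missing}")
```

    Runs reported this way are the ones whose formatting is likely to change once the text switches between complex and non-complex script.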

  6. Manuel Souto Pico reporter

    Thank you so much, @Denis Konovalyenko. For the time being then I won’t rely on a new version of the filter and I will try to preprocess my source files. I understand that if I use a filter configuration file, bPreferenceAggressiveCleanup must be set to false.

    Where can I find “the spec” to read about w:rFonts? Thanks

  7. Denis Konovalyenko

    @Manuel Souto Pico , the specification is an ECMA standard (here is the direct link). I think it can also be found somewhere in Microsoft resources (please look for “MS-OE376”). Yet another variant (has to be the same thing) is to find the spec as an ISO standard - ISO/IEC 29500-1:2016.

  8. Denis Konovalyenko

    @Manuel Souto Pico , a follow-up issue #947 was created. Would it be possible to close this one then?

    Thanks!

  9. Manuel Souto Pico reporter

    Dear @Denis Konovalyenko , sorry for my late reply (with holidays and other tasks I couldn’t get back to this earlier). Yes, this ticket can be closed. However, would it be possible to raise the priority of issue #947? Please let me know whether there’s anything I can do to speed it up.

    I would also like to leave some additional feedback. The hacks documented above help to produce a target Word file that has the expected formatting (bold, italics, etc.) in the running text. However, I still have issues with styles: in the target document, many headers have different styling (mostly different font size, but sometimes also bold is visually missing, although the bold icon seems active). Therefore it needs to be fixed: I change formatting of the header and update the style, and in most cases all other headers with the same style get updated too. Not sure this is part of the same issue, and whether it is something in the filter or in the files. I can reproduce it using the OmegaT OOXML filter too.

  10. Manuel Souto Pico reporter

    Update: I can confirm this problem is reproducible in version 42 as well (okapiFiltersForOmegaT-1.10-1.42.0.jar).

  11. Manuel Souto Pico reporter

    I have noticed that I can reproduce the issue (i.e. bold and italics are missing in the target document) only when the bPreferenceAggressiveCleanup.b option is set to true in the okf_openxml@foo.fprm filter configuration file of the OmegaT project. I hope that helps.
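
    A small Python check for that setting could look like the sketch below. It assumes the .fprm file is a plain key=value text file with comment lines starting with #, and it treats a missing key as "disabled", which is an assumption about the default rather than something confirmed in this thread. The function name is hypothetical.

```python
def aggressive_cleanup_enabled(fprm_text: str) -> bool:
    """Report whether bPreferenceAggressiveCleanup.b is set to true."""
    for line in fprm_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        key, sep, value = line.partition("=")
        if sep and key.strip() == "bPreferenceAggressiveCleanup.b":
            return value.strip().lower() == "true"
    return False  # assumption: an absent key means the option is off


# Assumed file layout, for illustration only.
example = "#v1\nbPreferenceAggressiveCleanup.b=true\n"
print(aggressive_cleanup_enabled(example))
```

    Running it over the project's okf_openxml@foo.fprm before translation starts would flag configurations likely to lose bold and italics in the target document.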
