XML filter does not extract resname if unit contains HTML block-level tags

Issue #1309 open
Manuel Souto Pico created an issue

I have a project where I need to translate ResX files and extracting the resname (name attribute) is a requirement in this project.

I am using the following (slightly tweaked) ResX filter parameters file (okf_xml@foo.fprm) to extract translatable content from a ResX file (but I had the same issue with other more generic XML filters):

<?xml version="1.0" encoding="UTF-8" standalone="no"?><its:rules xmlns:its="http://www.w3.org/2005/11/its" xmlns:itsx="http://www.w3.org/2008/12/its-extensions" xmlns:okp="okapi-framework:xmlfilter-options" xmlns:xlink="http://www.w3.org/1999/xlink" its:translate="no" version="1.0">

<!-- This is a set of rules to process .ResX files. Be aware that any text in Base64 entries
like text items declared in a list box, will not be processed using this method.
These rules can also be used on simple .resx files that have just string entries.  -->

 <its:translateRule selector="/root" translate="no"/>
 <its:translateRule itsx:idValue="../@name" selector="//data[not(@type) and not(starts-with(@name, '&gt;'))]/value" translate="yes"/>
 <its:translateRule itsx:idValue="../@name" selector="//data[starts-with(@name, 'SurveyQuestions.CognitiveAbilityTest.QuestionInstructions.Num_0')]/value" translate="no"/>
 <its:translateRule selector="//data[@mimetype]/value" translate="no"/>
 <its:translateRule selector="//data[substring(@name, string-length(@name) - string-length('.FieldName')+1)='.FieldName']/value" translate="no"/>
 <its:translateRule itsx:idValue="../@name" selector="//data[@name='$this.Text']/value" translate="yes"/>

 <!-- Localization notes -->
 <its:locNoteRule locNotePointer="../comment" locNoteType="description" selector="//data[not(@type) and not(starts-with(@name, '&gt;') or starts-with(@name, '$'))]/value"/>

<okp:codeFinder useCodeFinder="yes">#v1
count.i=3
rule0=(\{[^}\n]+?\})
rule1=\&lt;(/?)\w+[^&gt;]*?&gt;
rule2=\"(?![:,\n])([^\"]*?)\"(?=:)
</okp:codeFinder>
</its:rules>

When a <value> node contains only text, it works well. For example, given this data:

<data name="foo.bar" xml:space="preserve">
    <value>This is a sentence. This is another sentence.</value>
</data>

the extraction produces:

  • This is a sentence. [with resname foo.bar_0]
  • This is another sentence. [with resname foo.bar_0]

However, given the following data with block-level HTML tags inside the value:

<data name="foo.baz" xml:space="preserve">
    <value><ul><li>This is one sentence. This is another sentence.</li></ul></value>
</data>

the extraction produces the same segments but does not extract the resname attribute.

The ITSFilter class in Okapi seems to lose the name when the <value> node contains child elements instead of plain text.

I get this issue when using the XML (ResX) filter as part of the Okapi plugin for OmegaT. I attach a sample OmegaT project.

In case it is of interest to Okapi developers, @Briac Pilpré made a branch, https://bitbucket.org/briacp/omegat-plugin-jdk8-patched/branch/fix/capstan/incorrect-name-with-tags, containing a test, ITSFilterTest (xml, fprm), that demonstrates the behaviour. Since it is a self-contained unit test, it needs neither a running OmegaT nor a specific project.

Comments (6)

  1. Manuel Souto Pico reporter

    Here comes an update that @Briac Pilpré shared with me, about where and why this bug appears.

    When adding a new segment in net.sf.okapi.filters.its.ITSFilter.addTextUnit(Node, boolean, TextFragment), the context node passed to the method processTextUnit (l.1095) is always the last one evaluated (the innermost).

    For example, given the following ITS translateRule:

        <its:translateRule selector="//data/value" translate="yes" itsx:idValue="../@name"/>

    and the ResX file:

        <data name="SegmentInsideTag" xml:space="preserve">
            <value><li>This is a text within html tags.</li></value>
        </data>

    The idValue will be evaluated against the <li> node, not the <value> node as one would expect. This is also the case when trying to use its:idValueRule with the same XML content.

    This behaviour seems to belong to the core of Okapi (not the OmegaT plugin), so there's no way to guess the correct idValue inside the plugin. We need some help or advice from core Okapi devs.

    One potential fix would be to stop adding nodes to the ITS context that are deeper than the node returned by the selector. However, other ITS rules may still need to be evaluated further down the tree (e.g. its:withinText rules).

    @Jim Hargrave do you have any suggestion?
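
    For illustration, the its:idValueRule variant mentioned in this comment might look like the sketch below. It assumes an ITS 2.0 rules file (version="2.0"), which is not what the parameters file in the report declares. According to the description above, the idValue expression is still evaluated against the innermost node (<li>) rather than <value>, so this form is not expected to work around the problem.

    ```xml
    <its:rules xmlns:its="http://www.w3.org/2005/11/its" version="2.0">
      <!-- Mark the content of every data/value element as translatable. -->
      <its:translateRule selector="//data/value" translate="yes"/>
      <!-- Intended: take the id from the parent <data> element's name attribute.
           Reported behaviour: "../@name" is resolved from <li>, so no name is found. -->
      <its:idValueRule selector="//data/value" idValue="../@name"/>
    </its:rules>
    ```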

  2. YvesS

    Obviously this may not be enough, but you may add rules like:
    <its:translateRule itsx:idValue="../../../@name" selector="//data/value/ul/li" translate="yes"/>

    in the ITS file.

    This will get you the same output, but with the resname set.

    The obvious issue is that you may not know all the ways <ul><li> can be nested inside the <value>.

    Another option could be to treat the <ul> and <li> as within text. Not great either, but that would catch all cases.
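
    The second option could be sketched as follows. This is a minimal, untested rules file: the selector for the within-text rule is an assumption and would need to list whatever HTML elements actually occur in the project. The idea is that <ul> and <li> become inline codes instead of new context nodes, so the idValue on the <value> rule has a chance to apply. A likely side effect is that all the list items in one <value> end up in a single text unit.

    ```xml
    <?xml version="1.0" encoding="UTF-8"?>
    <its:rules xmlns:its="http://www.w3.org/2005/11/its"
               xmlns:itsx="http://www.w3.org/2008/12/its-extensions"
               version="1.0">
      <!-- Same extraction rule as in the report: the id comes from data/@name. -->
      <its:translateRule selector="//data/value" translate="yes" itsx:idValue="../@name"/>
      <!-- Assumption: list markup inside <value> is treated as part of the text,
           so it is extracted as inline codes rather than as nested elements. -->
      <its:withinTextRule selector="//data/value//ul | //data/value//li" withinText="yes"/>
    </its:rules>
    ```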
