Limitation in the okf_html rules?

Issue #801 new
Mihai Nita created an issue

Use this in an .html file and run tikal -x on it:

<p>Some text in p element, VISIBLE.</p>
<p translate="no">Some text in p element, @translate=no, IGNORE.</p>

I would expect that the second string (containing "IGNORE") would not be extracted (because of the translate="no"), but it is.

If instead of <p> tags I use some tag not listed in the config (let's say div) it works as expected (the second string is not extracted):

<div>Some text in div element, VISIBLE.</div>
<div translate="no">Some text in div element, @translate=no, IGNORE.</div>

Comments (5)

  1. Mihai Nita reporter

    The yaml config contains 2 maps, one for attributes, one for elements.

    Let's focus on the elements one (the attribute one is similar, but there are a lot fewer entries)

    The key of the map is either the element tag, or a regexp. So they are unique. The relevant map entries are:

    p:
        ruleTypes: [TEXTUNIT]
        idAttributes: [id]
        elementType: paragraph
    
    ...
    
    .*:
        ruleTypes: [EXCLUDE]
        conditions: [translate, EQUALS, 'no']
    

    The in-memory representation contains separate maps, one for "normal" entries, and one for regexp. The lookup by html tag checks the exact map first, and the regexp map second, if nothing is found in the main map.

    Because there is a match for p the regexp entry (with rules for translate='no') is never found.

  2. Mihai Nita reporter

    The most relevant code (mostly in net.sf.okapi.filters.abstractmarkup.config.YamlConfigurationReader)

    There is almost 100% similar code for attribute rules.

    private Map<String, Object> elementRules;
    private Map<String, Object> elementRegexRules;
    private Map<String, Pattern> elementCompiledRegexRules;
    
    public Map getElementRule(String ruleName) {
        Map rule = (Map)elementRules.get(ruleName);
    
        // check our element regex patterns
        if (rule == null && !elementRegexRules.isEmpty()) {
            for (Map.Entry<String, Object> e : elementRegexRules.entrySet()) {
                Matcher m = elementCompiledRegexRules.get(e.getKey()).matcher(ruleName);
                if (m.matches()) {
                    rule = (Map) e.getValue();
                }
            }
        }
        return rule;
    }
    
    public List<Map> getRules(String ruleName) {
        List<Map> rules = new LinkedList<>();
        Map rule = getElementRule(ruleName);
        if (rule != null) {
            rules.add(rule);
        }
        rule = getAttributeRule(ruleName);
        if (rule != null) {
            rules.add(rule);
        }
    
        return rules;
    }
    

    So if we find something in the “exact” map we ignore the regexp one completely.
    Then we do the same over elements, then attributes, ending up with one rule, really.

    Meaning that because we have “exact” rules for p: or td: or li: then '.*': ruleTypes: [EXCLUDE] conditions: [translate, EQUALS, 'no'] will not match on any of these
    (and on any other element explicitly listed, which are many, and the most important)

    That is currently the only regexp rule. But other people using Okapi might already have some custom configurations with more regexp rules, I don’t know.

    Ideas:

    Please ignore “1. Quick fixes”, see my last comment 😞 That comment might even invalidate the “Big refactoring” idea. I don’t know yet.

    1. Quick fixes

    a. Change the rules to always lookup both the “exact” and the “regexp” map, and the have the regexp rules override the exact ones

    But is it safe to do that? For the current regexp rule yes (there is only one, the translate="no" one), but if someone adds some more regexp stuff, then maybe not?

    b. Same as above, but add another boolean flag (overrideExactRule or something like that) and only override the exact rules if the flag is set.

    So if one adds a new regexp rule they can choose the current behavior or the old one by setting (or not) the flag.

    2. Bigger refactoring

    a. List (to preserve the rule order), containing all the rules (exact and regexp, attribute and elements)
    We iterate through all the rules, one by one, and the last one “wins”

    b. Move the decision on “who wins” out of the configuration reader.

    Have the lookup reader return all the matching rules, in the order they were found, and let the filter decide.

    I think I would like a refactoring, but in the short run I would rather go with 1.b
    Do you thing that is safe enough?

    Thank you,
    Mihai

    P.S. the options are exclusive alternatives. It is 1.a OR 1.b OR 2.a OR 2.b

  3. Mihai Nita reporter

    I also find the yaml based configuration quite unreadable, but I assume this is me, unfamiliar with yaml.

    For example is there a way to say: if (element.translate="no" OR element.class MATCHES .*notranslate.* ) then ... EXCLUDE?

    Basically

    .*:
      ruleTypes: [EXCLUDE]
      conditions: [class, MATCHES, '.*notranslate.*']
    
    '.*':
      ruleTypes: [EXCLUDE]
      conditions: [translate, EQUALS, 'no']
    

    With the 2.a refactoring that would work, because this would be a set and the .* rule can exist more than once.

    But is there a way to combine the conditions with the current implementation?

    I don’t know if this is a real limitation in the current code or a limitation in my yaml understanding :-)

    Thank you,
    Mihai

  4. Mihai Nita reporter

    Please ignore the “possible fixes” above.

    The fix is not as simple as 1.a or 1.b.

    A regexp like .*: will match everything and “swallow” all the other rules, because the “find the rule” part is separated from “check if the condition is true” part.

    So in getElementRule in YamlConfigurationReader always matches based on the ruleName, even if later on it fails the condition. But by then it is too late to go back and ask for the non-regexp rule… :-(

    Back to the drawing board…

  5. Log in to comment