The markdown filter does not handle html tags properly.

Issue #928 resolved
Mihai Nita created an issue

One would expect markdown and the equivalent HTML constructs to produce the same extracted events (including codes in TextUnits), but they do not.

For example these two lines should produce the same events:

This is a [link](foo.html).

This is a <a href="foo.html">link</a>.

and same for these two lines:

This is a **bold**.

This is a <b>bold</b>.

The open-close HTML tags produce 2 standalone codes instead of an open-close pair.
The link is even worse: it produces 2 TextUnits, one with "This is a" and the second one with "<x/>link<x/>".

See attached files.

Comments (13)

  1. Mihai Nita reporter

    A quick reading of the code seems to indicate that the flexmark parser is invoked to generate tokens, then each html token is passed to the html subfilter.

    The tokens for the link above would be something like:

    [
        {text, "This is a"}
        {html, "<a href=...>"} => pass to the html subfilter
        {text, "link"}
        {html, "</a>"} => pass to the html subfilter
        {text, "."}
    ]
    

    This means that open-close tags (like <a href='foo'> and </a>) will always be passed to the HTML subfilter separately, so they will never generate correct open-close Codes.
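
A minimal sketch of that consequence, in Python rather than Okapi's Java. Nothing here is Okapi or flexmark code: the token list mirrors the one above, and `subfilter_html` is a hypothetical stand-in for a subfilter call that receives one tag at a time.

```python
from typing import List, Tuple

Token = Tuple[str, str]  # (kind, content), mirroring the token list above


def subfilter_html(fragment: str) -> str:
    """Hypothetical stand-in for a subfilter that sees a single tag.

    It has no memory of earlier fragments, so it cannot know that "</a>"
    closes an earlier "<a ...>". Every tag becomes a standalone placeholder.
    """
    return "<x/>"


def extract(tokens: List[Token]) -> str:
    """Process tokens one by one, sending each html token to the subfilter."""
    parts = []
    for kind, content in tokens:
        parts.append(subfilter_html(content) if kind == "html" else content)
    return "".join(parts)


tokens = [
    ("text", "This is a "),
    ("html", '<a href="foo.html">'),
    ("text", "link"),
    ("html", "</a>"),
    ("text", "."),
]
print(extract(tokens))  # This is a <x/>link<x/>.
```

Because each call is independent, pairing information has to be kept somewhere outside the per-tag subfilter call.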

  2. Qabiria

    Related to this, if I use Translation Kit Creation in Rainbow and try to convert a markdown file to XLIFF using the default HTML subfilter, the href attribute is not extracted. The link is recognized, but the actual content of the href attribute (a URL) is not there.

    This is the content of the md file I’m trying to convert, if that helps. Nothing fancy, as you can see.

    ---
    title: 'Traduzioni per il marketing'
    image: traduzione.png
    alt: 'Collage con uomo e autobotte'
    ---
    
    # Traduzioni per il marketing internazionale
    
    <p class="font-20">Ti aiutiamo a vendere allestero. Grazie al <a href="/perche-qabiria">Sistema Q</a>, il tuo messaggio va dritto al punto. Non importa quanto lontano tu voglia diffonderlo.</p> 
    
    [Contattaci](/contattaci?classes=btn,btn-primary,btn-rounded,mb-2)
    

  3. Mihai Nita reporter

    The markdown filter should not need an HTML subfilter, should recognize HTML tags on its own.

    In HTML (at least the way Okapi treats it by default) the URLs are not localizable.
    You would need a custom configuration.

    You can see that here:
    okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/nonwellformedConfiguration.yml

    The href is listed under writableLocalizableAttributes.
    To make it translatable you would need a custom HTML configuration that lists it under translatableAttributes.

    I also notice that you have a section delimited by ---, which the Markdown filter calls “YAML Header”.
    You might want to make that localizable too (by default it is not)

  4. Mihai Nita reporter

    I don’t see a way to attach files to an existing issue, only to a new one :-(

    So I will try to show here what to do:

    1. Custom HTML filter config

    Copy the file okapi/filters/html/src/main/resources/net/sf/okapi/filters/html/nonwellformedConfiguration.yml to okf_html@localizable-href.fprm

    Apply the following differences:

    179,180c179
    <     translatableAttributes: [title, accesskey, download]
    <     writableLocalizableAttributes: [href]
    ---
    >     translatableAttributes: [title, accesskey, download, href]
    201c200
    <     writableLocalizableAttributes: [href]
    ---
    >     translatableAttributes: [href]
    256,257c255,256
    <     translatableAttributes: [title, alt]
    <     writableLocalizableAttributes: [href, src]
    ---
    >     translatableAttributes: [title, alt, href]
    >     writableLocalizableAttributes: [src]
    

    As you can see, it basically moves the href attributes to translatableAttributes.

    2. Custom markdown config

    I’ve named it okf_markdown@localizable-href.fprm

    #v1
    useCodeFinder.b=false
    translateUrls.b=false
    urlToTranslatePattern=.+
    translateCodeBlocks.b=true
    translateInlineCodeBlocks.b=true
    translateHeaderMetadata.b=true
    translateImageAltText.b=true
    codeFinderRules.count.i=1
    codeFinderRules.rule0=\{\{[^}]+\}\}
    codeFinderRules.sample={{#test}} handle bar test {{/test}}$0a${{stand-alone handle bar}}$0a$
    codeFinderRules.useAllRulesWhenTesting.b=true
    htmlSubfilter=okf_html@localizable-href
    

    The relevant differences from default are:

    • htmlSubfilter=okf_html@localizable-href (line 13) (the name should of course match the HTML custom config file)

    • translateHeaderMetadata.b=true (line 7) (translate the Yaml header, if you want)


    Instead of a custom HTML configuration you can change translateUrls.b to true (line 3) and tinker with urlToTranslatePattern (line 4).

    It might be simpler in this case, and would work pretty well.

    But in general you can get better control with a custom HTML config.

  5. Mihai Nita reporter

    I’ve tried translateUrls.b and it did not work as I expected.

    It extracted (/contattaci?classes=btn,btn-primary,btn-rounded,mb-2) for translation, but not the href

    So depending on what you want to translate, you might need to enable both.

    But in general links should not be localizable, at least not by translators.
    Should be done algorithmically, by code.
    This is why Okapi makes them writableLocalizableAttributes and not simply non-localizable.


    To summarize: not the same issue.

  6. Qabiria

    Thanks for your help, Mihai. And what about the original issue? Any chance it can be fixed? It’s kind of a show-stopper for us, because we have a legacy TM, properly segmented, that becomes useless if the Markdown filter produces separate segments where there are links.

  7. Kuro Kurosaka (BH Lab)

    @Mihai Nita, I’d like to respond to your earlier comment “The markdown filter should not need an HTML subfilter, should recognize HTML tags on its own.” If my memory is correct, the Markdown filter, and the Flexmark parser that it uses, are written to handle the Markdown dialect described as CommonMark. It has many complex and subtle rules. For example, Example 138 says that *foo* in the line <del>*foo*</del> must be interpreted as markdown text and displayed as “foo” in emphasized form. That is why Flexmark treats the opening HTML tag and the closing tag as separate tokens, rather than treating the entire line as HTML text. You could argue that the Flexmark parser should handle HTML in addition to Markdown proper, but that would make the parser very complex and very large, probably more than a single contributor to the flexmark project can or is willing to handle.

  8. Chase Tingley

    Kuro is correct about the complexity here: the correct extraction of <del>*foo*</del> would be <g id="1"><g id="2">foo</g></g>. Passing the whole thing to the subfilter runs into the same problem that the subfiltering step already has to deal with, which is that some codes (id="2", in this case) are generated by the parent filter while others are generated by the child. It might be easier for the markdown filter to parse out the tags from flexmark’s HTML events and maintain its own tag stack, but then we would lose the configuration layer from the subfilter.

  9. Mihai Nita reporter

    I am definitely not denying the complexity.

    It is reasonable to expect that ...**foo**... and ...<b>foo</b>... generate the same thing.
    And it is not a disaster if it generates ...<x/>foo<x/>....

    But generating 2 text units from This is a <a href="foo.html">link</a>. is bad, it is preventing a good quality translation.

    I don’t know what the solution is, I don’t see an easy one.
    And I am not saying that it is on Kuro to fix.


    My comment about the markdown filter not needing an HTML subfilter was about this:

    I use Translation Kit Creation in Rainbow and try to convert a markdown file to XLIFF using the default HTML subfilter

    Since “by definition” the markdown format allows for HTML tags, it is the job of the markdown filter to call the HTML subfilter (as it already does)
    Or implement its own HTML parsing, or invoke some magic to deal with the HTML tags.

    But it is not the job of the user of the markdown filter. It should just happen by default (and it does).

    That’s what I meant with “The markdown filter should not need an HTML subfilter, should recognize HTML tags on its own.”

    When you look “from the outside” there should be no need for a subfilter.

    I probably misinterpreted what Qabiria said.
    I understood “convert a markdown file to XLIFF using the default HTML subfilter”
    as using the default HTML subfilter “in addition to” the markdown filter, which already has its own HTML subfilter.
    The meaning was probably “convert a markdown file to XLIFF using the default configuration of the HTML subfilter”
    (the markdown filter allows one to provide custom configurations for the html and the yaml subfilters, which is very nice)

  10. Chase Tingley

    Alec’s PR #572 does not fully fix HTML handling in Markdown, but it fixes the cases identified here: it causes inline HTML tag pairs to be correctly treated as paired codes. We are still subfiltering HTML content one tag at a time (to avoid the problem Kuro describes), but the markdown filter now tracks tag pairing for HTML inline elements separately, so it is able to pair them correctly.
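
The idea of tracking tag pairing separately from the per-tag subfilter calls can be sketched as follows. This is an illustration in Python, not the code from PR #572: the function name, the regular expression, and the OPENING / CLOSING / PLACEHOLDER labels are assumptions chosen for the example, and the tag regex is deliberately simplistic (it does not handle ">" inside attribute values).

```python
import re
from typing import List, Tuple

Token = Tuple[str, str]       # (kind, content)
Code = Tuple[str, int, str]   # (label, code id, original tag text)

# Groups: leading "/" for a closing tag, tag name, trailing "/" for self-closing.
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*?(/?)>")


def pair_inline_tags(tokens: List[Token]) -> List[Code]:
    """Label each html token, giving a matching open/close pair the same id."""
    codes: List[Code] = []
    open_stack: List[Tuple[str, int]] = []  # (tag name, index into codes)
    next_id = 1
    for kind, content in tokens:
        if kind != "html":
            continue
        m = TAG.fullmatch(content.strip())
        name = m.group(2).lower() if m else ""
        if m and m.group(1):  # closing tag: look for the nearest open tag
            for pos in range(len(open_stack) - 1, -1, -1):
                if open_stack[pos][0] == name:
                    # Tags opened after the match were never closed: demote them.
                    for _, idx in open_stack[pos + 1:]:
                        codes[idx] = ("PLACEHOLDER",) + codes[idx][1:]
                    opener_id = codes[open_stack[pos][1]][1]
                    del open_stack[pos:]
                    codes.append(("CLOSING", opener_id, content))
                    break
            else:  # no opener found: the closing tag stands alone
                codes.append(("PLACEHOLDER", next_id, content))
                next_id += 1
        elif m and not m.group(3):  # opening tag: remember it optimistically
            open_stack.append((name, len(codes)))
            codes.append(("OPENING", next_id, content))
            next_id += 1
        else:  # self-closing or unparseable tag
            codes.append(("PLACEHOLDER", next_id, content))
            next_id += 1
    for _, idx in open_stack:  # openers that never found a closing tag
        codes[idx] = ("PLACEHOLDER",) + codes[idx][1:]
    return codes


link = [
    ("text", "This is a "),
    ("html", '<a href="foo.html">'),
    ("text", "link"),
    ("html", "</a>"),
    ("text", "."),
]
for label, code_id, tag in pair_inline_tags(link):
    print(label, code_id, tag)
# OPENING 1 <a href="foo.html">
# CLOSING 1 </a>
```

Each tag can still be handed to the subfilter on its own; only the pairing decision lives in the parent, which is why unmatched or self-closing tags fall back to standalone placeholders.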
