Markdown filter: Text with inline HTML tags becomes fragmented translated units

Issue #716 resolved
Kuro Kurosaka created an issue

Markdown allows embedded HTML elements such as:

Let's throw in a <b>tag Translatable</b> to see what happen Translatable

or

<a href="http://www.youtube.com/watch?feature=player_embedded&v=YOUTUBE_VIDEO_ID_HERE" target="_blank"><img src="http://img.youtube.com/vi/YOUTUBE_VIDEO_ID_HERE/0.jpg" alt="IMAGE ALT TEXT HERE" width="240" height="180" border="10" /></a>

Each of these should generate one trans-unit in XLIFF when extracted, but in reality they end up with multiple fragmented trans-units.

The first sample becomes 5 trans-units (only 3 of which contain translatable text):

  1. Let's throw in a
  2. <bx id="1"/>
  3. tag Translatable
  4. <ex id="1"/>
  5. to see what happen Translatable

The second sample becomes 4 trans-units (only 1 of which contains translatable text):

  1. <bx id="1"/>
  2. IMAGE ALT TEXT HERE
  3. <x id="1"/>
  4. <ex id="1"/>

This is likely because the filter uses the HTML subfilter to process inline HTML elements; more care needs to be taken when merging the events from the HTML subfilter.
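To illustrate the expected behavior, here is a minimal Python sketch of folding the fragments of the first sample into a single unit. It is not the Okapi implementation: `Fragment` and `merge_fragments` are hypothetical names, and the sketch simply reuses the `<bx/>`/`<ex/>` placeholders shown in the fragments above. A real XLIFF writer may well emit a different inline element.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    """Hypothetical stand-in for one event coming back from a subfilter."""
    kind: str   # "text", "open" or "close"
    value: str  # literal text, or the original HTML tag


def merge_fragments(fragments: List[Fragment]) -> str:
    """Fold a run of inline fragments into one segment with inline codes."""
    parts = []
    open_ids = []   # stack of ids for tags that are still open
    next_id = 1
    for frag in fragments:
        if frag.kind == "text":
            parts.append(frag.value)
        elif frag.kind == "open":
            open_ids.append(next_id)
            parts.append(f'<bx id="{next_id}"/>')
            next_id += 1
        elif frag.kind == "close":
            # A closing tag reuses the id of the most recently opened tag.
            parts.append(f'<ex id="{open_ids.pop()}"/>')
        else:
            raise ValueError(f"unknown fragment kind: {frag.kind}")
    return "".join(parts)


# The five fragments reported for the first sample.
first_sample = [
    Fragment("text", "Let's throw in a "),
    Fragment("open", "<b>"),
    Fragment("text", "tag Translatable"),
    Fragment("close", "</b>"),
    Fragment("text", " to see what happen Translatable"),
]
merged = merge_fragments(first_sample)
print(merged)
```

With this kind of merging, the whole sentence ends up in one unit and the tags travel with it as inline codes instead of becoming units of their own.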

Comments (5)

  1. Kuro Kurosaka reporter

    A first fix was attempted and the built binary was tested. The fix turned out to have a very strange side effect: every single quote character that follows a table cell containing <ul><li>item1</li></ul> becomes the HTML numeric entity &#39; in the merged file. The single quote characters before the UL element are fine.

    Input file to the "tikal.sh -x" command:

    | Expression with HTML | Text with a quote |
    | ------------- | ------------- |
    | Hello, <b>World</b>! | I'm still good, am I? |
    | <div>A div block</div> | It's getting complicated |
    | <ul><li>item1</li></ul> | It's bad here |
    
    And it's still bad here.
    

    Merged file made by the "tikal.sh -m" command:

    | Expression with HTML | Text with a quote |
    | ------------- | ------------- |
    | Hello, <b>World</b>! | I'm still good, am I? |
    | <div>A div block</div> | It's getting complicated |
    | <ul><li>item1</li></ul> | It&#39;s bad here |
    
    And it&#39;s still bad here.
    

    In the XLIFF file, the single quotes are kept as they are. So this must be happening somewhere in the merge step, most likely in the filter writer layer.

  2. Kuro Kurosaka reporter

    The strange side effect has been fixed by setting the MIME type text/x-Markdown in the MarkdownBuilder. Because it had never been set and remained null, the switching of Encoders within EncoderManager was not functioning properly. It is still not clear, however, why the version before the fix for issue 716 worked without a problem.
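One way a null MIME type could produce exactly this symptom is sketched below in Python. This is a guess at the mechanism, not the actual EncoderManager code: `EncoderSwitcher`, `render`, and both encoder functions are hypothetical, and the key assumption is that a missing MIME type leaves the previously selected encoder in place.

```python
def markdown_encode(text: str) -> str:
    # Assumption: Markdown output keeps quotes as they are.
    return text


def html_encode(text: str) -> str:
    # Assumption: the HTML encoder escapes quotes as numeric entities.
    return (text.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("'", "&#39;"))


ENCODERS = {"text/x-markdown": markdown_encode, "text/html": html_encode}


class EncoderSwitcher:
    """Hypothetical model: picks an encoder from the MIME type of each event."""

    def __init__(self):
        self.current = markdown_encode

    def select(self, mime_type):
        encoder = ENCODERS.get(mime_type.lower()) if mime_type else None
        if encoder is not None:
            self.current = encoder
        # A null or unknown MIME type leaves the previous encoder active.

    def encode(self, text: str) -> str:
        return self.current(text)


def render(markdown_mime):
    """Write three events: Markdown text, subfiltered HTML, Markdown text."""
    events = [
        (markdown_mime, "It's getting complicated"),
        ("text/html", "item1"),
        (markdown_mime, "It's bad here"),
    ]
    switcher = EncoderSwitcher()
    out = []
    for mime, text in events:
        switcher.select(mime)
        out.append(switcher.encode(text))
    return out


print(render(None))               # MIME type never set
print(render("text/x-Markdown"))  # MIME type set, as in the fix
```

Under these assumptions, quotes stay intact until the first HTML event, after which the HTML encoder is never switched back unless the Markdown MIME type is present. That matches the quotes before the UL element being fine and those after it being escaped.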
