Markdown filter: Text with inline HTML tags becomes fragmented translated units
Markdown allows embedded HTML elements such as:
Let's throw in a <b>tag Translatable</b> to see what happen Translatable
or
<a href="http://www.youtube.com/watch?feature=player_embedded&v=YOUTUBE_VIDEO_ID_HERE" target="_blank"><img src="http://img.youtube.com/vi/YOUTUBE_VIDEO_ID_HERE/0.jpg" alt="IMAGE ALT TEXT HERE" width="240" height="180" border="10" /></a>
Each of these should generate one trans-unit in XLIFF when extracted, but in reality they end up with multiple fragmented trans-units.
The first sample becomes 5 trans-units (only 3 of which contain translatable text):
Let's throw in a
<bx id="1"/>
tag Translatable
<ex id="1"/>
to see what happen Translatable
The second sample becomes 4 trans-units (only 1 of which contains translatable text):
<bx id="1"/>
IMAGE ALT TEXT HERE
<x id="1"/>
<ex id="1"/>
This is likely because the HTML subfilter is used to process HTML inline elements; more care needs to be taken when merging the events that come back from the HTML subfilter.
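To make the expected result concrete, here is a minimal sketch of folding inline HTML tags into a single unit, using the same `bx`/`ex`/`x` placeholder style as the output above. It is not the Markdown filter's code: `to_single_unit`, `TAG`, and `VOID` are hypothetical names, the regex only handles simple tags, and attribute text such as `alt` stays inside the placeholder rather than being extracted.

```python
import re
from html import escape

# Matches an opening, closing, or self-closing HTML tag (simple cases only).
TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)\b[^>]*?(/?)>")
# Elements that never have a closing tag (a small illustrative subset).
VOID = {"br", "hr", "img", "input"}

def to_single_unit(line):
    """Return one source string with inline tags replaced by placeholders."""
    out = []
    open_tags = []  # stack of (tag name, id) for tags not yet closed
    next_id = 1
    pos = 0
    for m in TAG.finditer(line):
        # Text between tags stays in the same unit.
        out.append(escape(line[pos:m.start()], quote=False))
        closing, name, self_closing = m.group(1), m.group(2).lower(), m.group(3)
        if closing and open_tags and open_tags[-1][0] == name:
            # Closing tag reuses the id of the innermost matching opening tag.
            out.append('<ex id="%d"/>' % open_tags.pop()[1])
        elif closing or self_closing or name in VOID:
            # Standalone tag (or unmatched closing tag): isolated placeholder.
            out.append('<x id="%d"/>' % next_id)
            next_id += 1
        else:
            open_tags.append((name, next_id))
            out.append('<bx id="%d"/>' % next_id)
            next_id += 1
        pos = m.end()
    out.append(escape(line[pos:], quote=False))
    return "".join(out)
```

With this approach the first sample stays one unit, with the `b` element reduced to a `bx`/`ex` pair in the middle of the sentence, instead of five separate units.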
Comments (5)
-
reporter -
reporter The strange side effect has been fixed by setting the MIME type text/x-Markdown in the MarkdownBuilder. Because it had not been set at all and remained null, the switching of encoders within EncoderManager was not functioning properly. It is still not clear, however, why the version before the fix for issue 716 worked without a problem.
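One plausible reading of this behavior can be shown with a toy model. Everything here is hypothetical: `ToyEncoderManager` is not Okapi's EncoderManager, the assumption that a null MIME type leaves the previous encoder active is a guess, and `&#39;` is only an illustrative entity.

```python
class ToyEncoderManager:
    """Hypothetical stand-in for an encoder manager; not Okapi's class."""

    def __init__(self):
        # MIME type -> function that escapes text for that output format.
        self.encoders = {
            "text/html": lambda s: s.replace("'", "&#39;"),
            "text/x-Markdown": lambda s: s,
        }
        self.current = None

    def update_encoder(self, mime_type):
        # Assumption: a missing MIME type leaves the current encoder in place.
        if mime_type is None:
            return
        self.current = self.encoders[mime_type]

    def encode(self, text):
        return text if self.current is None else self.current(text)


def merge(units):
    """units: list of (mime_type, text) pairs in document order."""
    manager = ToyEncoderManager()
    out = []
    for mime_type, text in units:
        manager.update_encoder(mime_type)
        out.append(manager.encode(text))
    return "".join(out)


html_part = ("text/html", "<ul><li>item1</li></ul>")
# MIME type left unset on the Markdown parts: the HTML encoder sticks.
before_fix = merge([(None, "it's "), html_part, (None, " it's")])
# MIME type set on the Markdown parts: the encoder switches back.
after_fix = merge([("text/x-Markdown", "it's "), html_part,
                   ("text/x-Markdown", " it's")])
```

In this model only the text after the HTML part is escaped when the MIME type is missing, which matches the observation that quotes before the UL element were unaffected.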
-
reporter - changed status to resolved
Fixing issue 716. Also adding CDATA support and enhanced debug-level logging of Text Unit.
→ <<cset 0ed76f13ecb0>>
-
reporter Merged in ssikuro/okapi/fix_716_720 (pull request #239)
Fixing issue #716 and implementing the feature specified in issue #720
Approved-by: Chase Tingley tingley+atlassian@gmail.com
→ <<cset a42df7d5189d>>
-
reporter Resolved in pull request #239
The first fix attempt has been made, and the built binary was tested. It turned out that this fix has a very strange side effect: every single quote character that appears after a table cell containing
<ul><li>item1</li></ul>
is replaced by its HTML numeric entity in the merged file. The single quote characters before the UL element are fine.
Input file to the "tikal.sh -x" command:
Merged file made by the "tikal.sh -m" command:
In the XLIFF file, the single quotes are kept as they are, so this must be happening somewhere in the merge step, most likely in the filter writer layer.
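A round-trip check along these lines could catch this kind of regression. This is only a sketch: `introduced_entities` is a hypothetical helper, and the two sample strings are invented, not the actual input and merged files.

```python
import re
from collections import Counter

# Decimal or hexadecimal numeric character references, e.g. &#39; or &#x27;.
NUMERIC_ENTITY = re.compile(r"&#(?:\d+|[xX][0-9A-Fa-f]+);")

def introduced_entities(original, merged):
    """List numeric entities that occur more often in merged than in original."""
    before = Counter(NUMERIC_ENTITY.findall(original))
    after = Counter(NUMERIC_ENTITY.findall(merged))
    # Counter subtraction keeps only the positive differences.
    return sorted((after - before).elements())

# Invented sample content for illustration.
original = "| it's | <ul><li>item1</li></ul> | it's |"
merged = "| it's | <ul><li>item1</li></ul> | it&#39;s |"
extra = introduced_entities(original, merged)
```

An empty result means the merge step did not add any numeric entities that were absent from the input.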