Markdown filter: Neighboring mark ups break up text into two TUs

Issue #715 resolved
Kuro Kurosaka created an issue

When two pieces of text, each modified by inline Markdown syntax, are next to each other, the text is divided into two Text Units (internally), which produces two trans-units in XLIFF.

For example:

Okapi is a *easy-to-use* _localization_ **framework**

results in three trans-units (only source elements are shown for brevity):

<source xml:lang="en">Okapi is a <x id="1"/>easy-to-use<x id="2"/></source>
<source xml:lang="en"><x id="1"/>localization<x id="2"/></source>
<source xml:lang="en"><x id="1"/>framework<x id="2"/></source>

The cause of this is convoluted but known. The FlexMark Markdown parsing library builds a node tree that looks like this:

Paragraph
├Text("Okapi is a ")
├Emphasis("*", "*")
│└Text("easy-to-use")
├Text(" ")
├Emphasis("_", "_")
│└Text("localization")
├Text(" ")
└Emphasis("**", "**")
 └Text("framework")

MarkdownParser traverses this tree and produces a stream of MarkdownTokens. When it makes a token for a Text node, it tags the token as translatable if the text contains any non-whitespace character. In this case, the Text node between two Emphasis nodes is just one space, so the token is marked untranslatable.
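To make the rule concrete, here is a minimal sketch in Python (not the actual Okapi code, which is Java). The `Token` class and the `text_token` helper are hypothetical names used only for illustration; the only thing taken from the description above is the rule that a Text token is translatable when it contains a non-whitespace character.

```python
from dataclasses import dataclass


@dataclass
class Token:
    """Hypothetical stand-in for a MarkdownToken built from a Text node."""
    text: str
    translatable: bool


def text_token(text: str) -> Token:
    # Rule as described above: translatable only if the text
    # contains at least one non-whitespace character.
    return Token(text, translatable=bool(text.strip()))


# The Text nodes from the first part of the example tree.
tokens = [text_token(t) for t in ["Okapi is a ", "easy-to-use", " ", "localization"]]
flags = [t.translatable for t in tokens]
```

Under this rule the single space between the two Emphasis nodes is the only token flagged untranslatable, which is what triggers the split.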

MarkdownFilter's main loop reads the tokens one by one. When it sees an untranslatable token (for most token types), it ends the current Text Unit and starts a DocumentPart.
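The effect of that loop on the example can be sketched as follows. This is a simplified model in Python, not the MarkdownFilter implementation: the `Token` tuple, the two token kinds (`"text"` and `"inline_markup"`), and `split_into_text_units` are assumptions made for illustration. Inline markup is treated here as one of the token types that does not end a Text Unit, while an untranslatable text token does.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    kind: str           # hypothetical kinds: "text" or "inline_markup"
    text: str
    translatable: bool


def split_into_text_units(tokens: List[Token]) -> List[str]:
    """Group tokens into text units, ending a unit at untranslatable text."""
    units: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok.translatable or tok.kind == "inline_markup":
            # Translatable text and inline markup stay in the current unit.
            current.append(tok.text)
        elif current:
            # Untranslatable text (e.g. a lone space) closes the unit.
            units.append("".join(current))
            current = []
    if current:
        units.append("".join(current))
    return units


def markup(s: str) -> Token:
    return Token("inline_markup", s, False)


def text(s: str) -> Token:
    # Old rule: whitespace-only text is untranslatable.
    return Token("text", s, bool(s.strip()))


example = [
    text("Okapi is a "), markup("*"), text("easy-to-use"), markup("*"),
    text(" "),
    markup("_"), text("localization"), markup("_"),
    text(" "),
    markup("**"), text("framework"), markup("**"),
]
units = split_into_text_units(example)
```

With the whitespace-only tokens flagged untranslatable, `units` holds three entries, mirroring the three trans-units shown earlier.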

When anything other than spaces sits between the two marked-up pieces of text, this breakup does not happen.

It has been this way since at least M34, and probably since much earlier.

Comments (6)

  1. Sun Hang

    Another case:

    Click *Palettes* ![](images/GUID-EF085FB3-3CC0-4C3D-A575-C0AB50136E62-low.png) on the Factory ribbon and select *Asset Browser*.
    

    It breaks into 2 segments:

    <source>Click <ph id="1_14_ph" dataRef="d1"/>Palettes<ph id="2_14_ph" dataRef="d1"/></source>
    
    <source>images/GUID-EF085FB3-3CC0-4C3D-A575-C0AB50136E62-low.png<ph id="1_17_ph" dataRef="d1"/> on the Factory ribbon and select <ph id="2_17_ph" dataRef="d2"/>Asset Browser<ph id="3_17_ph" dataRef="d2"/>.</source>
    

    And it should be 1 segment only.

  2. Kuro Kurosaka reporter

    This has been fixed by introducing logic that determines whether a whitespace-only piece of text is translatable. The new rule is: if the whitespace follows inline markup (*, _, **, [, etc.) or a translatable piece of text, we assume the whitespace is part of the translatable text and mark it so. I am not sure this is bulletproof, but so far it seems to work for all the test cases I have (including the one in Sun Hang's comment).
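A rough sketch of the rule described in this comment, written in Python for brevity rather than taken from the actual fix. The function name `mark_translatable`, the token kinds, and the tuple layout are all hypothetical; only the decision rule (whitespace is translatable when it follows inline markup or translatable text) comes from the comment.

```python
from typing import List, Optional, Tuple


def mark_translatable(tokens: List[Tuple[str, str]]) -> List[Tuple[str, str, bool]]:
    """Tag (kind, text) tokens with a translatable flag.

    kind is a hypothetical label: "text", "inline_markup", or anything
    else for block-level or other non-inline tokens.
    """
    result = []
    prev_kind: Optional[str] = None
    prev_translatable = False
    for kind, text in tokens:
        if kind != "text":
            translatable = False
        elif text.strip():
            # Text with visible characters is always translatable.
            translatable = True
        else:
            # Whitespace-only text: translatable if it follows inline
            # markup or a translatable piece of text.
            translatable = prev_kind == "inline_markup" or prev_translatable
        result.append((kind, text, translatable))
        prev_kind, prev_translatable = kind, translatable
    return result


tagged = mark_translatable([
    ("text", "Okapi is a "), ("inline_markup", "*"),
    ("text", "easy-to-use"), ("inline_markup", "*"),
    ("text", " "),
    ("inline_markup", "_"), ("text", "localization"), ("inline_markup", "_"),
])
space_flag = tagged[4][2]
```

In this model the space after the closing `*` is tagged translatable, so a loop that ends a Text Unit on untranslatable text would no longer split the sentence there. Whitespace with nothing translatable before it stays untranslatable.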
