Markdown filter: Neighboring mark ups break up text into two TUs
When two pieces of text, each modified by inline Markdown syntax, are next to each other, the text is divided into two Text Units internally, which produces two trans-units in XLIFF.
For example:
Okapi is a *easy-to-use* _localization_ **framework**
results in three trans-units (only source elements are shown for brevity):
<source xml:lang="en">Okapi is a <x id="1"/>easy-to-use<x id="2"/></source>
<source xml:lang="en"><x id="1"/>localization<x id="2"/></source>
<source xml:lang="en"><x id="1"/>framework<x id="2"/></source>
The cause of this is convoluted but known. The FlexMark Markdown parsing library builds a node tree that looks like this:
Paragraph
├Text("Okapi is a ")
├Emphasis("*", "*")
│└Text("easy-to-use")
├Text(" ")
├Emphasis("_", "_")
│└Text("localization")
├Text(" ")
└Emphasis("**", "**")
 └Text("framework")
MarkdownParser traverses this tree and produces a stream of MarkdownTokens. When it creates a token for a Text node, it tags the token as translatable if the text contains any character other than whitespace. In this case, the Text node between two Emphasis nodes is just one space, so the parser marks its token as untranslatable.
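The translatability rule described above can be sketched as follows. This is only an illustration of the rule as stated in this report; the class and method names are hypothetical and are not taken from the Okapi source.

```java
// Hypothetical sketch of the rule described above; not the actual MarkdownParser code.
final class TranslatabilityRuleSketch {

    /** Returns true if the text contains at least one non-whitespace character. */
    static boolean isTranslatable(String text) {
        return text.chars().anyMatch(c -> !Character.isWhitespace(c));
    }

    public static void main(String[] args) {
        System.out.println(isTranslatable("Okapi is a ")); // prints true
        System.out.println(isTranslatable(" "));           // prints false
    }
}
```

Under this rule, the single space between two Emphasis nodes is always untranslatable, regardless of what surrounds it.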
MarkdownFilter's main loop reads the tokens one by one. When it sees an untranslatable token (for most token types), it ends the current Text Unit and starts a DocumentPart.
When there is anything other than spaces between the two marked-up pieces of text, this breakup does not happen.
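The splitting behavior can be reproduced with a small model of the loop. This is a simplified sketch, not the real MarkdownFilter: the `Token` class, the `split` method, and the assumption that inline markup stays inside the current unit (as inline codes do in the XLIFF above) are all illustrative.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model of the behavior described above; not Okapi's actual classes.
final class TextUnitSplitSketch {

    /** Simplified token: its text, whether it is inline markup, whether it is translatable. */
    static final class Token {
        final String text;
        final boolean inlineMarkup;
        final boolean translatable;

        Token(String text, boolean inlineMarkup, boolean translatable) {
            this.text = text;
            this.inlineMarkup = inlineMarkup;
            this.translatable = translatable;
        }
    }

    /** Text is translatable only if it has a non-whitespace character (the old rule). */
    static Token text(String s) {
        return new Token(s, false, !s.trim().isEmpty());
    }

    /** Inline markup is kept inside the current unit in this model. */
    static Token markup(String s) {
        return new Token(s, true, false);
    }

    /** Groups tokens into units; an untranslatable text token ends the current unit. */
    static List<String> split(List<Token> tokens) {
        List<String> units = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (Token t : tokens) {
            if (t.inlineMarkup || t.translatable) {
                current.append(t.text);
            } else if (current.length() > 0) {
                units.add(current.toString());
                current.setLength(0);
            }
        }
        if (current.length() > 0) {
            units.add(current.toString());
        }
        return units;
    }

    /** Token stream for: Okapi is a *easy-to-use* _localization_ **framework** */
    static List<Token> example() {
        return Arrays.asList(
            text("Okapi is a "), markup("*"), text("easy-to-use"), markup("*"),
            text(" "),
            markup("_"), text("localization"), markup("_"),
            text(" "),
            markup("**"), text("framework"), markup("**"));
    }

    public static void main(String[] args) {
        // Prints three units, matching the three trans-units in the report.
        for (String unit : split(example())) {
            System.out.println("[" + unit + "]");
        }
    }
}
```

In this model, replacing either `text(" ")` with something like `text(" and ")` keeps the neighboring pieces in one unit, which matches the observation that the breakup only happens when nothing but spaces separates the marked-up text.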
It has been this way since at least M34, and probably much earlier.
Comments (6)
-
reporter -
Another case:
Click *Palettes* ![](images/GUID-EF085FB3-3CC0-4C3D-A575-C0AB50136E62-low.png) on the Factory ribbon and select *Asset Browser*.
It breaks into 2 segments:
<source>Click <ph id="1_14_ph" dataRef="d1"/>Palettes<ph id="2_14_ph" dataRef="d1"/></source>
<source>images/GUID-EF085FB3-3CC0-4C3D-A575-C0AB50136E62-low.png<ph id="1_17_ph" dataRef="d1"/> on the Factory ribbon and select <ph id="217_ph" dataRef="d2"/>Asset Browser<ph id="3_17_ph" dataRef="d2"/>.</source>
It should be only 1 segment.
-
reporter - edited description
-
reporter This has been fixed by introducing logic that determines whether a whitespace-only piece of text is translatable. The new rule: if the whitespace follows inline markup (*, _, **, [, etc.) or a translatable piece of text, we assume the whitespace is part of the translatable text and mark it as such. I am not sure this is bulletproof, but so far it works for all the test cases I have (including the one in Hang's comment).
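A minimal sketch of the rule as described in this comment. The method name and its parameters are hypothetical, and the actual change in the fix may be structured differently.

```java
// Hypothetical sketch of the whitespace rule described above; not the actual fix.
final class WhitespaceRuleSketch {

    /**
     * Decides whether a text token is translatable.
     * Text with any non-whitespace character is always translatable.
     * Whitespace-only text is translatable only if the previous token was
     * inline markup or a translatable piece of text.
     */
    static boolean isTranslatable(String text,
                                  boolean prevWasInlineMarkup,
                                  boolean prevWasTranslatableText) {
        boolean hasNonWhitespace =
            text.chars().anyMatch(c -> !Character.isWhitespace(c));
        if (hasNonWhitespace) {
            return true;
        }
        return prevWasInlineMarkup || prevWasTranslatableText;
    }

    public static void main(String[] args) {
        // Space after a closing "*": now treated as part of the text.
        System.out.println(isTranslatable(" ", true, false));  // prints true
        // Space with no translatable context before it stays untranslatable.
        System.out.println(isTranslatable(" ", false, false)); // prints false
    }
}
```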
-
- changed status to resolved
Fixing issue #715 → <<cset d82ff108e3ab>>
-
Merged in ssikuro/okapi/fix_715 (pull request #242)
Fixing issue #715
Approved-by: Chase Tingley tingley+atlassian@gmail.com
→ <<cset afd87a400fa4>>