tikal manual round trip test fails on a markdown file that has an emphasized paragraph with a full external reference
This bug was found during discussion https://groups.google.com/g/okapi-devel/c/sXySLd5pYIM
The attached test.md file consists of this one line:
*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*
which is believed to be an emphasized paragraph that contains Markdown’s Full Reference Link construct.
When tikal is run for a round trip test, that is:
tikal.sh -x test.md
tikal.sh -m test.md.xlf
The reconstructed test.out.md differs from the original file:
*Note: the fourth item uses the Unicode character for [Roman numeral four][2*.]
Notice “[2].*” at the end of the line has become “[2*.]”.
It is observed that the “*” characters are represented by <bx id="1"/> and <ex id="1"/> in test.md.xlf, as in:
<source xml:lang="en"><bx id="1"/>Note: the fourth item uses the Unicode character for <g id="2">Roman numeral four</g><g id="3"></g><ex id="4"/>.<ex id="1"/></source>
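The pairing of the codes in that line can be checked mechanically. The following is a minimal Python sketch, not part of Okapi or Tikal; it assumes, as the wording of this report does, that a `<bx>` and an `<ex>` with the same `id` form a pair. The function name `unmatched_codes` is made up for illustration.

```python
import xml.etree.ElementTree as ET

# The <source> element from test.md.xlf, copied from the report.
SOURCE = (
    '<source xml:lang="en"><bx id="1"/>Note: the fourth item uses the Unicode '
    'character for <g id="2">Roman numeral four</g><g id="3"></g>'
    '<ex id="4"/>.<ex id="1"/></source>'
)


def unmatched_codes(source_xml):
    """Return (ids of bx never closed, ids of ex never opened).

    Assumption: bx/ex are paired by equal id, as this report implies.
    """
    root = ET.fromstring(source_xml)
    open_ids = []   # bx ids seen so far that still wait for an ex
    unopened = []   # ex ids that had no earlier bx
    for el in root.iter():
        if el.tag == "bx":
            open_ids.append(el.get("id"))
        elif el.tag == "ex":
            if el.get("id") in open_ids:
                open_ids.remove(el.get("id"))
            else:
                unopened.append(el.get("id"))
    return open_ids, unopened


unclosed, unopened = unmatched_codes(SOURCE)
print(unclosed, unopened)  # prints: [] ['4']
```

Under that assumption, `<ex id="4"/>` has no opening partner in the segment, while the `<bx id="1"/>` / `<ex id="1"/>` pair for “*” is balanced.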
@Jim Hargrave (OLD) suggested using his new merge code in the fix_textunitmerger branch and modifying Tikal to use the “non-simplified mode” of XLIFFWriter. After these changes, the round trip test succeeded, but these error messages were shown during the merge:
Code mismatch in TextUnit tu1
Can't find matching Code(s) id='1' originalId='' data=']'
Can't find matching Code(s) id='4' originalId='' data=']'
and the part of the generated XLIFF file corresponding to “[2*.]” does not look right:
<source xml:lang="en"><it id="1" pos="open">*</it>Note: the fourth item uses the Unicode character for <bpt id="2">[</bpt>Roman numeral four<ept id="2">]</ept><bpt id="3">[</bpt><ept id="3">2</ept><it id="4" pos="close">]</it>.<it id="1" pos="close">*</it></source>
Note that “2” is treated as a closing tag for “[”. Jim does not think “*” should be represented by isolated tags.
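To make the code data in the non-simplified `<source>` easier to inspect, here is a small Python sketch (again not Okapi code; the helper names `inline_codes` and `flattened_text` are invented for this example). It lists each inline code with its data and checks that concatenating all text and code data reproduces the original Markdown line, which is consistent with the round trip succeeding even though the codes look wrong.

```python
import xml.etree.ElementTree as ET

# The non-simplified <source> element, copied from the report.
SOURCE = (
    '<source xml:lang="en"><it id="1" pos="open">*</it>Note: the fourth item '
    'uses the Unicode character for <bpt id="2">[</bpt>Roman numeral four'
    '<ept id="2">]</ept><bpt id="3">[</bpt><ept id="3">2</ept>'
    '<it id="4" pos="close">]</it>.<it id="1" pos="close">*</it></source>'
)

# The single line of test.md.
ORIGINAL = (
    "*Note: the fourth item uses the Unicode character for "
    "[Roman numeral four][2].*"
)


def inline_codes(source_xml):
    """Return (tag, id, data) for every bpt/ept/it child, in document order."""
    root = ET.fromstring(source_xml)
    return [
        (el.tag, el.get("id"), el.text or "")
        for el in root
        if el.tag in ("bpt", "ept", "it")
    ]


def flattened_text(source_xml):
    """Concatenate plain text and code data in document order."""
    return "".join(ET.fromstring(source_xml).itertext())


for tag, code_id, data in inline_codes(SOURCE):
    print(tag, code_id, repr(data))

print(flattened_text(SOURCE) == ORIGINAL)  # prints: True
```

The listing shows `ept` with id 3 carrying the data `2`, and the real closing bracket ending up in the isolated code with id 4.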
Comments (3)
reporter: I am still not able to reproduce the claim. But I am sure that MarkdownFilter is not handling the full link references correctly. I created issue #1124, MarkdownFilter is not handling full reference links correctly, generating three codes for [link-label]; that is most likely causing, or strongly contributing to, the unit test case failure.
I confirmed the Segmenter is changing the isolated codes to paired codes (bpt/ept).
It happens here:
We get both the **source** and target content from the XLIFF file and then try to match them against the (unsegmented) source of the original file on merge; that is why we are getting mismatches.
I think what we need for the MarkdownFilter is a smart post-processor that can repair these illegal codes before they leave the filter.
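One possible shape for such a post-processor is sketched below in Python. This is only an illustration of the idea, not Okapi's API and not necessarily the repair Jim has in mind: the `Code` class, its `kind` labels, and the `repair` function are all hypothetical. The strategy assumed here is to demote any opening or closing code that has no properly nested partner with the same id to a standalone placeholder.

```python
from dataclasses import dataclass


@dataclass
class Code:
    kind: str  # hypothetical labels: "open", "close", or "placeholder"
    id: int
    data: str


def repair(codes):
    """Return a copy of codes where unpaired open/close codes are placeholders.

    A close code is accepted only if it matches the most recent
    still-open code (same id); everything else is demoted.
    """
    repaired = list(codes)
    stack = []  # indexes of open codes still waiting for their close
    for i, code in enumerate(codes):
        if code.kind == "open":
            stack.append(i)
        elif code.kind == "close":
            if stack and codes[stack[-1]].id == code.id:
                stack.pop()
            else:
                repaired[i] = Code("placeholder", code.id, code.data)
    for i in stack:  # open codes that never found a close
        repaired[i] = Code("placeholder", codes[i].id, codes[i].data)
    return repaired


# Codes modeled on the non-simplified <source> quoted in this issue.
codes = [
    Code("open", 1, "*"),
    Code("open", 2, "["),
    Code("close", 2, "]"),
    Code("open", 3, "["),
    Code("close", 3, "2"),
    Code("close", 4, "]"),
    Code("close", 1, "*"),
]
print([c.kind for c in repair(codes)])
```

With this input only the closing code with id 4 is demoted. The sketch does not address the separate problem of `2` being emitted as code data; that would presumably still need a fix in how the filter tokenizes the link label.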