tikal manual round trip test fails on a markdown file that has an emphasized paragraph with a full reference link

Issue #1123 new
Kuro Kurosaka (BH Lab) created an issue

This bug was found during discussion https://groups.google.com/g/okapi-devel/c/sXySLd5pYIM

The attached test.md file consists of this one line:

*Note: the fourth item uses the Unicode character for [Roman numeral four][2].*

which is an emphasized paragraph that uses Markdown’s full reference link construct.

When tikal is run for a round trip test, that is:

tikal.sh -x test.md
tikal.sh -m test.md.xlf

The reconstructed test.out.md differs from the original file:

*Note: the fourth item uses the Unicode character for [Roman numeral four][2*.]

Notice “[2].*” at the end of the line has become “[2*.]”.

Note that the “*” characters are represented by <bx id="1"/> and <ex id="1"/> in test.md.xlf, as in:

<source xml:lang="en"><bx id="1"/>Note: the fourth item uses the Unicode character for <g id="2">Roman numeral four</g><g id="3"></g><ex id="4"/>.<ex id="1"/></source>
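To see why a merger would complain about these codes, the inline codes of that `<source>` can be run through a simple nesting scan. The sketch below is purely illustrative and self-contained (the `Code` record and `unmatchedCloses` method are hypothetical names, not Okapi’s actual `TextUnitMerger` logic): the `<ex id="4"/>` has no opening `<bx id="4"/>` anywhere in the segment, which is consistent with the “Can't find matching Code(s)” errors reported later.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class CodePairCheck {
    // One inline-code event: kind is "open" (<bx>), "close" (<ex>) or "paired" (<g>).
    record Code(String kind, int id) {}

    // Return the ids of closing codes that have no properly nested opener.
    static List<Integer> unmatchedCloses(List<Code> codes) {
        Deque<Integer> openIds = new ArrayDeque<>();
        List<Integer> unmatched = new ArrayList<>();
        for (Code c : codes) {
            switch (c.kind()) {
                case "open" -> openIds.push(c.id());
                case "close" -> {
                    if (!openIds.isEmpty() && openIds.peek() == c.id()) {
                        openIds.pop();           // well-nested pair
                    } else {
                        unmatched.add(c.id());   // no opener to pair with
                    }
                }
                default -> { }                   // <g> pairs are self-contained
            }
        }
        return unmatched;
    }

    public static void main(String[] args) {
        // Code events in the order they appear in the <source> above:
        // <bx id="1"/> <g id="2"> <g id="3"> <ex id="4"/> <ex id="1"/>
        List<Code> codes = List.of(
                new Code("open", 1), new Code("paired", 2), new Code("paired", 3),
                new Code("close", 4), new Code("close", 1));
        System.out.println(unmatchedCloses(codes)); // [4]
    }
}
```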

@Jim Hargrave (OLD) suggested using his new merge code in the fix_textunitmerger branch and modifying Tikal to use the “non-simplified mode” of XLIFFWriter. With these changes the round trip test succeeded, but these error messages were shown during the merge:

Code mismatch in TextUnit tu1
Can't find matching Code(s) id='1' originalId='' data=']'
Can't find matching Code(s) id='4' originalId='' data=']'

and the part of the generated XLIFF file corresponding to “[2*.]” does not look right:

<source xml:lang="en"><it id="1" pos="open">*</it>Note: the fourth item uses the Unicode character for <bpt id="2">[</bpt>Roman numeral four<ept id="2">]</ept><bpt id="3">[</bpt><ept id="3">2</ept><it id="4" pos="close">]</it>.<it id="1" pos="close">*</it></source>

Note that “2” is treated as a closing tag for “[”. Jim does not think “*” should be represented by isolated tags.

Comments (3)

  1. Jim Hargrave (OLD)

    I confirmed the Segmenter is changing the isolated codes to paired codes (bpt/ept).

    It happens here:  

     getSource().getSegments().create(segmenter.getRanges());
    

    Since we get both the **source** and target content from the XLIFF file and then try to match them against the original file's (unsegmented) source on merge, we get mismatches.

    I think what we need for the MarkDownFilter is a smart post-processor that can repair these illegal codes before they leave the filter.

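One possible shape for the “smart post-processor” idea is sketched below. It is purely illustrative (the `Tok` model, `repair` method, and demotion strategy are assumptions, not code from the fix_textunitmerger branch): it balances open/close events per code id within a fragment and demotes any unbalanced code back to literal text, so no isolated open or close leaves the filter.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class CodeRepair {
    // Minimal inline-code model: literal text uses id = -1.
    record Tok(int id, boolean open, String data) {}

    /** Demote codes whose open/close events don't balance to plain text. */
    static List<Tok> repair(List<Tok> toks) {
        Map<Integer, Integer> balance = new HashMap<>();
        for (Tok t : toks) {
            if (t.id() >= 0) {
                balance.merge(t.id(), t.open() ? 1 : -1, Integer::sum);
            }
        }
        return toks.stream()
                .map(t -> (t.id() >= 0 && balance.get(t.id()) != 0)
                        ? new Tok(-1, false, t.data())  // keep raw text, drop the code
                        : t)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        // id 4 appears only as a close (like the stray <ex id="4"/> above);
        // after repair its "]" survives as literal text, not as a code.
        List<Tok> repaired = repair(List.of(
                new Tok(1, true, "*"), new Tok(-1, false, "Note"),
                new Tok(4, false, "]"), new Tok(1, false, "*")));
        System.out.println(repaired);
    }
}
```

A real implementation would also need to handle reordering and conversion between isolated and paired code types, but the balancing pass shows the basic idea.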