MarkdownFilter is not handling full reference links correctly, generating three codes for [link-label]

Issue #1124 new
Kuro Kurosaka (BH Lab) created an issue

It seems that MarkdownFilter is not handling full reference links correctly.

For example, the attached file consists of a single line that looks like this:

Note: the fourth item uses the Unicode character for [Roman numeral four][2].

From this, -x generates an XLIFF file that contains this source element:

<source xml:lang="en">Note: the fourth item uses the Unicode character for <g id="1">Roman numeral four</g><g id="2"></g><ex id="3"/>.</source>

From other experiments and observations, it is believed that:

  • <g id="2"> corresponds to “[”
  • </g> corresponds to “2”
  • <ex id="3"/> corresponds to “]”

This does not make much sense. “[2]” is a reference label that points to another line containing the target URL and the help tip. It should probably generate just one generic placeholder, <x id="2"/>. (Another thought is that, since the reference label (“2” in this case) can be a meaningful word or phrase, it should be subject to translation. In that case the reference label should be treated as a subflow that generates a separate translation unit. Even if we follow that thought, “[2]” should generate a generic placeholder <x id="2"/>.)

Comments (7)

  1. Kuro Kurosaka (BH Lab) reporter

    Running the MarkdownFilter under a debugger shows that MarkdownParser.parse("This is [anchor text][ref].\n") adds these tokens to tokenQueue:

     0 = {MarkdownToken@3253} "This is" , true, TEXT
     1 = {MarkdownToken@3254} "[", false, LINK_REF
     2 = {MarkdownToken@3255} "anchor text", true, TEXT
     3 = {MarkdownToken@3256} "]", false, LINK_REF  
     4 = {MarkdownToken@3257} "[", false, LINK_REF
     5 = {MarkdownToken@3258} "linkRef", false, LINK_REF
     6 = {MarkdownToken@3268} "]", false, LINK_REF

    Processing these tokens eventually calls the private method addCode(final MarkdownToken token) with each token in sequence. At that point there is no obvious distinction between the LINK_REF tokens that come from “[anchor text]” and those that come from “[ref]”, so both “[x]” forms are handled in the same way. That seems to be causing this bug. A good fix seems to be to change the parse method so that it generates a token of a new token type for the full reference link label “[ref]”, and to enhance the addCode method to be aware of the new token type.
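The fix proposed in comment 1 can be sketched roughly as follows. This is only an illustration under stated assumptions: the class, the Token type, the LINK_REF_LABEL token type, and the render method are all hypothetical names made up for this sketch, and none of it is the actual MarkdownParser or MarkdownFilter code. It assumes the parser emits the whole “[ref]” as a single token of the new type, which the code-building step then turns into one standalone placeholder.

```java
import java.util.List;

/** Hypothetical sketch of the proposed fix; these are NOT the real Okapi classes. */
final class FullReferenceLinkSketch {

    /** LINK_REF_LABEL is the assumed new token type for the trailing "[ref]". */
    enum TokenType { TEXT, LINK_REF, LINK_REF_LABEL }

    static final class Token {
        final String content;
        final TokenType type;

        Token(String content, TokenType type) {
            this.content = content;
            this.type = type;
        }
    }

    /** Renders the tokens as an XLIFF 1.2-style inline string. */
    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        int nextId = 1;
        for (Token t : tokens) {
            switch (t.type) {
                case TEXT:
                    out.append(t.content);
                    break;
                case LINK_REF:
                    // "[" opens a paired code around the anchor text, "]" closes it.
                    if ("[".equals(t.content)) {
                        out.append("<g id=\"").append(nextId++).append("\">");
                    } else {
                        out.append("</g>");
                    }
                    break;
                case LINK_REF_LABEL:
                    // The whole "[ref]" becomes one standalone placeholder.
                    out.append("<x id=\"").append(nextId++).append("\"/>");
                    break;
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("This is ", TokenType.TEXT),
            new Token("[", TokenType.LINK_REF),
            new Token("anchor text", TokenType.TEXT),
            new Token("]", TokenType.LINK_REF),
            new Token("[ref]", TokenType.LINK_REF_LABEL),
            new Token(".", TokenType.TEXT));
        // Prints: This is <g id="1">anchor text</g><x id="2"/>.
        System.out.println(render(tokens));
    }
}
```

With a distinct token type, the code-building step no longer has to guess from position whether a bracket belongs to the anchor text or to the label.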

  2. Kuro Kurosaka (BH Lab) reporter

    The current code has hard-coded knowledge that “[” is the opening tag, but it does not check for “]” as the closing tag. Instead, it treats whatever token comes next that isn’t a quoted string as a closing tag. Because of that, “linkRef” is treated as a closing tag (code).

    See the definition of net.sf.okapi.filters.markdown.MarkdownFilter#addCode, currently found around

    The token representation of the full-reference link needs to be changed.
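The pairing behavior described in comment 2 can be modeled in a few lines to see why three codes come out. This is a hypothetical stand-alone model, not the actual addCode source: the class, the Token type, and the render method are invented for illustration, and the choice to emit an orphan closing code as <ex/> is an assumption taken from the XLIFF output shown in the issue description.

```java
import java.util.List;

/** Hypothetical model of the pairing heuristic described above; not the real addCode. */
final class LinkRefPairingModel {

    static final class Token {
        final String content;
        final boolean isCode; // true for LINK_REF-like tokens, false for text

        Token(String content, boolean isCode) {
            this.content = content;
            this.isCode = isCode;
        }
    }

    static String render(List<Token> tokens) {
        StringBuilder out = new StringBuilder();
        int nextId = 1;
        int openCodes = 0;
        for (Token t : tokens) {
            if (!t.isCode) {
                out.append(t.content);
            } else if ("[".equals(t.content)) {
                // Only "[" is recognized as an opening code.
                out.append("<g id=\"").append(nextId++).append("\">");
                openCodes++;
            } else if (openCodes > 0) {
                // Any other code token closes the open code, even "linkRef".
                out.append("</g>");
                openCodes--;
            } else {
                // Nothing left to close: emitted as an orphan closing code.
                out.append("<ex id=\"").append(nextId++).append("\"/>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        List<Token> tokens = List.of(
            new Token("This is ", false),
            new Token("[", true),
            new Token("anchor text", false),
            new Token("]", true),
            new Token("[", true),
            new Token("linkRef", true),
            new Token("]", true),
            new Token(".", false));
        // Prints: This is <g id="1">anchor text</g><g id="2"></g><ex id="3"/>.
        System.out.println(render(tokens));
    }
}
```

The model output has the same shape as the reported source element: an empty <g id="2"></g> pair followed by an orphan <ex id="3"/>.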

  3. Kuro Kurosaka (BH Lab) reporter

    Due to my personal circumstances, I won’t be able to work on this for a month or two. If anyone would like to take it over, feel free. (I tried to remove myself as Assignee but I couldn’t figure out how.)

  4. Jim Hargrave Work

    @bhlkuro No worries, Kuro, we are all in the same boat. Thanks for documenting this so well!
