okapiframework / Okapi issue #685: Markdown filter extraction/re-merge produces an extra <ul> tag in table cells

Issue #685 resolved

Kuro Kurosaka created an issue 2018-02-21

The attached file, ul-in-table-cell.md, has a table cell that includes an HTML unordered list element, <ul>...</ul>. When tikal -x is applied to this file to generate the .md.xlf file, and tikal -m is then applied to regenerate the .md file, the regenerated file has an extra <ul> tag.

Original cell content: <ul><li>item1</li><li>item2</li></ul>
Re-generated: <ul><ul><li>item1</li><li>item2</li></ul>

Comments (6)

Kuro Kurosaka reporter
Here is what I know so far. The Markdown filter uses the flexmark-java library to parse the MD document. flexmark-java reads the document, builds a tree of Nodes, and returns it to the Markdown filter. When a stand-alone run of HTML occupies an entire line or several lines, the whole run is put together into a single HtmlBlock node, and the Markdown filter passes it to the HTML filter for further processing.

This doesn't happen for HTML tags within a table cell. flexmark-java creates a node of type HtmlInline for each HTML tag, and currently the Markdown filter passes each node to the HTML filter separately. So for this example, "<ul>" is passed to the HTML filter, then "<li>" is passed to the HTML filter, and so forth. Because "<ul>" is a tag that requires an end tag, the HTML subfilter gets confused when it sees only "<ul>". It tries to do something about it, and what it currently does is add another "<ul>" (not "</ul>"). The author of flexmark-java confirmed that mapping each HTML tag to its own HtmlInline node is intended behavior.

One possible solution is for Flexmark to detect neighboring HtmlInline nodes and Text nodes and put them together to create a longer span of HTML. To be exact, when an HtmlInline node is found, we start collecting the string for as long as the next node is an HtmlInline, HtmlComment, or Text node, and stop when we see anything else. Then we back up to the last HtmlInline node and pass the concatenated text to the HTML filter.

This will most likely have the side effect of misidentifying Text nodes that are not, in theory, meant to be part of an HTML block, but it might be worth trying.
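
The collection step described above could be sketched roughly as follows. This is only an illustration under stated assumptions: `Kind`, `Node`, `Chunk`, and `HtmlRunCollector` are hypothetical stand-ins invented for this sketch, not flexmark-java or Okapi classes, and the real filter would walk the parser's node tree rather than a flat list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class HtmlRunCollector {

    // Hypothetical node kinds; they only mirror the node types named in the comment.
    enum Kind { HTML_INLINE, HTML_COMMENT, TEXT, OTHER }

    // Hypothetical stand-in for a parser node (not the flexmark-java Node class).
    static final class Node {
        final Kind kind;
        final String chars;
        Node(Kind kind, String chars) { this.kind = kind; this.chars = chars; }
    }

    // One output unit: an HTML span meant for the HTML subfilter, or a single untouched node.
    static final class Chunk {
        final boolean html;
        final String text;
        Chunk(boolean html, String text) { this.html = html; this.text = text; }
    }

    static boolean isCollectable(Kind k) {
        return k == Kind.HTML_INLINE || k == Kind.HTML_COMMENT || k == Kind.TEXT;
    }

    static List<Chunk> collect(List<Node> nodes) {
        List<Chunk> out = new ArrayList<>();
        int i = 0;
        while (i < nodes.size()) {
            Node n = nodes.get(i);
            if (n.kind != Kind.HTML_INLINE) {
                // Anything that does not start an HTML run is passed through as-is.
                out.add(new Chunk(false, n.chars));
                i++;
                continue;
            }
            // Scan forward over HtmlInline, HtmlComment, and Text nodes,
            // remembering the position of the last HtmlInline node seen.
            int lastInline = i;
            int j = i + 1;
            while (j < nodes.size() && isCollectable(nodes.get(j).kind)) {
                if (nodes.get(j).kind == Kind.HTML_INLINE) {
                    lastInline = j;
                }
                j++;
            }
            // Back up to the last HtmlInline node and concatenate the run up to it.
            StringBuilder span = new StringBuilder();
            for (int k = i; k <= lastInline; k++) {
                span.append(nodes.get(k).chars);
            }
            out.add(new Chunk(true, span.toString()));
            // Nodes scanned past the last HtmlInline are revisited as ordinary nodes.
            i = lastInline + 1;
        }
        return out;
    }

    public static void main(String[] args) {
        List<Node> cell = Arrays.asList(
            new Node(Kind.HTML_INLINE, "<ul>"),
            new Node(Kind.HTML_INLINE, "<li>"),
            new Node(Kind.TEXT, "item1"),
            new Node(Kind.HTML_INLINE, "</li>"),
            new Node(Kind.HTML_INLINE, "</ul>"),
            new Node(Kind.TEXT, " trailing text"));
        for (Chunk c : collect(cell)) {
            System.out.println((c.html ? "HTML: " : "NODE: ") + c.text);
        }
    }
}
```

In this sketch the trailing Text node stays outside the HTML span because of the back-up step, while Text nodes that sit between two HtmlInline nodes are always absorbed into the span, which is where the misidentification mentioned above could occur.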
- 2018-02-23T17:37:32+00:00
Kuro Kurosaka reporter
- attached DirectShape.md
DirectShape.md has a <tbody> tag and <td> tags directly under the <table> tag. Each of them appears twice after the xlf file is merged. Another report says that if all empty lines are removed, the duplication doesn't happen. This is suspected to share the same cause.
- 2018-03-06T23:17:32+00:00
Kuro Kurosaka reporter
It turned out that the issue with DirectShape.md is closely related to this one but reveals a more fundamental problem. A separate issue, issue 694, has been filed.
- 2018-03-14T08:31:13+00:00
Kuro Kurosaka reporter
It turned out that the HTML filter has difficulty handling partial (unbalanced) HTML text only in its okf_html-wellformed variant, which the Markdown filter's HTML subfilter configuration was based on. When I switched to an okf_html-based configuration, the extra "ul" event was no longer inserted.
- 2018-03-17T06:47:35+00:00
Chase Tingley
- changed version to M35
- assigned issue to Kuro Kurosaka
- changed milestone to M36
- 2018-03-30T21:26:02+00:00
Chase Tingley
- changed status to resolved
- 2018-03-30T21:26:09+00:00

Assignee: Kuro Kurosaka

Type: bug

Priority: major

Status: resolved

Milestone: M36

Version: M35

Votes: 0

Watchers: 1
