Markdown: extracted XLIFF has `&#13;` on every line of source/target for code blocks when the source file has CR/LF (DOS) line endings

Issue #820 resolved
Kuro Kurosaka created an issue

When extracting a Markdown file that has code blocks (indented or fenced) and DOS/Windows-style CR/LF line endings, the numeric entity for CR (`&#13;`) is found at the end of every line except the last for the code blocks in the generated XLIFF file. The same symptom is observed for hard line breaks after fixing issue #695.
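
A plausible mechanism for the symptom, offered as an assumption rather than a reading of the actual writer code: XML parsers normalize line ends on input, so a writer that wants a literal CR to survive has to emit it as a numeric character reference. If the CR of each CR/LF pair reaches the XLIFF writer as ordinary text, it is escaped like any other special character. The class `XmlTextEscaper` below is hypothetical and is not part of Okapi.

```java
// Hypothetical illustration only; this is not Okapi's XLIFF writer.
final class XmlTextEscaper {
    private XmlTextEscaper() {}

    /** Escapes XML text content, writing a bare CR as a numeric character reference. */
    static String escape(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': out.append("&amp;"); break;
                case '<': out.append("&lt;"); break;
                case '>': out.append("&gt;"); break;
                // A literal CR would be lost when the file is parsed again,
                // so it is written as a character reference instead.
                case '\r': out.append("&#13;"); break;
                default: out.append(c);
            }
        }
        return out.toString();
    }
}
```

With this sketch, a code-block line that still ends in CR/LF comes out with `&#13;` just before the line feed, which matches the reported output.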

This suggests that the filter is not following the Developer Guide’s recommendation on line breaks, under which the end-of-line should be normalized to LF regardless of the platform or the input file.
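
The normalization described here can be sketched as follows. This is a minimal illustration, assuming the filter records the original line-break type at extraction time and restores it at merge time. `LineBreakNormalizer` and its methods are hypothetical names, not Okapi classes.

```java
// Hypothetical helper, not part of the Okapi API.
final class LineBreakNormalizer {
    private LineBreakNormalizer() {}

    /** Returns the first line break found ("\r\n", "\r" or "\n"), or "\n" if there is none. */
    static String detect(String text) {
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\n') {
                return "\n";
            }
            if (c == '\r') {
                boolean crlf = i + 1 < text.length() && text.charAt(i + 1) == '\n';
                return crlf ? "\r\n" : "\r";
            }
        }
        return "\n";
    }

    /** Converts CR/LF pairs and lone CRs to LF, so extracted text only ever contains LF. */
    static String normalize(String text) {
        return text.replace("\r\n", "\n").replace('\r', '\n');
    }

    /** Restores the original line-break type when the document is written back. */
    static String restore(String normalized, String lineBreak) {
        return "\n".equals(lineBreak) ? normalized : normalized.replace("\n", lineBreak);
    }
}
```

In this sketch, a filter would call `detect` once on the raw input, hand only `normalize`d text to the extraction step, and call `restore` when merging, so no CR ever reaches the XLIFF file.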

This happens even when tikal is run on Windows.

Comments (6)

  1. Kuro Kurosaka reporter
    • edited description

    When extracting a Markdown file with code blocks (indented or fenced) with a DOS/Windows type CR/LF ending, the numeric entity for CR (`&#13;`) is found at the end of every line for the code blocks in the generated XLIFF file.

    This is reproducible with M37. This happens even when tikal is run on Windows.

  2. Kuro Kurosaka reporter
    • marked as minor
    • edited description
    • changed version to M37

    When extracting a Markdown file with code blocks (indented or fenced) with a DOS/Windows type CR/LF ending, the numeric entity for CR (`&#13;`) is found at the end of every line except the last for the code blocks in the generated XLIFF file. The same symptom is observed for hard line breaks after fixing issue #695.

    This is reproducible with M37. This happens even when tikal is run on Windows.

  3. Kuro Kurosaka reporter

    Pull request #313 has been made.

    This is fixed by using the DefaultEncoder rather than the MarkdownEncoder, which had been put in place to prevent test failures on Windows. MarkdownEncoder is basically a no-op encoder that keeps the SkeletonWriter (GenericSkeletonWriter) from adjusting the line endings to the type of the original document. It was probably needed because the line ending was treated as a code. This fix treats the line break as a normal line ending, i.e. LF-only in TextUnit. A good side effect is that, after this fix, a code block like this:

    ```
    public void foo() {
              do_something();
    }
    ```
    

    will result in this source element in XLIFF:

    <source xml:lang="en"><x id="1"/>public void foo() {
    <x id="2"/>          do_something();
    <x id="3"/>}
    </source>
    

    rather than the previous:

    <source xml:lang="en"><x id="1"/>public void foo() {<x id="2"/><x id="3"/>          do_something();<x id="4"/><x id="5"/>}<x id="6"/></source>
    

    which is very hard to read and understand.

    Even after this fix, the reconstructed (extracted+merged) document of code-blocks-crlf.md is different from the original file; it does not retain the extra spaces before the code block when the fence has extra spaces. This should not be a real problem because they are semantically equivalent. But if that is a problem, it should be addressed as a new issue.
