Markdown: extraction preserve spaces

Issue #1196 new
Mihai Nita created an issue

Taking the text here and extracting it with tikal -x:

The line break here
becomes a space, but the one here  
and here \
should be preserved.

(two spaces after the “one here”)

The result is:

<trans-unit id="tu10" xml:space="preserve">
<source xml:lang="en">The line break here
becomes a space, but the one here<x id="1"/>
and here <x id="2"/>
should be preserved.</source>
</trans-unit>

So the newlines are saved as newlines, and the whole trans-unit has xml:space="preserve".

I expected something similar to HTML, where a newline becomes a space (inside the same trans-unit).

That is how markdown is rendered.
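
To make the expectation concrete, here is a minimal Python sketch of the unwrapping rule described above. It is not Okapi code: the function name `unwrap_paragraph` and the `hard_break` marker are hypothetical, and it only handles a single plain paragraph (no code spans, lists, or other block constructs). It assumes the CommonMark convention that two or more trailing spaces, or a trailing backslash, make a hard line break, and that any other newline inside a paragraph is a soft break.

```python
import re

# Assumption: CommonMark-style hard breaks, i.e. 2+ trailing spaces
# or a single trailing backslash at the end of a line.
_HARD_BREAK = re.compile(r"( {2,}|\\)$")


def unwrap_paragraph(paragraph: str, hard_break: str = "<br/>") -> str:
    """Join the lines of one Markdown paragraph roughly as a renderer would.

    Soft line breaks become a single space; hard line breaks are replaced
    by the `hard_break` marker (a stand-in for an inline placeholder).
    """
    lines = paragraph.split("\n")
    out = []
    for index, raw in enumerate(lines):
        line = raw.lstrip()  # leading spaces on continuation lines are dropped
        is_last = index == len(lines) - 1
        match = _HARD_BREAK.search(line)
        if match and not is_last:
            # Hard break: drop the break syntax, keep an explicit marker.
            out.append(line[:match.start()].rstrip() + hard_break)
        else:
            # Soft break: trailing whitespace collapses into one space.
            out.append(line.rstrip() + ("" if is_last else " "))
    return "".join(out)


sample = (
    "The line break here\n"
    "becomes a space, but the one here  \n"
    "and here \\\n"
    "should be preserved."
)
print(unwrap_paragraph(sample))
# The line break here becomes a space, but the one here<br/>and here<br/>should be preserved.
```

With this rule the translator sees only the breaks that actually render; the marker plays the role of the `<x id="…"/>` placeholders in the XLIFF above.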

HTML only adds xml:space="preserve" to trans-units extracted from <pre>.

HTML in:

<p>The line break here
becomes a space, but the one here<br>
should be preserved.</p>

HTML out as XLIFF:

<trans-unit id="tu12" restype="x-paragraph">
<source xml:lang="en">The line break here becomes a space, but the one here<x id="1"/> should be preserved.</source>
</trans-unit>

I think that behavior makes more sense.

Thanks,
Mihai

Comments (2)

  1. jhargrave-straker

    Note that we decided a while back to force xml:space="preserve" on the text units of any extracted XLIFF file when we merge (OriginalDocumentXliffMergerStep). We do this because a container format should always preserve the original format as generated by the filter.

  2. Mihai Nita reporter

    Yes, and I agree that space preserve is a good default.

    But the way markdown is extracted now is inconsistent with how html is extracted, and I think it exposes the translators to the markdown conventions.

    html:

    <p>This is some longer line
    with nl in random places 
    that is rendered at runtime 
    into a single line with 
    collapsed spaces.</p>
    

    extracts as:

    <trans-unit id="tu3" restype="x-paragraph">
    <source xml:lang="en">This is some longer line with nl in random places that is rendered at runtime into a single line with collapsed spaces.</source>
    </trans-unit>
    

    the equivalent markdown:

    This is some longer line
    with nl in random places 
    that is rendered at runtime 
    into a single line with 
    collapsed spaces.
    

    extracts as:

    <trans-unit id="tu3" xml:space="preserve">
    <source xml:lang="en">This is some longer line
    with nl in random places
    that is rendered at runtime
    into a single line with
    collapsed spaces.</source>
    </trans-unit>
    

    As an experienced translator (or a translator with a tool that has a “live preview”), between space:preserve and seeing all the newlines, I 100% expect that the newlines matter.

    So I will try to match them, or remove them, or move them around where it makes more sense.
    And if at some point someone files a bug saying that lines break in the wrong places, I will be puzzled, because I can see the line breaks, and they are where they should be.

    If (for some reason) I use several spaces at the end of a line, or the beginning of one, the resulting .md has forced line breaks, or even code paragraphs.
    That is what I mean by “exposes the translators to the markdown conventions”.


    TLDR: I would expect both examples to extract the same. They don’t render wrapped, and spaces don’t matter.

    So unwrap the markdown lines that are unwrapped in rendering.

    And even remove the space:preserve, because that attribute has its own place. It does not have to be blindly applied, only where it makes sense.
