Markdown filter: extra spaces and empty lines added/removed

Issue #687 open
Kuro Kurosaka created an issue

When a .md file is processed by tikal.sh -x to generate a .md.xlf file, and then merged without change by tikal.sh -m, the resulting .out.md file differs from the original .md file. The output tends to contain extra spaces and empty lines.

Two sample files, taken from a comment on issue #686, are attached.

$ ./tikal.sh -x space1.md
$ ./tikal.sh -m space1.md.xlf 
$ diff space1.md space1.out.md
1,8c1,10
< *   **PointCloudType** - type of point cloud loaded into a Revit document. Each PointCloudType maps to a single file or identifier (depending upon the type of Point Cloud Engine which governs it).
< *   **PointCloudInstance** - an instance of a point cloud in a location in the Revit project.
< *   **PointCloudFilter** - a filter determining the volume of interest when extracting points.
< *   **PointCollection** - a collection of points obtained from an instance and a filter.
< *   **PointIterator** - an iterator for the points in a PointCollection.
< *   **CloudPoint** - an individual point cloud point, representing an X, Y, Z location in the coordinates of the cloud, and a color.
< *   **PointCloudOverrides** - and its related settings classes specify graphic overrides that are stored by a view to be applied to a PointCloudInstance element, or a scan within the element.
< ### Point cloud file paths
---
> * **PointCloudType** - type of point cloud loaded into a Revit document. Each PointCloudType maps to a single file or identifier (depending upon the type of Point Cloud Engine which governs it).
> * **PointCloudInstance** - an instance of a point cloud in a location in the Revit project.
> * **PointCloudFilter** - a filter determining the volume of interest when extracting points.
> * **PointCollection** - a collection of points obtained from an instance and a filter.
> * **PointIterator** - an iterator for the points in a PointCollection.
> * **CloudPoint** - an individual point cloud point, representing an X, Y, Z location in the coordinates of the cloud, and a color.
> * **PointCloudOverrides** - and its related settings classes specify graphic overrides that are stored by a view to be applied to a PointCloudInstance element, or a scan within the element.
> 
>    ### Point cloud file paths
> 

Comments (9)

  1. Sun Hang

    Another case for this issue:

    *   **PointCloudType** - type of point cloud loaded into a Revit document. Each PointCloudType maps to a single file or identifier (depending upon the type of Point Cloud Engine which governs it).
    *   **PointCloudInstance** - an instance of a point cloud in a location in the Revit project.
    *   **PointCloudFilter** - a filter determining the volume of interest when extracting points.
    *   **PointCollection** - a collection of points obtained from an instance and a filter.
    *   **PointIterator** - an iterator for the points in a PointCollection.
    *   **CloudPoint** - an individual point cloud point, representing an X, Y, Z location in the coordinates of the cloud, and a color.
    *   **PointCloudOverrides** - and its related settings classes specify graphic overrides that are stored by a view to be applied to a PointCloudInstance element, or a scan within the element.
    
        ### Point cloud file paths
    

    If there are several spaces before the hash, a new line is added after each line (using tikal):

    * **PointCloudType** - type of point cloud loaded into a Revit document. Each PointCloudType maps to a single file or identifier (depending upon the type of Point Cloud Engine which governs it).
    
    * **PointCloudInstance** - an instance of a point cloud in a location in the Revit project.
    
    * **PointCloudFilter** - a filter determining the volume of interest when extracting points.
    
    * **PointCollection** - a collection of points obtained from an instance and a filter.
    
    * **PointIterator** - an iterator for the points in a PointCollection.
    
    * **CloudPoint** - an individual point cloud point, representing an X, Y, Z location in the coordinates of the cloud, and a color.
    
    * **PointCloudOverrides** - and its related settings classes specify graphic overrides that are stored by a view to be applied to a PointCloudInstance element, or a scan within the element.
    
       ### Point cloud file paths
    
  2. Kuro Kurosaka reporter
    ./tikal.sh -x empty-line-test.md
    ./tikal.sh -m empty-line-test.md.xlf
    

    generates an empty-line-test.out.md that has only one empty line between non-empty lines. Extra empty lines are removed.

    However, this behavior may be correct from the Markdown spec point of view. It seems that a run of more than one empty line is interpreted as a single empty line and rendered as such. This is how empty-line-test.md is rendered on GitHub: empty-line-test-rendered-on-github.png
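
    The normalization described above can be sketched in a few lines of Python. This is only an illustration of the observed effect, not code from the Markdown filter; the helper name collapse_blank_runs is hypothetical, and the sketch deliberately ignores fenced or indented code blocks, where blank lines are significant.

```python
import re


def collapse_blank_runs(text: str) -> str:
    """Replace each run of two or more blank lines with one blank line.

    A "blank line" here is a line that holds only spaces or tabs.
    Assumption: the text contains no code blocks whose blank lines matter.
    """
    return re.sub(r"\n([ \t]*\n){2,}", "\n\n", text)


if __name__ == "__main__":
    sample = "first paragraph\n\n\n\nsecond paragraph\n"
    print(repr(collapse_blank_runs(sample)))
```

    Comparing the result of such a helper on the original file with the merged .out.md file would be one way to check whether the only difference is the number of empty lines.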

  3. Kuro Kurosaka reporter

    In this test, each line starts with a different number of spaces. Within a Markdown paragraph unit, i.e. a run of lines with no empty line between them, the leading spaces are completely removed. This is justifiable from the Markdown syntax point of view because the renderer removes them anyway; leading spaces mean nothing when the lines are rendered as a paragraph. See the screenshot of the GitHub rendering of this Markdown file: simple-space-test-rendered-on-github.png

    From each Markdown paragraph, meaning a line that follows an empty line and is itself followed by an empty line, 1-3 leading spaces are removed and 4-6 leading spaces are reduced to 4 spaces. Removing 1, 2, or 3 leading spaces could be justified because they are semantically equivalent in Markdown. But normalizing 4, 5, or 6 spaces to 4 spaces is not justifiable because they have different semantics: 4 leading spaces indicate the beginning of a code block, and spaces after the 4th are rendered as they are. See the screenshot.
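
    The distinction can be made concrete with a small Python sketch of a normalization that would respect these semantics. It is an illustration only, not the filter's implementation; the helper name normalize_leading_spaces is hypothetical, and it assumes the line starts a new block (it follows an empty line) and is indented with spaces rather than tabs.

```python
def normalize_leading_spaces(line: str) -> str:
    """Normalize the indentation of a line that starts a new Markdown block.

    Assumption: the line follows an empty line and uses spaces, not tabs.
    """
    stripped = line.lstrip(" ")
    indent = len(line) - len(stripped)
    if indent < 4:
        # 0-3 leading spaces: still an ordinary paragraph, so dropping
        # them does not change how the line is rendered.
        return stripped
    # 4 or more leading spaces: indented code. Every space after the
    # fourth is part of the code content and must be kept as-is.
    return line


if __name__ == "__main__":
    for text in ["  two spaces", "    four spaces", "      six spaces"]:
        print(repr(normalize_leading_spaces(text)))
```

    Under this rule, a line with 6 leading spaces keeps all 6, whereas the behavior reported here reduces it to 4.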

  4. Kuro Kurosaka reporter

    There are several technical limitations in fixing this issue. Because of these, this issue will not be completely resolved.

    Within HTML

    The HTML filter, which the Markdown filter uses to process HTML elements and blocks, changes the number and kind of whitespace characters, since the amount of whitespace carries no meaning in an HTML element except within the pre element. So any Markdown document that includes HTML blocks with newlines or multiple spaces cannot be restored exactly from the .xlf file. For example,

    <p>This paragraph was originally made of
    two lines and there were extra spaces         here.</p>
    

    will become:

    <p>This paragraph was originally made of two lines and there were extra spaces here.</p>
    

    Extra Newlines at EOF

    A newline is inserted at the end of the file if the original file ends without one. This is necessary because Flexmark loses the information about the end of the file in its subtree on certain occasions. For example, if a list like the one below is at the end of the file and there is no newline at the end:

    * list item 1
    * list item 2, not followed by a newline
    

    Flexmark makes two paragraph nodes, one for "list item 1" and another for "list item 2, not followed by a newline". Usually, a paragraph node includes a newline, but not when it appears as a list item. Because of that, the filter must add a newline after each list item. We could further analyze the top node to see whether a newline is there and whether the node we are dealing with is the last node, but that would further complicate the code and performance would suffer. It was felt that this is an acceptable limitation.
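
    For anyone who needs byte-identical output despite this limitation, one possible workaround is to post-process the merged file outside the filter. The Python sketch below is a different and simpler technique than analyzing the Flexmark tree; the helper name restore_eof_newline is hypothetical, and it assumes LF line endings and that both files fit in memory as strings.

```python
def restore_eof_newline(original: str, merged: str) -> str:
    """Make the merged text end like the original with respect to a final newline.

    Assumption: LF line endings; CRLF files would need extra handling.
    """
    if original.endswith("\n"):
        # The original ended with a newline: ensure the merged text does too.
        return merged if merged.endswith("\n") else merged + "\n"
    # The original had no final newline: drop any newline added at the end.
    return merged.rstrip("\n")


if __name__ == "__main__":
    original = "* list item 1\n* list item 2, not followed by a newline"
    merged = original + "\n"  # what the round trip is described as producing
    print(restore_eof_newline(original, merged) == original)
```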

  5. Kuro Kurosaka reporter
    • changed status to open

    The status of this issue was incorrectly changed to resolved. The recent pull request #226 resolves this issue only partially: it fixes the extra/removed empty lines in certain situations, but it does not resolve the extra/reduced spaces aspect of the issue. I am changing it back to open.
