Does not flag trailing tab character; incompatibility with PyYAML.

Issue #487 invalid
Garret Wilson created an issue

I'm working on a project using SnakeYAML; I've tested this using v1.27, currently the latest.

We have a file in which one line inadvertently ends with a tab character. Here is a reproducible test case:

test:
  - foo →
  - bar

Note that I used → to represent the tab character, Unicode U+0009, as in the YAML specification.
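To reproduce the test case byte for byte, the → placeholder has to be turned back into a real tab. A minimal Python sketch of that step, assuming the arrow never occurs as literal content in the file (the helper name `arrows_to_tabs` is made up for this example):

```python
# Hypothetical helper: turn the spec-style "→" placeholder back into a real tab.
TAB_MARKER = "\u2192"  # the → arrow used in the YAML specification examples


def arrows_to_tabs(text: str) -> str:
    """Replace every → placeholder with an actual tab character (U+0009)."""
    return text.replace(TAB_MARKER, "\t")


# The test case from this report, written with the placeholder.
TEST_CASE = "test:\n  - foo \u2192\n  - bar\n"

if __name__ == "__main__":
    # repr() makes the otherwise invisible tab show up as \t.
    print(repr(arrows_to_tabs(TEST_CASE)))
```

Printing the `repr()` rather than the raw string keeps the trailing tab visible when the output is pasted into a report.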

The Python PyYAML (I tested with 3.13) library will produce something like:

Found character '\t' that cannot start any token.

SnakeYAML however parses the above document with no complaint.
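The PyYAML side of this can be checked with a few lines of Python. This is only a sketch: it assumes the third-party PyYAML package is installed, the helper name `parse_outcome` is invented here, and the exact message may differ between PyYAML versions (I only tested 3.13).

```python
# Sketch only: PyYAML is a third-party package (pip install pyyaml); the
# helper name parse_outcome is made up for this example.
try:
    import yaml
except ImportError:  # keep the sketch importable without PyYAML
    yaml = None

DOC = "test:\n  - foo \t\n  - bar\n"  # note the real tab after "foo "


def parse_outcome(text: str) -> str:
    """Return 'ok', the parser's error message, or a note that PyYAML is missing."""
    if yaml is None:
        return "PyYAML not installed"
    try:
        yaml.safe_load(text)
    except yaml.YAMLError as exc:
        # Scanner and parser errors are both subclasses of YAMLError.
        return str(exc)
    return "ok"


if __name__ == "__main__":
    print(parse_outcome(DOC))
```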

Section 5.5, "Miscellaneous Characters", of the YAML 1.1 specification says (referring to the s-ignored-space production):

An ignored space character outside scalar content. Such spaces are used for indentation and separation between tokens. To maintain portability, tab characters must not be used in these cases, since different systems treat tabs differently. Note that most modern editors may be configured so that pressing the tab key results in the insertion of an appropriate number of spaces.

Indeed "Example 5.12. Invalid Use of Tabs" in the specification shows where tabs are and are not allowed. The explanatory text says (emphasis mine):

Tabs may appear inside comments and quoted or block scalar content. Tabs must not appear elsewhere, such as in indentation and separation spaces.

# Tabs do's and don'ts:
# comment: →
quoted: "Quoted →"
block: |
  void main() {
  →printf("Hello, world!\n");
  }
elsewhere:→# separation
→indentation, in→plain scalar

In particular, the line beginning with "elsewhere" is not allowed, as the text explains.

I'm not an expert in the minutiae of the YAML 1.1 spec, but on the face of it this seems to say that PyYAML is correct and that SnakeYAML isn't correctly flagging an error for a tab character at the end of a line.

One might say that it's better to be lenient, but in the case of parsing interchangeable formats with clear specifications, leniency causes problems. In our case we're using multiple programming languages in our ecosystem. If we have a file that works fine with SnakeYAML but then later breaks somewhere down the data pipeline because of PyYAML, then this is a big problem for us. We need to be able to validate our YAML 1.1 files using SnakeYAML so that we don't get a surprise by PyYAML breaking with the same data.
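Until the parsers agree, one stopgap on our side would be a plain-text check that runs before any YAML parser sees the file. The sketch below is an assumption about how such a check could look, not something either library provides; the name `find_trailing_tabs` is invented. It is deliberately stricter than either parser: it also flags a trailing tab inside a block scalar, where the spec allows tabs.

```python
import re
from typing import List, Tuple

# Matches a run of spaces/tabs at the end of a line that contains at least one tab.
_TRAILING_TAB = re.compile(r"[ \t]*\t[ \t]*$")


def find_trailing_tabs(text: str) -> List[Tuple[int, str]]:
    """Return (line_number, line) pairs for lines whose trailing whitespace contains a tab."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        if _TRAILING_TAB.search(line):
            hits.append((number, line))
    return hits


if __name__ == "__main__":
    sample = "test:\n  - foo \t\n  - bar\n"
    for number, line in find_trailing_tabs(sample):
        # repr() makes the offending tab visible in the report.
        print(f"line {number}: trailing tab in {line!r}")
```

Because it only looks at raw text, the check behaves the same regardless of which YAML library or language consumes the file later in the pipeline.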

Is my interpretation of the YAML 1.1 specification correct, and is the presence of a trailing tab on a YAML line truly a violation of the spec? If so, could SnakeYAML be fixed to correctly catch and report this error?

Thank you.

Comments (5)

  1. Garret Wilson reporter

    Surprisingly, the official YAML reference parser YPaste seems to accept the test case (replacing → with a tab):

    test:
      - foo →
      - bar
    

    See YPaste #2085.

    However it may be that YPaste is following YAML 1.2 and not YAML 1.1. YPaste is considering the tab character as "non-content white space". The YAML 1.2 specification in section 5.5. White Space Characters says that "YAML recognizes two white space characters: space and tab." And in section 6.2. Separation Spaces it says:

    Outside indentation and scalar content, YAML uses white space characters for separation between tokens within a line. Note that such white space may safely include tab characters.

    So it looks like the rules changed for YAML 1.2, and that is probably why YPaste allows this.

    Because SnakeYAML is a YAML 1.1 parser and not a YAML 1.2 parser, it would seem to not be following the specification as closely as PyYAML (which is also a YAML 1.1 parser), which is what is causing our problems. Thus as far as I can tell this is truly a SnakeYAML bug.

  2. Andrey Somov

    Dear Garret, 10 years ago we started trying to follow even the bugs of PyYAML to be compatible. Over time our policy changed. It is now more important to follow the spec than follow PyYAML.

    Your case does not match "Example 5.12. Invalid Use of Tabs". The tab at the end does not separate anything. This is clearly shown in the examples.

    This is just a bug in PyYAML. Please report it in PyYAML.

  3. Garret Wilson reporter

    Good morning, Andrey. I really appreciate your quick response on this.

    > It is now more important to follow the spec than follow PyYAML.

    Yes, I completely agree that the spec is the best thing to follow.

    > The tab at the end does not separate anything. This is clearly shown in the examples.

    I don't have enough experience in YAML tokenization to fully understand what the spec is saying, and I had been focusing on the phrase "ignored space character outside scalar content". For the moment I'll have to take your word for it that this only applies for certain contexts involving certain types of separation.

    I'll file a bug with PyYAML, then. The goal here is definitely that all parsers have consistent behavior with regard to the spec.

    Thanks!
