Document size limit should be applied to single document not the whole input stream

Issue #1065 resolved
Lee Read created an issue

Issue from Mailing List

While the SnakeYAML issues were down due to spam problems, Petr Gladkikh raised this issue on the SnakeYAML
mailing list:

Hello,

// As a side note, the links in the project README to Slack and the Jira did not work for me.

Link to Slack leads to an empty space, and bug-tracker link says "We can't let you see this page". Probably these should be updated. Google groups link is the only one that I could use.

After recent changes (between 1.26 and 1.33), loading long streams of documents hits the document size limit. It is not unusual to have very long or even indefinite input streams. As far as I can tell, the Yaml.loadAll method parses documents lazily and does not accumulate them inside the parser, so a limit on the whole input stream size does not make sense technically. Besides, the error message mentions "YAML document", which implies a per-document limit rather than a per-stream limit.

So I believe the limit should be applied to a single document and be reset every time a complete document is loaded.

An example of the stack trace where the limit is hit (SnakeYAML v1.33) is below. In my case each document in the input file is only a few kilobytes in length (the whole input file is about 12 MB), so there should be no problem reading as many of them as necessary.

Thanks.


The incoming YAML document exceeds the limit: 3145728 code points.
at org.yaml.snakeyaml.scanner.ScannerImpl.fetchMoreTokens(ScannerImpl.java:342)
at org.yaml.snakeyaml.scanner.ScannerImpl.checkToken(ScannerImpl.java:263)
at org.yaml.snakeyaml.parser.ParserImpl$ParseBlockMappingKey.produce(ParserImpl.java:662)
at org.yaml.snakeyaml.parser.ParserImpl.peekEvent(ParserImpl.java:185)
at org.yaml.snakeyaml.comments.CommentEventsCollector$1.peek(CommentEventsCollector.java:57)
at org.yaml.snakeyaml.comments.CommentEventsCollector$1.peek(CommentEventsCollector.java:43)
at org.yaml.snakeyaml.comments.CommentEventsCollector.collectEvents(CommentEventsCollector.java:136)
at org.yaml.snakeyaml.comments.CommentEventsCollector.collectEvents(CommentEventsCollector.java:116)
at org.yaml.snakeyaml.composer.Composer.composeScalarNode(Composer.java:239)
at org.yaml.snakeyaml.composer.Composer.composeNode(Composer.java:208)
at org.yaml.snakeyaml.composer.Composer.composeValueNode(Composer.java:357)
at org.yaml.snakeyaml.composer.Composer.composeMappingChildren(Composer.java:336)
at org.yaml.snakeyaml.composer.Composer.composeMappingNode(Composer.java:311)
at org.yaml.snakeyaml.composer.Composer.composeNode(Composer.java:212)
at org.yaml.snakeyaml.composer.Composer.composeValueNode(Composer.java:357)
at org.yaml.snakeyaml.composer.Composer.composeMappingChildren(Composer.java:336)
at org.yaml.snakeyaml.composer.Composer.composeMappingNode(Composer.java:311)
at org.yaml.snakeyaml.composer.Composer.composeNode(Composer.java:212)
at org.yaml.snakeyaml.composer.Composer.getNode(Composer.java:134)
at org.yaml.snakeyaml.constructor.BaseConstructor.getData(BaseConstructor.java:168)
at org.yaml.snakeyaml.Yaml$1.next(Yaml.java:499)

I asked on the mailing list if an issue was raised for this one, and @Andrey Somov responded that he had fixed it. Since issues are now working again, he suggested that I raise an issue for it here.
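
To put rough numbers on Petr's report, here is a minimal back-of-the-envelope sketch comparing a counter that runs over the whole stream with one that is reset for each document. It is not SnakeYAML code: the class and method names are hypothetical, the 4096 code point document size and the count of 3072 documents are assumptions standing in for "a few kilobytes" and "about 12 MB" of ASCII text, and document boundary markers are ignored. Only the limit of 3145728 comes from the error message above.

```java
// Rough model of the report above; hypothetical names, not SnakeYAML code.
final class StreamLimitArithmetic {

    /**
     * Returns the 1-based index of the first document at which the counter
     * exceeds the limit, or -1 if every document fits.
     */
    static int firstFailingDocument(int docSize, int docCount, int limit,
                                    boolean resetPerDocument) {
        long counted = 0;
        for (int doc = 1; doc <= docCount; doc++) {
            if (resetPerDocument) {
                counted = 0; // proposed behaviour: start afresh for each document
            }
            counted += docSize;
            if (counted > limit) {
                return doc;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int limit = 3_145_728; // limit quoted in the error message
        int docSize = 4_096;   // ASSUMPTION: "a few kilobytes" taken as 4096 code points
        int docCount = 3_072;  // ASSUMPTION: 3072 * 4096 code points, roughly 12 MB of ASCII

        System.out.println("stream-wide counter fails at doc: "
                + firstFailingDocument(docSize, docCount, limit, false));
        System.out.println("per-document counter fails at doc: "
                + firstFailingDocument(docSize, docCount, limit, true));
    }
}
```

Under these assumptions the stream-wide counter fails at document 769 (768 documents of 4096 code points add up to exactly 3145728), while the per-document counter never fails (-1).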

Reproduction under SnakeYAML 2.0

I wrote a small program to illustrate the problem:

package org.example;

import org.yaml.snakeyaml.LoaderOptions;
import org.yaml.snakeyaml.Yaml;
import java.util.Iterator;

public class Main {
    private static void dumpAllDocs(String input, long codePointLimit) {
        System.out.println ("Loading all docs with codePointLimit of "+ codePointLimit);
        LoaderOptions loaderOpts = new LoaderOptions();
        loaderOpts.setCodePointLimit((int) codePointLimit);
        Yaml yaml = new Yaml(loaderOpts);

        Iterator<Object> docs = yaml.loadAll(input).iterator();

        for (int ndx = 1; ndx <= 3; ndx++) {
            try {
                Object doc = docs.next();
                System.out.println("doc " + ndx + " loaded: " + doc);
            } catch (Exception e) {
                System.out.println("doc " + ndx + " failed: " + e.getMessage());
                return;
            }
        }
    }

    public static void main(String[] args) {
        String doc1 = "document: this is document one\n";
        String doc2 = "document: this is document 2\n";
        String doc3 = "document: this is document three\n";
        String input = doc1 + "---\n" + doc2 + "---\n" + doc3;

        System.out.println ("doc1 size: " + doc1.codePoints().count());
        System.out.println ("doc2 size: " + doc2.codePoints().count());
        System.out.println ("doc3 size: " + doc3.codePoints().count());
        System.out.println ("input size:" + input.codePoints().count());

        System.out.println ("\nTest1. All should load, all docs are less than total input size.");
        dumpAllDocs(input, input.codePoints().count());

        System.out.println ("\nTest2. All should load, all docs are less than total input size - 1.");
        dumpAllDocs(input, input.codePoints().count() -1);

        System.out.println ("\nTest3. All should load, all docs are less or equal to doc3 size.");
        dumpAllDocs(input, doc3.codePoints().count());

        System.out.println ("\nTest4. Should fail to load at 3rd doc, it is longer than doc3 size -1.");
        dumpAllDocs(input, doc3.codePoints().count() - 1);
    }
}

When the code is run against SnakeYAML 2.0, it outputs:

doc1 size: 31
doc2 size: 29
doc3 size: 33
input size:101

Test1. All should load, all docs are less than total input size.
Loading all docs with codePointLimit of 101
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 loaded: {document=this is document three}

Test2. All should load, all docs are less than total input size - 1.
Loading all docs with codePointLimit of 100
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 failed: The incoming YAML document exceeds the limit: 100 code points.

Test3. All should load, all docs are less or equal to doc3 size.
Loading all docs with codePointLimit of 33
doc 1 loaded: {document=this is document one}
doc 2 failed: The incoming YAML document exceeds the limit: 33 code points.

Test4. Should fail to load at 3rd doc, it is longer than doc3 size -1.
Loading all docs with codePointLimit of 32
doc 1 loaded: {document=this is document one}
doc 2 failed: The incoming YAML document exceeds the limit: 32 code points.

Process finished with exit code 0

This reflects Petr’s report. Only Test1 delivered the expected results.

Retrying under SnakeYAML 2.1 (unreleased, in Git)

I did a local install of SnakeYAML 2.1 as of e46ff5a and reran to get the following output:

doc1 size: 31
doc2 size: 29
doc3 size: 33
input size:101

Test1. All should load, all docs are less than total input size.
Loading all docs with codePointLimit of 101
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 loaded: {document=this is document three}

Test2. All should load, all docs are less than total input size - 1.
Loading all docs with codePointLimit of 100
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 loaded: {document=this is document three}

Test3. All should load, all docs are less or equal to doc3 size.
Loading all docs with codePointLimit of 33
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 failed: The incoming YAML document exceeds the limit: 33 code points.

Test4. Should fail to load at 3rd doc, it is longer than doc3 size -1.
Loading all docs with codePointLimit of 32
doc 1 loaded: {document=this is document one}
doc 2 loaded: {document=this is document 2}
doc 3 failed: The incoming YAML document exceeds the limit: 32 code points.

Only Test3 seems off to me. Should it have successfully loaded doc3 as well?

Or maybe my Test is a bit off?
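
As a cross-check of the expectations in the test labels, the sketch below models a strict per-document limit on the same three-document input. It is a stand-alone model, not SnakeYAML code: `PerDocumentModel` and its methods are hypothetical names, the split on `---\n` is naive (it would also split on that sequence inside a scalar), and it assumes that marker lines count toward no document and that a document fails only when its size is strictly greater than the limit.

```java
import java.util.ArrayList;
import java.util.List;

// Stand-alone model of a strict per-document limit; hypothetical names.
final class PerDocumentModel {

    /** Code point size of each document, splitting naively on the "---\n" marker. */
    static List<Long> documentSizes(String input) {
        List<Long> sizes = new ArrayList<>();
        for (String doc : input.split("---\n")) {
            sizes.add(doc.codePoints().count());
        }
        return sizes;
    }

    /** 1-based index of the first document larger than the limit, or -1 if all fit. */
    static int firstFailingDocument(String input, long limit) {
        List<Long> sizes = documentSizes(input);
        for (int i = 0; i < sizes.size(); i++) {
            if (sizes.get(i) > limit) {
                return i + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // Same input as the reproduction program above.
        String input = "document: this is document one\n" + "---\n"
                + "document: this is document 2\n" + "---\n"
                + "document: this is document three\n";

        // The four limits used by Test1 to Test4.
        for (long limit : new long[] {101, 100, 33, 32}) {
            System.out.println("limit " + limit + ": first failing doc = "
                    + firstFailingDocument(input, limit));
        }
    }
}
```

Under these assumptions the model reports no failure (-1) for limits 101, 100 and 33, and a failure at doc 3 for limit 32, which is what the test labels expect. Whether SnakeYAML intends to draw the document boundary exactly this way is a separate question.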

Comments (11)

  1. Lee Read reporter

    @Andrey Somov it seems my ignorance of YAML comes in handy sometimes! 🙂

    Do you think it would be useful for the tests to include a document where codepoint size does not equal string length? (or maybe they do and I missed it).

    To show what I mean:

    String s = "\uD83D\uDD06";
    System.out.println ("The length of " + s 
        + " is " + s.length()
        + ", its codepoint count is " + s.codePoints().count());
    

    Outputs:

    The length of 🔆 is 2, its codepoint count is 1
    

  2. Andrey Somov

    @Lee Read I did not quite catch you. Feel free to create a test.
    The issue here is about something else - the document boundary marker is ignored.
    (sorry, I did not have time to fix it yet - I assume the priority is low, the exact number of chars to limit is not very important)

  3. Lee Read reporter

    @Andrey Somov I was suggesting that doc size limit tests might also ensure that SnakeYAML is testing against the code point size of the doc instead of the string length of the doc for setCodePointLimit.

    I just ran the following against the SnakeYAML current master HEAD and it seems to be working as expected here:

    package org.example;
    
    import org.yaml.snakeyaml.LoaderOptions;
    import org.yaml.snakeyaml.Yaml;
    
    public class Main {
        public static void main(String[] args) {
            String sdoc = "document: doc length <> code point size due to these chars 🎉🚀👀☀️\n";
            System.out.println ("doc length: " + sdoc.length());
            System.out.println ("doc codepoint size: " + sdoc.codePoints().count());
    
            LoaderOptions loaderOpts = new LoaderOptions();
            loaderOpts.setCodePointLimit(65);
            Yaml yaml = new Yaml(loaderOpts);
    
            Object doc = yaml.load(sdoc);
            System.out.println ("loaded: " + doc);
        }
    }
    

    Outputs:

    doc length: 68
    doc codepoint size: 65
    loaded: {document=doc length <> code point size due to these chars 🎉🚀👀☀️}
    

    I hope this makes sense. Feel free to ignore this if it seems unnecessary or is already covered by existing tests.
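
For readers unfamiliar with the distinction, here is a minimal sketch of how code points can be counted on a character stream so that a surrogate pair counts once. It only illustrates the idea behind checking a code point limit rather than a string length; it is not how SnakeYAML counts, and `CodePointCounter` and `countCodePoints` are hypothetical names.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustration only; hypothetical names, not SnakeYAML code.
final class CodePointCounter {

    /** Counts code points read from a char stream; a surrogate pair counts once. */
    static long countCodePoints(Reader reader) {
        long count = 0;
        boolean pendingHigh = false; // true when the previous char was a high surrogate
        try {
            int c;
            while ((c = reader.read()) != -1) {
                char ch = (char) c;
                if (pendingHigh && Character.isLowSurrogate(ch)) {
                    pendingHigh = false; // second half of a pair, already counted
                    continue;
                }
                count++;
                pendingHigh = Character.isHighSurrogate(ch);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "\uD83D\uDD06"; // the same character as in comment 1
        System.out.println(s.length() + " chars, "
                + countCodePoints(new StringReader(s)) + " code point(s)");
    }
}
```

For the single character from comment 1 this prints `2 chars, 1 code point(s)`, in line with `String.codePoints().count()`.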
