Potentially problematic regular expression `INT` for matching integer literals

Issue #537 resolved
Petr Pucil created an issue

There might be some issue with this regular expression (apparently introduced in commit https://bitbucket.org/snakeyaml/snakeyaml/commits/b9a36efa392dc728fcf99589311fd7c0898680cd and released in version 1.30) - it may potentially be less efficient than it should: https://bitbucket.org/snakeyaml/snakeyaml/annotate/49227c24d741dafe3b519a0d5f9d8413d803d9f3/src/main/java/org/yaml/snakeyaml/resolver/Resolver.java?at=snakeyaml-1.30#Resolver.java-48:54

I used System.out.println() to get the entire regex:

System.out.println("^(?:" +
    "[-+]?0b_*[0-1]+[0-1_]*" + // (base 2)
    "|[-+]?0_*[0-7]+[0-7_]*" + // (base 8)
    "|[-+]?(?:0|[1-9][0-9_]*)" + // (base 10)
    "|[-+]?0x_*[0-9a-fA-F]+[0-9a-fA-F_]*" + // (base 16)
    "|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+" + // (base 60)
    ")$");

Output:

^(?:[-+]?0b_*[0-1]+[0-1_]*|[-+]?0_*[0-7]+[0-7_]*|[-+]?(?:0|[1-9][0-9_]*)|[-+]?0x_*[0-9a-fA-F]+[0-9a-fA-F_]*|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$

Then I gave it to the tool https://github.com/NicolaasWeideman/RegexStaticAnalysis, and it wasn’t very happy about it (the full output.txt is attached):

$ echo '^(?:[-+]?0b_*[0-1]+[0-1_]*|[-+]?0_*[0-7]+[0-7_]*|[-+]?(?:0|[1-9][0-9_]*)|[-+]?0x_*[0-9a-fA-F]+[0-9a-fA-F_]*|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$' | ./run.sh &> output.txt

The conclusion of the RegexStaticAnalysis tool (see the end of output.txt) about the regex is as follows:

NFA constructed in: 53ms
EDA analysis performed in: 74ms
Does not contain EDA
IDA analysis performed in: 1924ms
Contains IDA, degree 1, with: **TIMEOUT**
    Did not construct IDA exploit string
Total analysis time: 2051

According to https://github.com/NicolaasWeideman/RegexStaticAnalysis#motivation, IDA stands for “infinite degree of amgibuity”.

I’ll rather use SnakeYAML 1.29 until this is sorted out.

Comments (10)

  1. Andrey Somov

    This is exactly the reason not to use it SnakeYAML Engine.
    what is your proposal ? How to implement the implicit integers and stay compliant with the spec 1.1 ?

  2. Petr Pucil reporter

    what is your proposal ? How to implement the implicit integers and stay compliant with the spec 1.1 ?

    I think the idea of that regex is fine, it’s just that it’s written in such an ambiguous way which allows the regex engine to match an input string in multiple ways (i.e. one character can be captured by one group but also by another group), and that’s never a good thing because regex engines may often choke on longer inputs with such regex.

    If you want me to fix it, sure, I can take a look.

  3. Petr Pucil reporter

    It’s quite trivial to fix - apparently it’s just about the extra pluses + that have absolutely no reason to be there and only add the problematic ambiguity to the regular expression. Here’s the fixed regex:

    System.out.println("^(?:" +
        "[-+]?0b_*[0-1][0-1_]*" + // (base 2)
        "|[-+]?0_*[0-7][0-7_]*" + // (base 8)
        "|[-+]?(?:0|[1-9][0-9_]*)" + // (base 10)
        "|[-+]?0x_*[0-9a-fA-F][0-9a-fA-F_]*" + // (base 16)
        "|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+" + // (base 60)
        ")$");
    

    Note that I removed three redundant + quantifiers - see the comparison with the existing version at https://editor.mergely.com/:

    The mentioned tool RegexStaticAnalysis marks this new regex as safe, which is a good thing:

    Enter a regular expression to analyze:
    1. pattern = "^(?:[-+]?0b_*[0-1][0-1_]*|[-+]?0_*[0-7][0-7_]*|[-+]?(?:0|[1-9][0-9_]*)|[-+]?0x_*[0-9a-fA-F][0-9a-fA-F_]*|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+)$"
    NFA constructed in: 53ms
    EDA analysis performed in: 64ms
    Does not contain EDA
    IDA analysis performed in: 553ms
    Does not contain IDA
    Total analysis time: 670
    Enter a regular expression to analyze:
    Analysed:       1/1
            Safe:           1/1
            Vulnerable:     0/1
                    EDA:            0/1
                    IDA:            0/1
            Vulnerable EDA: []
            Vulnerable IDA: []
    Skipped:        0/1
    Timeout:        0/1
                    EDA:    0/1
                    IDA:    0/1
    Total running time: 672
    

    However, you can check for yourself that if you leave even one of the 3 unnecessary pluses there, it will not be declared safe.

  4. Petr Pucil reporter

    Is it solved ? can we close the issue ?

    @Andrey Somov Yes, thank you for dealing with this so quickly.

  5. Log in to comment