Regression in handling number-like strings with leading zeros

Issue #550 invalid
Michael Glaesemann created an issue

It looks like there was a regression between release 1.29 and 1.30 with respect to handling number-like strings with leading zeros.

% cat NumberLikeString.java                                                                                
package com.example;

import org.yaml.snakeyaml.Yaml;

class NumberLikeString {
    public static void main(String[] args) {
        String data = args[0];
        Yaml yaml = new Yaml();
        String output = yaml.dump(data);
        System.out.print(output);
    }
}
% java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.29/snakeyaml-1.29.jar NumberLikeString.java 083
'083'
% java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.30/snakeyaml-1.30.jar NumberLikeString.java 083
083
% java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.29/snakeyaml-1.29.jar NumberLikeString.java 123
'123'
% java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.30/snakeyaml-1.30.jar NumberLikeString.java 123
'123'

% java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.32/snakeyaml-1.32.jar NumberLikeString.java 083
083

The string is quoted in 1.29 (as expected), but not quoted in 1.30. The bug is present in 1.32 (the current release) as well.

Comments (24)

  1. Michiel Borkent

    I have tested this by converting EDN input (if you squint, it is JSON):

    {:phone-number-prefixes ["045" "083" "083567" "064727"]}
    

    Converting this using clj-yaml which now uses v1.31 of snakeyaml:

    $ clj -Sdeps '{:deps {clj-commons/clj-yaml {:mvn/version "0.7.110"} org.yaml/snakeyaml {:mvn/version "1.31"}}}' -M -e "(require '[clj-yaml.core :as yaml]) (print (yaml/generate-string (clojure.edn/read-string (slurp \"/tmp/test.edn\"))))"
    

    Output:

    phone-number-prefixes: ['045', 083, 083567, '064727']
    

    Some phone number prefixes are preserved as strings, as they should be. But the ones starting with 08 are converted to numbers. This doesn’t make sense, no matter what spec you come up with.

  2. Andrey Somov

    Please be careful when you say “it does not make sense”. Do you mean you do not care about the spec if it does not meet your expectations?

    You can very easily solve your issue by providing your own Resolver. You can find many examples in the tests.

  3. Michiel Borkent

    Fair enough. I should have said: I really have trouble seeing when one would want this kind of behavior… but more severely: this is a breaking change in the downstream clj-yaml library and I think that library should have this custom resolver by default to not cause more breaking changes. Breaking changes are taken very seriously in the Clojure ecosystem.

  4. Michael Glaesemann reporter

    Andrey, I’m having a hard time reading the spec (https://yaml.org/spec/1.1/), and I know you have a lot more experience here than I do. Would you share the portions of the YAML 1.1 spec that specify the behavior of presenting/emitting a Java number-like string (e.g., "083") without quotes? Or that otherwise support the change in behavior from 1.29 to 1.30, where the Java string "083" is dumped as '083' in 1.29 and 083 in 1.30?

  5. Michael Glaesemann reporter

    Let me provide a bit more context. Yaml is often (and in my personal use cases, nearly always) used as an interchange format, so behavior across Yaml processors is important. I see that snakeyaml does appear to be self-consistent, round-tripping the Java string "083" as a string.

    % cat NumberLikeString.java 
    package com.example;
    
    import org.yaml.snakeyaml.Yaml;
    
    class NumberLikeString {
        public static void main(String[] args) {
            Yaml yaml = new Yaml();
            String data = args[0];
            String output = yaml.dump(data);
            System.out.print(output);
            try {
                String reReadString = (String) yaml.load(output);
                System.out.println("re-read as string");
            } catch (Throwable t) {
                System.out.format("failed to re-read as string: %s: %s", t.getClass().getSimpleName(), t.getMessage());
            }
            try {
                Integer reReadNumber = (Integer) yaml.load(output);
                System.out.println("re-read as Integer");
            } catch (Throwable t) {
                System.out.format("failed to re-read as Integer: %s: %s", t.getClass().getSimpleName(), t.getMessage());
            }
        }
    }
    
    % java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.29/snakeyaml-1.29.jar NumberLikeString.java 083
    '083'
    re-read as string
    failed to re-read as Integer: ClassCastException: class java.lang.String cannot be cast to class java.lang.Integer (java.lang.String and java.lang.Integer are in module java.base of loader 'bootstrap')
    
    % java -classpath $HOME/.m2/repository/org/yaml/snakeyaml/1.32/snakeyaml-1.32.jar NumberLikeString.java 083
    083
    re-read as string
    failed to re-read as Integer: ClassCastException: class java.lang.String cannot be cast to class java.lang.Integer (java.lang.String and java.lang.Integer are in module java.base of loader 'bootstrap')
    

    When interacting with other libraries, I’m seeing those unquoted values read as numbers rather than strings.

  6. Michiel Borkent

    If one wants to write “id: …” in YAML and the id can be any combination of digits as a string, including a leading 0, e.g. id: 012345, how does one represent that in YAML 1.1 without converting it to an octal?

  7. Michael Glaesemann reporter

    I see a number of different points being discussed here:

    1. whether snakeyaml conforms to the YAML 1.1 spec
    2. whether YAML produced by snakeyaml round-trips through snakeyaml (whether snakeyaml can consume what snakeyaml produces with fidelity)
    3. what snakeyaml parses. For example, how snakeyaml interprets '083', '073', 083, and 073.
    4. what snakeyaml emits. For example, how snakeyaml dumps the string values "083" and "073".

    There may be other points being discussed that I’ve missed in my enumeration. As these are all closely related, it's understandable how they might be discussed together. I myself have added information about a number of these different points, and I think that has both confused the actual issue at hand and helped clarify it for me.

    To be clear, I want snakeyaml to conform to the YAML 1.1 spec (point 1), and I want snakeyaml to be able to round-trip data (point 2). For the purposes of this issue I've reported, I'm not concerned with what snakeyaml parses (point 3). What I am concerned with is what snakeyaml emits, how that has changed, and how snakeyaml output is consumed by other yaml processors.

    The issue is that snakeyaml changed its behavior between 1.29 and 1.30. The string "083" is emitted as '083' in 1.29 and as 083 in 1.30.

    As far as I can tell, both of these behaviors conform with the YAML 1.1 spec. Because 083 cannot be a valid octal value (it includes the digit 8), it can only be parsed as a string, so there's no ambiguity there.
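
    As a rough illustration of the point above, the integer forms listed in the YAML 1.1 int type description (tag:yaml.org,2002:int) can be written as one regular expression and checked against the values from this thread. This is only a sketch, not snakeyaml's resolver: the class name Yaml11IntCheck and the method looksLikeInt are made up for the example, and only the int forms are covered (floats, booleans, and other implicit types are ignored).

    ```java
    import java.util.regex.Pattern;

    class Yaml11IntCheck {
        // Integer forms as listed in the YAML 1.1 int type description.
        static final Pattern YAML11_INT = Pattern.compile(
            "[-+]?0b[0-1_]+"                          // base 2
            + "|[-+]?0[0-7_]+"                        // base 8 (leading 0, digits 0-7 only)
            + "|[-+]?(?:0|[1-9][0-9_]*)"              // base 10
            + "|[-+]?0x[0-9a-fA-F_]+"                 // base 16
            + "|[-+]?[1-9][0-9_]*(?::[0-5]?[0-9])+"); // base 60

        // True if the whole plain scalar matches one of the int forms.
        static boolean looksLikeInt(String scalar) {
            return YAML11_INT.matcher(scalar).matches();
        }

        public static void main(String[] args) {
            for (String s : new String[] {"045", "073", "083", "083567", "064727", "123"}) {
                System.out.println(s + " -> " + (looksLikeInt(s) ? "int" : "not an int"));
            }
        }
    }
    ```

    With these patterns, 045, 073 and 064727 match the octal form while 083 and 083567 match none of the int forms, which lines up with the quoting seen in the outputs earlier in this thread.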

    I believe '083' is also valid. I don't see anything in the spec that requires omitting quotes. Please do correct me if I'm wrong, pointing out where the YAML 1.1 spec requires absolute parsimony with respect to quotes.

    That said, it's often desirable behavior to omit unnecessary quotes, particularly when YAML is generated and consumed by humans. It's nice not having files littered with quote punctuation when it's unambiguous that the value is a string.

    In the absence of a dictum that quotes must be omitted (and let me be clear, if the YAML 1.1 specification requires omitting unrequired quotes, please do disregard all of this, as I believe specification adherence is your current highest value), I can think of three reasons why it's desirable to maintain the snakeyaml 1.29 behavior of quoting numeric strings going forward.

    1. From a human readability perspective, it's not immediately obvious why 073 must be quoted and 083 is not. Requiring this goes against the principle of least astonishment, which is a desirable trait for usability.
    2. While snakeyaml is a YAML 1.1 spec-compliant YAML processor, this change in behavior breaks its output when it is consumed by YAML 1.2 processors. In YAML 1.2, numeric strings with leading 0s are not considered octal; the leading 0 is just dropped and the value is interpreted as a base-10 number. While one might argue "use a YAML 1.2 processor to serialize your data if you're using a YAML 1.2 processor to consume it", unnecessarily breaking upwards compatibility is undesirable. And, as far as I can find, there's no mechanism for specifying the spec version for a given YAML file. How can one know which spec version an arbitrary YAML file was written for when you don't know its provenance? Edit: Reading more, I believe directives are the method for specifying version: https://yaml.org/spec/1.1/#YAML directive/
    3. It changes (and breaks) the behavior of the library in a way that existing users have been using successfully. This is arguably an extension of reason 2, though as a matter of principle, unless it's in service of a bug fix, avoiding behavior that breaks existing users is valuable in and of itself.
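
    To make reason 2 concrete, here is a minimal sketch of how a consumer that applies the decimal integer form of the YAML 1.2 core schema ([-+]? [0-9]+) might treat an unquoted scalar. Yaml12CoreInt and resolve are hypothetical names used only for this example; real processors also handle octal (0o...), hexadecimal, floats, and other types, all of which are left out here.

    ```java
    import java.util.regex.Pattern;

    class Yaml12CoreInt {
        // Decimal integer form of the YAML 1.2 core schema: optional sign, then digits.
        static final Pattern DECIMAL = Pattern.compile("[-+]?[0-9]+");

        // Returns a Long when the plain scalar looks like a decimal int,
        // otherwise returns the scalar unchanged as a String.
        // Sketch only: very long digit strings would overflow long.
        static Object resolve(String plainScalar) {
            if (DECIMAL.matcher(plainScalar).matches()) {
                return Long.parseLong(plainScalar); // leading zeros are dropped here
            }
            return plainScalar;
        }

        public static void main(String[] args) {
            for (String s : new String[] {"083", "0123456789", "08a"}) {
                Object value = resolve(s);
                System.out.println(s + " -> " + value + " (" + value.getClass().getSimpleName() + ")");
            }
        }
    }
    ```

    Under these assumptions an unquoted 083 comes back as the number 83, so the leading zero is lost unless the producer quotes the value.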

    From a library consumer perspective, the behavior change from 1.29 to 1.30 breaks my existing use-cases, and from what I can see, unnecessarily. From a library producer perspective, I believe you value the experience of those who use your software. As a library producer myself, I know I value the experience of my users, and when I'm updating my libraries, I want to take into account how those changes will affect those users. If it's having negative effects, I want to make sure those changes are well balanced against those negative effects.

    I do want to thank you for continuing to maintain the snakeyaml library. I wouldn't be spending time with these comments if I didn't find snakeyaml useful and didn't care about the project itself. Thanks for taking the time to read and consider these comments.

  8. Andrey Somov

    @Michael Glaesemann to avoid any misunderstanding - if you think that something should be now changed in SnakeYAML, feel free to create a test or a PR.
    Otherwise it will stay as it is. In the meantime, you can evaluate SnakeYAML Engine, which only converts the types supported in JSON.

  9. vijaya kumar

    I have exactly the same problem. An AWS account ID (a string with quotes), ‘0123456789’, is dumped into the YAML file as 0123456789 (without quotes, as if it were an integer). A tool then reads the AWS account ID as 123456789, omitting the 0, and fails.

    I think @Michael Glaesemann is right. This is BWC, @Andrey Somov, as far as snakeYaml adopters are concerned. At the least, this should be considered a valid issue and fixed rather than closed as invalid.

  10. vijaya kumar

    @Michael Michael @Andrey Somov, regarding your comment “feel free to create a test or a PR”:

    There should be an accepted issue first, rather than one closed as invalid, for someone to work on it, right?

    First of all, this should have been mentioned as BWC for adopters, with an example custom Resolver provided to overcome it; the majority of adopters are not Java developers.

  11. Michael Glaesemann reporter

    @vijaya kumar : While you and I (and others) may have opinions regarding the severity of this issue, or whether it's even an issue at all, we are not maintainers of the project.

    It's up to the maintainers to decide for themselves what changes should be made to snakeyaml. It's their project. It behooves us as consumers of snakeyaml to remember that, and to understand the project from their perspective. Hopefully sharing our experiences is well received by the maintainers. Telling them what they should or should not do is not productive.

  12. Andrey Somov

    Kumar: feel free to create a PR together with an issue. But not an issue alone - we already have enough.

    Contributions are welcome.

    I have no idea what “BWC” is.

  13. Lee Read

    Not sure if this is interesting to anyone, but thought I would share my understanding of why the change occurred:

    Snakeyaml 1.29 had a bug where it would resolve 083 as a float. To avoid string 083 being interpreted as what it thought was a float, snakeyaml would wrap it in single quotes when emitting.

    Snakeyaml 1.30 corrected this bug; it now correctly sees 083 as a string. Snakeyaml will only emit strings wrapped in quotes when necessary, for example when the string starts with a leading space. This is not the case for 083, so it is emitted as is.
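
    The explanation above can be sketched as a "quote only when the plain form would be ambiguous" rule. This is not snakeyaml's code: the class QuoteDecision, the method emit, and all three patterns are hypothetical stand-ins chosen for illustration, with LOOSE_FLOAT playing the role of a float check that also accepts digits without a dot and STRICT_FLOAT one that requires the dot.

    ```java
    import java.util.regex.Pattern;

    class QuoteDecision {
        // Hypothetical patterns for illustration only; not copied from snakeyaml.
        static final Pattern INT = Pattern.compile("[-+]?(?:0[0-7_]+|0|[1-9][0-9_]*)");
        static final Pattern LOOSE_FLOAT = Pattern.compile("[-+]?[0-9][0-9_]*(?:\\.[0-9_]*)?"); // dot optional
        static final Pattern STRICT_FLOAT = Pattern.compile("[-+]?[0-9][0-9_]*\\.[0-9_]*");     // dot required

        // Emits a string value, adding single quotes only when the plain form
        // would read back as a number or has a leading space.
        // Sketch only: quotes inside the value are not escaped.
        static String emit(String value, Pattern floatPattern) {
            boolean ambiguous = value.isEmpty()
                    || value.startsWith(" ")
                    || INT.matcher(value).matches()
                    || floatPattern.matcher(value).matches();
            return ambiguous ? "'" + value + "'" : value;
        }

        public static void main(String[] args) {
            System.out.println(emit("083", LOOSE_FLOAT));  // quoted, as reported for 1.29
            System.out.println(emit("083", STRICT_FLOAT)); // plain, as reported for 1.30
            System.out.println(emit("123", STRICT_FLOAT)); // quoted in both cases
        }
    }
    ```

    Swapping only the float check changes the result for 083 while 123 and 073 stay quoted, which is the difference reported between the two releases.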
