Issue #323 resolved

Is there any reason to not support "Miscellaneous Symbols and Pictographs" unicode characters?

Anonymous created an issue

http://www.fileformat.info/info/unicode/block/miscellaneous_symbols_and_pictographs/list.htm

StreamReader.checkPrintable throws a "special characters are not allowed" error during deserialization, but I do not see why these characters are considered non-printable. For example: 😏

Comments (14)

  1. Alain Béarez

    If I understand the Unicode terminology correctly, surrogate characters must come in pairs, and a pair references a supplementary code point between U+10000 and U+10FFFF. http://www.unicode.org/glossary/#surrogate_code_point

    From further reading of the Javadoc for the Character class, it appears that a Java char cannot directly represent supplementary characters, so Java uses surrogate pairs for them: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode

    To accept the characters in the range [#x10000-#x10FFFF] (32 bit) from the YAML specification, the code has to check whether a pair of chars whose first char is in the surrogate range forms a valid surrogate pair: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isSurrogatePair-char-char-

    This requires the parser to look one char ahead before rejecting a char as illegal. That may not be easy, depending on the specific implementation. The relevant code might be around these lines: https://bitbucket.org/asomov/snakeyaml/src/e18bb04c65e5a93f4f72b3c81142d0afb615549f/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java?at=default&fileviewer=file-view-default#StreamReader.java-63


    Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.

    Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)
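
The one-char lookahead described in this comment can be sketched in Java roughly as follows. This is only an illustration and not SnakeYAML's actual StreamReader code: the class name SurrogateAwareCheck and the method firstUnpairedSurrogate are hypothetical, and the sketch checks surrogate pairing only, not the other printable ranges.

```java
// Hypothetical helper for illustration; not part of SnakeYAML.
final class SurrogateAwareCheck {

    /**
     * Returns the index of the first surrogate char that is not part of a
     * valid high/low pair, or -1 if every surrogate is properly paired.
     */
    static int firstUnpairedSurrogate(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            char c = text.charAt(i);
            if (Character.isHighSurrogate(c)) {
                // Look one char ahead before rejecting the char as illegal.
                if (i + 1 < text.length()
                        && Character.isSurrogatePair(c, text.charAt(i + 1))) {
                    i += 2; // consume both halves of the pair
                    continue;
                }
                return i; // high surrogate without a following low surrogate
            }
            if (Character.isLowSurrogate(c)) {
                return i; // low surrogate without a preceding high surrogate
            }
            i++;
        }
        return -1;
    }

    public static void main(String[] args) {
        // U+1F60F encoded as the surrogate pair D83D DE0F.
        System.out.println(firstUnpairedSurrogate("a\uD83D\uDE0Fb")); // -1
        // A high surrogate followed by an ordinary letter.
        System.out.println(firstUnpairedSurrogate("a\uD83Db"));       // 1
    }
}
```

A reader that buffers its input can do this lookahead cheaply. A reader that consumes one char at a time would need to remember a pending high surrogate between calls.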

  2. Andrey Somov repo owner

    Well, I think you mix bytes and chars.

    • ASCII: 1 byte -> 1 char (character set - 128 combinations)
    • Windows1252: 1 byte -> 1 char (character set - 256 combinations)
    • UTF-8: 1-4 bytes -> 1 char (character set - about 1.1 million combinations)
    • UTF-16: 2 or 4 bytes -> 1 char (character set - about 1.1 million combinations)
    • UTF-32: 4 bytes -> 1 char (character set - about 1.1 million combinations)

    First you need to convert bytes to chars and then analyse the chars. You think that the error happens in the first step, but it happens in the second. The first step is implemented in UnicodeReader.java.
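
The two steps can be observed separately in plain Java. The sketch below is not taken from UnicodeReader.java, and the class and method names are made up for illustration. It encodes U+1F60F (the 😏 from the report) as UTF-8, decodes it again, and counts what each level sees.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class for illustration; not part of SnakeYAML.
final class DecodeSteps {

    static final int SMIRK = 0x1F60F; // code point of the emoji in the report

    /** Encodes a single code point as UTF-8 bytes. */
    static byte[] encodeUtf8(int codePoint) {
        return new String(Character.toChars(codePoint))
                .getBytes(StandardCharsets.UTF_8);
    }

    /** Step one: convert bytes to Java chars. */
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] bytes = encodeUtf8(SMIRK);
        String text = decodeUtf8(bytes);

        System.out.println("UTF-8 bytes: " + bytes.length);   // 4
        System.out.println("Java chars:  " + text.length());  // 2
        System.out.println("code points: "
                + text.codePointCount(0, text.length()));     // 1

        // Step two sees two chars, and both are surrogates.
        System.out.println(Character.isHighSurrogate(text.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(text.charAt(1)));  // true
    }
}
```

Decoding succeeds, so step one is not where the input is rejected. Any check that inspects the two resulting chars one at a time sees a surrogate each time.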

  3. Pawel Skierczynski

    I think you are mixing Unicode code points and Java characters. The YAML specification is all about code points: "All characters mentioned in this specification are Unicode code points". When you talk about Unicode code points, you are not yet talking about their representation (UTF-8, UTF-16, and so on).

    Now, "The allowed character range explicitly excludes the surrogate block" means that it disallows code points in that range, not Java characters in that range. And rightfully so, because surrogates are meaningless on their own. Each needs its second half, and together the two form a perfectly valid printable code point (in the range [#x10000-#x10FFFF]).
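
A check that works on code points rather than Java chars could look roughly like the sketch below. It is not the code from the pull request: the class YamlPrintable and its methods are hypothetical, and the ranges are written down from the printable-character production of the YAML specification as I understand it, so verify them against the spec before relying on them.

```java
// Hypothetical helper for illustration; not part of SnakeYAML.
final class YamlPrintable {

    /** Tests one Unicode code point against the YAML printable ranges. */
    static boolean isPrintable(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0x7E)
                || cp == 0x85
                || (cp >= 0xA0 && cp <= 0xD7FF)       // stops before the surrogate block
                || (cp >= 0xE000 && cp <= 0xFFFD)     // resumes after the surrogate block
                || (cp >= 0x10000 && cp <= 0x10FFFF); // supplementary code points
    }

    /**
     * Iterates code points, not chars. A valid surrogate pair is seen as one
     * supplementary code point; an unpaired surrogate is seen as itself and
     * falls into the excluded range.
     */
    static boolean allPrintable(String text) {
        return text.codePoints().allMatch(YamlPrintable::isPrintable);
    }

    public static void main(String[] args) {
        System.out.println(allPrintable("smile \uD83D\uDE0F")); // true
        System.out.println(allPrintable("broken \uD83D"));      // false
    }
}
```

With this shape, the surrogate block stays excluded exactly as the specification says, while well-formed pairs pass because they never reach the check as separate values.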

    I have a constructive proposal for how to fix this, which I am submitting as a pull request. The change is not small and I could have missed something, so let's discuss it together in the pull request.

  4. Andrey Somov repo owner

    Well, first of all, thank you very much for the time you spent. This looks really impressive. Since the tests have improved, I do not see any problem with accepting the change. There is one question, though: the change is backwards-incompatible, because some methods are renamed. Let us see what the other developers say.
