Is there any reason not to support the "Miscellaneous Symbols and Pictographs" Unicode characters?
http://www.fileformat.info/info/unicode/block/miscellaneous_symbols_and_pictographs/list.htm
StreamReader.checkPrintable throws a "special characters are not allowed" error during deserialization, but I do not see why these characters are considered non-printable. See 😏?
Comments (17)
-
-
To repeat it here: The YAML spec (http://yaml.org/spec/1.1/#id868524) explicitly excludes surrogates (https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates)
-
Account Deleted If I understand the Unicode terminology correctly, surrogates must come in pairs, and each pair references a supplementary code point between U+10000 and U+10FFFF. http://www.unicode.org/glossary/#surrogate_code_point
From further reading of the Javadoc for the Character class, it appears that a Java char cannot directly represent a supplementary character, so Java uses surrogate pairs instead: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode
To accept the characters in the range [#x10000-#x10FFFF] (the 32-bit range in the YAML specification), the code has to check whether a pair of chars, where the first char is in the surrogate range, is a valid surrogate pair: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isSurrogatePair-char-char-
This requires the parser to look one char ahead before rejecting the char as illegal. It may not be that easy, depending on the specific implementation. It might be around these lines: https://bitbucket.org/asomov/snakeyaml/src/e18bb04c65e5a93f4f72b3c81142d0afb615549f/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java?at=default&fileviewer=file-view-default#StreamReader.java-63
Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.
Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)
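The look-ahead described above could be sketched roughly as follows. This is only an illustration, not the StreamReader code: the class name SurrogateLookAhead and the method firstUnpairedSurrogate are hypothetical, and only the java.lang.Character calls are real API.

```java
public final class SurrogateLookAhead {

    /**
     * Returns the index of the first surrogate char that is not part of a
     * valid high/low pair, or -1 if every surrogate in the text is paired.
     * (Hypothetical helper, written for illustration only.)
     */
    public static int firstUnpairedSurrogate(CharSequence text) {
        int i = 0;
        while (i < text.length()) {
            char current = text.charAt(i);
            if (!Character.isSurrogate(current)) {
                i++; // ordinary BMP char, nothing to pair
                continue;
            }
            // Look one char ahead before rejecting the surrogate.
            boolean hasNext = i + 1 < text.length();
            if (hasNext && Character.isSurrogatePair(current, text.charAt(i + 1))) {
                i += 2; // valid pair: together they encode one supplementary code point
            } else {
                return i; // lone surrogate, or a low surrogate that comes first
            }
        }
        return -1;
    }
}
```

With this sketch, a string containing 😏 (the pair \uD83D \uDE0F) reports no unpaired surrogate, while a string containing only \uD83D reports index 0.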
-
Well, I think you mix bytes and chars.
- ASCII: 1 byte -> 1 char (character set - 128 combinations)
- Windows1252: 1 byte -> 1 char (character set - 256 combinations)
- UTF-8: 1-4 bytes -> 1 code point (character set - 1 114 112 code points)
- UTF-16: 2 or 4 bytes -> 1 code point (same character set; the 4-byte form is a surrogate pair, i.e. two Java chars)
- UTF-32: 4 bytes -> 1 code point (same character set)
First you need to convert bytes to chars and then to analyse chars. You think that the error happens in the first step but it happens in the second. The first step is implemented in UnicodeReader.java
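To make the distinction between bytes, Java chars and code points concrete, here is a small sketch that measures the same character at each level. The class name EncodingSizes and its helper methods are made up for this example; the character is built from its code point U+1F60F so the result does not depend on the source file encoding.

```java
import java.nio.charset.StandardCharsets;

public final class EncodingSizes {

    // U+1F60F (the smirking face from the issue description).
    public static final String SMIRK = new String(Character.toChars(0x1F60F));

    /** Number of bytes when the string is encoded as UTF-8. */
    public static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes when encoded as UTF-16 (big-endian, so no BOM is added). */
    public static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Number of Java chars, i.e. UTF-16 code units. */
    public static int javaChars(String s) {
        return s.length();
    }

    /** Number of Unicode code points. */
    public static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }
}
```

For SMIRK this gives 4 bytes in UTF-8, 4 bytes in UTF-16, 2 Java chars and 1 code point, so the decoded string already contains two surrogate chars before any printable check runs.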
-
I think you are mixing Unicode code points and Java characters. The YAML specification is all about code points: "All characters mentioned in this specification are Unicode code points". When talking about Unicode code points, you are not yet talking about their representation (such as UTF-8 or UTF-16).
Now "The allowed character range explicitly excludes the surrogate block" means it disallows code points in that range, not Java characters in that range. And rightfully so, because a surrogate is meaningless on its own. It needs its second half, and together the two form a perfectly valid printable code point (in the range [#x10000-#x10FFFF]).
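To illustrate the difference, a printable check written against code points rather than Java chars might look like the sketch below. It is not the code from the pull request: the class YamlPrintable and its methods are hypothetical, and the ranges are the c-printable production from the YAML 1.1 specification.

```java
public final class YamlPrintable {

    /** YAML 1.1 c-printable, applied to a Unicode code point (not a Java char). */
    public static boolean isPrintable(int codePoint) {
        return codePoint == 0x9 || codePoint == 0xA || codePoint == 0xD
                || (codePoint >= 0x20 && codePoint <= 0x7E)
                || codePoint == 0x85
                || (codePoint >= 0xA0 && codePoint <= 0xD7FF)
                || (codePoint >= 0xE000 && codePoint <= 0xFFFD)
                || (codePoint >= 0x10000 && codePoint <= 0x10FFFF);
    }

    /**
     * Checks a whole string code point by code point. String.codePoints()
     * joins valid surrogate pairs into one supplementary code point and
     * passes an unpaired surrogate through as its own value in
     * U+D800..U+DFFF, which isPrintable rejects.
     */
    public static boolean allPrintable(String text) {
        return text.codePoints().allMatch(YamlPrintable::isPrintable);
    }
}
```

Under this sketch, a string holding 😏 is accepted because the pair decodes to U+1F60F, while a string holding only one half of the pair is still rejected, which matches the reading of the specification given above.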
I have a constructive proposal for fixing this, which I am submitting as a pull request. The change is not small and I could have missed something, so let's discuss it together in the pull request.
-
Well, first of all, thank you very much for the time you spent. This looks really impressive. Since the tests have become better, I do not see any problem with taking the change. There is one question though: the change is backwards-incompatible, because some methods are renamed. Let us see what the other developers say.
-
Looks good to me. This kind of "backward incompatibility" is acceptable from my side.
-
- changed status to resolved
Merged in pskierczynski/snakeyaml/323 handle surrogate code points (pull request #12)
Fix issue 323: Handle surrogate codepoints
→ <<cset 9bb4f02c31e0>>
-
Fix issue 323: Handle surrogate codepoints
→ <<cset 2057e78650b9>>
-
Can we get a release? We just had this bug reported in JRuby. https://github.com/jruby/jruby/issues/4492
-
We are releasing... ;-)
-
Thanks Andrey!
-
I still only see 1.17 in maven central. I need 1.18. Did I misunderstand?
-
release 1.18 is out. It is already available in central.
-
Thanks again!
-
Account Deleted For reference, a write up on various things char, string and unicode/ utf8 related:
-
It was already reported earlier: https://code.google.com/p/snakeyaml/issues/detail?id=205
If you have a solution feel free to make a proposal.