Is there any reason not to support "Miscellaneous Symbols and Pictographs" Unicode characters?

Issue #323 resolved

Former user created an issue 2015-10-10

http://www.fileformat.info/info/unicode/block/miscellaneous_symbols_and_pictographs/list.htm

StreamReader.checkPrintable throws a "special characters are not allowed" error during deserialization. I am not sure why these characters are considered non-printable; see 😏, for example.
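For illustration, here is a minimal Java sketch of how the character above looks from inside a Java String. It does not use SnakeYAML; the class and method names (PictographChars, describe) are made up for this example. It only shows that 😏 (U+1F60F), like every character above U+FFFF, is stored as two chars that are both surrogates, which is presumably what a char-by-char printable check stumbles over.

```java
final class PictographChars {
    /** Lists each UTF-16 code unit (Java char) of s, flagging surrogates. */
    static String describe(String s) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (i > 0) {
                sb.append(' ');
            }
            sb.append(String.format("U+%04X", (int) c));
            if (Character.isSurrogate(c)) {
                sb.append("(surrogate)");
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Build the string from the code point to avoid source-encoding issues.
        String smirk = new String(Character.toChars(0x1F60F));
        System.out.println(smirk.length());                           // 2 chars
        System.out.println(smirk.codePointCount(0, smirk.length())); // 1 code point
        System.out.println(describe(smirk)); // U+D83D(surrogate) U+DE0F(surrogate)
    }
}
```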

Comments (17)

Andrey Somov
It was already reported earlier: https://code.google.com/p/snakeyaml/issues/detail?id=205

If you have a solution feel free to make a proposal.
- 2015-10-11T18:24:52+00:00
Andrey Somov
To repeat it here: The YAML spec (http://yaml.org/spec/1.1/#id868524) explicitly excludes surrogates (https://en.wikipedia.org/wiki/Universal_Character_Set_characters#Surrogates)
- 2015-10-13T08:49:24+00:00
Former user Account Deleted
If I understand the Unicode terminology correctly, surrogate code units must come in pairs, and a pair makes it possible to reference a supplementary code point between U+10000 and U+10FFFF. http://www.unicode.org/glossary/#surrogate_code_point

Reading the Javadoc for the Character class further, it appears that a Java char cannot directly represent supplementary characters, so Java uses surrogate pairs for them: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#unicode

To accept the characters in the range [#x10000-#x10FFFF] (32 bit) from the YAML specification, the code has to check whether a pair of chars whose first char is in the surrogate range forms a valid surrogate pair: https://docs.oracle.com/javase/8/docs/api/java/lang/Character.html#isSurrogatePair-char-char-

This requires the parser to look one char ahead before rejecting the char as illegal. It may not be that easy, depending on the specific implementation. It might be around these lines: https://bitbucket.org/asomov/snakeyaml/src/e18bb04c65e5a93f4f72b3c81142d0afb615549f/src/main/java/org/yaml/snakeyaml/reader/StreamReader.java?at=default&fileviewer=file-view-default#StreamReader.java-63

Surrogate Code Point. A Unicode code point in the range U+D800..U+DFFF. Reserved for use by UTF-16, where a pair of surrogate code units (a high surrogate followed by a low surrogate) “stand in” for a supplementary code point.

Surrogate Pair. A representation for a single abstract character that consists of a sequence of two 16-bit code units, where the first value of the pair is a high-surrogate code unit, and the second is a low-surrogate code unit. (See definition D75 in Section 3.8, Surrogates.)
- 2015-10-27T11:44:50+00:00
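The one-char look-ahead described in the comment above could be sketched roughly as follows. This is not SnakeYAML code; SurrogateLookahead and firstUnpairedSurrogate are hypothetical names, and the sketch only covers the pairing check, not the rest of the printable test.

```java
final class SurrogateLookahead {
    /**
     * Returns the index of the first surrogate char that is not part of a
     * valid high/low pair, or -1 if every surrogate in s is properly paired.
     */
    static int firstUnpairedSurrogate(CharSequence s) {
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            // Look one char ahead before deciding that c is illegal.
            if (i + 1 < s.length() && Character.isSurrogatePair(c, s.charAt(i + 1))) {
                i += 2; // valid pair: consume both code units
            } else if (Character.isSurrogate(c)) {
                return i; // lone or misordered surrogate
            } else {
                i++;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println(firstUnpairedSurrogate("a\uD83D\uDE0Fb")); // -1
        System.out.println(firstUnpairedSurrogate("a\uD83Db"));       // 1
    }
}
```

A reader that consumes input in fixed-size buffers would also have to handle a pair that is split across two buffers, which is one reason the change may not be trivial.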
Andrey Somov
Well, I think you mix bytes and chars.
- ASCII: 1 byte -> 1 char (character set - 128 combinations)
- Windows1252: 1 byte -> 1 char (character set - 256 combinations)
- UTF-8: 1-5 bytes -> 1 char (character set - 65 000 combinations ?)
- UTF-16: 2 bytes -> 1 char (character set - 65 000 combinations ?)
- UTF-32: 4 bytes -> 1 char (character set - 65 000 combinations ?)
First you need to convert bytes to chars, and then analyse the chars. You think that the error happens in the first step, but it happens in the second. The first step is implemented in UnicodeReader.java
- 2015-10-27T13:45:00+00:00
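As a side note on the byte counts listed above, the actual widths can be checked with the JDK alone. A small sketch follows (EncodingWidths and its methods are names invented here). UTF-8 as defined today uses 1 to 4 bytes per code point and UTF-16 uses 2 or 4; both can encode every Unicode code point up to U+10FFFF. The round trip at the end agrees with the point made in the comment: decoding bytes to chars works, so the rejection must happen later, when the chars are analysed.

```java
import java.nio.charset.StandardCharsets;

final class EncodingWidths {
    /** Number of bytes s occupies in UTF-8. */
    static int utf8Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    /** Number of bytes s occupies in UTF-16 (big-endian, no byte order mark). */
    static int utf16Bytes(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length;
    }

    /** Encodes s to UTF-8 and decodes it again. */
    static String roundTripUtf8(String s) {
        return new String(s.getBytes(StandardCharsets.UTF_8), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String smirk = "\uD83D\uDE0F"; // U+1F60F
        System.out.println(utf8Bytes("A") + " " + utf16Bytes("A"));           // 1 2
        System.out.println(utf8Bytes("\u20AC") + " " + utf16Bytes("\u20AC")); // 3 2
        System.out.println(utf8Bytes(smirk) + " " + utf16Bytes(smirk));       // 4 4
        System.out.println(roundTripUtf8(smirk).codePointAt(0) == 0x1F60F);   // true
    }
}
```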
Paweł Skierczynski
I think you are mixing up Unicode code points and Java characters. The YAML specification is all about code points: "All characters mentioned in this specification are Unicode code points". When talking about Unicode code points, you are not yet talking about their representation (such as UTF-8 or UTF-16).

Now, "The allowed character range explicitly excludes the surrogate block" means that it disallows code points in that range, not Java characters in that range. And rightfully so, because they are meaningless on their own. They need a second half, and together the two form a perfectly valid printable code point (in the range [#x10000-#x10FFFF]).

I have a constructive proposal for how to fix this, which I am submitting as a pull request. The change is not small and I could have missed something, so let's discuss it together in the pull request.
- 2016-09-01T16:35:21+00:00
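To make the code point view concrete, here is a rough sketch of the printable check applied to code points instead of chars. It is not the code from the pull request; YamlPrintable, isPrintable and allPrintable are names chosen for this example. The ranges follow the c-printable production of the YAML 1.1 specification.

```java
final class YamlPrintable {
    /** c-printable from the YAML 1.1 spec, applied to one Unicode code point. */
    static boolean isPrintable(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
                || (cp >= 0x20 && cp <= 0x7E)
                || cp == 0x85
                || (cp >= 0xA0 && cp <= 0xD7FF)
                || (cp >= 0xE000 && cp <= 0xFFFD)   // skips the surrogate block
                || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** True if every code point of s is printable in the above sense. */
    static boolean allPrintable(String s) {
        // codePoints() merges valid surrogate pairs into a single code point;
        // an unpaired surrogate is reported as itself and is rejected above.
        return s.codePoints().allMatch(YamlPrintable::isPrintable);
    }

    public static void main(String[] args) {
        System.out.println(allPrintable("smirk: \uD83D\uDE0F")); // true
        System.out.println(allPrintable("broken: \uD83D"));      // false
    }
}
```

Iterating with codePoints() keeps the surrogate handling inside the JDK, so the range check itself can stay a direct transcription of the specification.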
Andrey Somov
Well, first of all, thank you very much for the time you spent. This looks really impressive. Since the tests have become better, I see no problem with accepting the change. There is one question, though: the change is backwards-incompatible, because some methods are renamed. Let us see what the other developers say.
- 2016-09-02T13:56:35+00:00
Alexander Maslov
Looks good to me. These kinds of "backward incompatibilities" are acceptable from my side.
- 2016-09-05T06:44:40+00:00
Andrey Somov
- changed status to resolved
Merged in pskierczynski/snakeyaml/323 handle surrogate code points (pull request #12)

Fix issue 323: Handle surrogate codepoints

→ <<cset 9bb4f02c31e0>>
- 2016-09-05T09:52:10+00:00
Andrey Somov
Fix issue 323: Handle surrogate codepoints

→ <<cset 2057e78650b9>>
- 2016-09-05T09:52:10+00:00
Charles Nutter
Can we get a release? We just had this bug reported in JRuby. https://github.com/jruby/jruby/issues/4492
- 2017-02-16T19:43:11+00:00
Andrey Somov
We are releasing... ;-)
- 2017-02-17T08:32:51+00:00
Charles Nutter
Thanks Andrey!
- 2017-02-21T15:45:58+00:00
Charles Nutter
I still only see 1.17 in Maven Central. I need 1.18. Did I misunderstand?
- 2017-02-21T18:27:42+00:00
Andrey Somov
Release 1.18 is out. It is already available in Maven Central.
- 2017-02-22T14:58:04+00:00
Charles Nutter
Thanks again!
- 2017-02-23T15:27:14+00:00
Former user Account Deleted
For reference, a write-up on various char, String, and Unicode/UTF-8 related topics:

https://zenaan.github.io/zen/javadoc/zen/lang/string.html
- 2017-04-04T02:25:34+00:00
Assignee: Andrey Somov

Type: proposal

Priority: critical

Status: resolved

Votes: 1

Watchers: 2