13F8 is not mapped to 13F0 on JDK 9
On JDK 9, CaseFoldingTests fails at code point 13F8, because it apparently is not mapped to 13F0 (as required by CaseFolding-8.0.0.txt).
I assume that the explicit support for Unicode 8 in Java 9 causes this problem, because these characters possibly were simply not contained in Java 8.
Comments (15)
-
reporter -
reporter @sco0ter How to proceed?
-
repo owner I've spent some hours on this issue already, without a good solution.
The test seems to pass for the Cherokee code points if only toUpperCase() is applied to them. This is really weird, because for every other code point, case folding works with toUpper().toLower() (as the test shows).
Actually it feels like a bug in Java 9 then.
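To make the observation above reproducible, here is a minimal sketch comparing the chained call with a plain toUpperCase() for U+13F8. The class and method names are only illustrative and are not part of the project. On a JDK whose character data includes Unicode 8.0 or later, I would expect the chained result to stay at U+13F8 while the uppercase-only result is U+13F0; on Java 8 the code point is unassigned, so both should come back unchanged.

```java
import java.util.Locale;

public class CherokeeRoundTrip {
    // The chained approach discussed in this thread (hypothetical helper name).
    public static String upperThenLower(String s) {
        return s.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        String smallYe = "\u13F8"; // CHEROKEE SMALL LETTER YE, added in Unicode 8.0

        String chained = upperThenLower(smallYe);
        String upperOnly = smallYe.toUpperCase(Locale.ROOT);

        // CaseFolding-8.0.0.txt expects U+13F8 to fold to U+13F0.
        System.out.printf("chained:   U+%04X%n", chained.codePointAt(0));
        System.out.printf("upperOnly: U+%04X%n", upperOnly.codePointAt(0));
    }
}
```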
I've tried many things with the Character class to detect Cherokee code points, but with no success.
We could also simply create a static mapping (by parsing the CaseFolding.txt) and use that mapping for case folding.
At the moment I don't really care about Cherokee codepoints, so I'll leave this issue open.
-
reporter Well, if it is a bug in Java 9 I could report it and we just wait for a fix. On the other hand, I am not a PRECIS guru, so can we be sure that it IS a bug?
-
repo owner I am also not sure, how "case folding" really works.
When I implemented it, I've found this (among others): https://docs.atlassian.com/jira/7.1.6/com/atlassian/jira/util/CaseFolding.html (trick with toLower(toUpper())).
and it indeed did the trick (at least for Java 8).
I am not sure if it's a bug, but it's at least weird that these few new code points behave differently now.
-
reporter I looked at how String.toLowerCase is implemented in Java 9, and it is pretty clear that the solution is error-prone: It contains hard-coded special cases for particular code points. I assume that they forgot to add more special cases for the new Unicode 8 code points. I will check it a bit deeper over the next few days, but it might take a while, as I need to prepare EclipseCon first.
-
reporter As in my understanding it is clearly a bug in JDK 9 I just reported it to Oracle. Let's see what happens next.
-
reporter Meanwhile Oracle confirmed that it is a bug. While I am confident that they will fix it some day, I do not think it will be any time soon. So I developed a workaround: The test case simply skips the Cherokee Unicode block. Works pretty well.
-
reporter Meanwhile I got an answer from Oracle. There is actually no bug in the JRE, but it is definitely a bug in Precis!
Quote from the Unicode 8.0 specification (page 156)
...the addition of lowercase Cherokee letters as of Version 8.0 of the Unicode Standard, together with the stability guarantees for case folding, require that Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded string is not necessarily lowercase.
As a result, we have three choices to be able to compile on JDK 9 (or more precisely, on any JDK supporting Unicode 8.0).
1. Adopt my existing workaround, i.e. simply skip Cherokee symbols in the test case. This has the side effect that, at least for Cherokee symbols, the PrecisProfile.caseFold method returns an incorrect result.
2. Keep the test as it is, but skip Cherokee symbols in the implementation of PrecisProfile.caseFold. Same negative effect, and the implementation will slow down a bit, as we have to check for each single character whether it is Cherokee or not.
3. Implement a fully correct lookup in the mapping file, as required by the above quote from the spec. This certainly gives a perfect result, but will either eat a lot of CPU cycles (live lookup in the file) or eat up a lot of RAM (preload and cache the map in memory).
So the question is: What would you like me to do? For the sake of correctness and performance, I'd go with option 3 and preload the file into a map at class loading. But it is your project, so you have to decide what my PR shall do.
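To illustrate what option 3 could look like, here is a rough sketch, not the actual implementation. It assumes the published CaseFolding.txt line format ("code; status; mapping; # name") and keeps only the C and F entries, which together give full case folding. The class name CaseFoldingTable and its methods are hypothetical, and the sample lines are a tiny excerpt in that format rather than the real file.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;

public class CaseFoldingTable {
    // Code point -> folded replacement (may be several code points for status F).
    private final Map<Integer, String> mappings = new HashMap<>();

    // Parses lines of the form "<code>; <status>; <mapping>; # <name>".
    public static CaseFoldingTable parse(Reader source) throws IOException {
        CaseFoldingTable table = new CaseFoldingTable();
        BufferedReader reader = new BufferedReader(source);
        String line;
        while ((line = reader.readLine()) != null) {
            int hash = line.indexOf('#');
            if (hash >= 0) {
                line = line.substring(0, hash); // drop trailing comment
            }
            String[] fields = line.split(";");
            if (fields.length < 3) {
                continue; // blank or comment-only line
            }
            String status = fields[1].trim();
            // Keep C (common) and F (full); skip S (simple) and T (Turkic).
            if (!status.equals("C") && !status.equals("F")) {
                continue;
            }
            int codePoint = Integer.parseInt(fields[0].trim(), 16);
            StringBuilder target = new StringBuilder();
            for (String hex : fields[2].trim().split("\\s+")) {
                target.appendCodePoint(Integer.parseInt(hex, 16));
            }
            table.mappings.put(codePoint, target.toString());
        }
        return table;
    }

    // Folds code point by code point; unmapped code points pass through unchanged.
    public String fold(CharSequence input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = mappings.get(cp);
            if (mapped != null) {
                out.append(mapped);
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        String sample = String.join("\n",
                "# excerpt in CaseFolding.txt format",
                "0041; C; 0061; # LATIN CAPITAL LETTER A",
                "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S",
                "13F8; C; 13F0; # CHEROKEE SMALL LETTER YE");
        CaseFoldingTable table = parse(new StringReader(sample));
        System.out.println(table.fold("A\u00DF\u13F8").equals("ass\u13F0"));
    }
}
```

Because the lookup is keyed by code point, surrogate pairs need no special handling. The price is the one-time parse and the memory for the map, as mentioned above.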
-
repo owner Markus, thanks for caring and checking this issue with Oracle. Really weird though.
I'd also go with option 3.
What about option 4? Iterate over each character and check whether there is a Cherokee char in the String. If not, do toUpperCase().toLowerCase() on the String. If yes, manually assemble the case-folded string (treating Cherokee chars with toUpperCase() and every other char with toUpperCase().toLowerCase()). The downside is that we have to create a String for each character and also consider surrogate pairs, which can probably get tricky.
-
reporter Option 4 would work for now with Cherokee, but would possibly break again in the future once Unicode 9.0.0 adopts the next strange set of rules for another ancient language... So I will try to implement option 3 now. Shouldn't be too complex.
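For comparison only, option 4 could be sketched roughly as follows. The hard-coded ranges (Cherokee U+13A0..U+13FF and Cherokee Supplement U+AB70..U+ABBF) and the class name are assumptions made for illustration, and they are exactly the kind of special-casing that may break with a later Unicode version.

```java
import java.util.Locale;

public class CherokeeAwareFold {
    // Assumed ranges: Cherokee block and Cherokee Supplement block.
    public static boolean isCherokee(int cp) {
        return (cp >= 0x13A0 && cp <= 0x13FF) || (cp >= 0xAB70 && cp <= 0xABBF);
    }

    public static String fold(String input) {
        // Fast path: no Cherokee at all, so the chained call is used on the whole String.
        if (input.codePoints().noneMatch(CherokeeAwareFold::isCherokee)) {
            return input.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        }
        // Slow path: assemble the result code point by code point.
        // Iterating over code points (not chars) keeps surrogate pairs intact.
        // Note: converting one code point at a time loses context-sensitive
        // rules that String.toLowerCase applies to whole strings.
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String upper = new String(Character.toChars(cp)).toUpperCase(Locale.ROOT);
            out.append(isCherokee(cp) ? upper : upper.toLowerCase(Locale.ROOT));
        });
        return out.toString();
    }
}
```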
-
reporter Implemented option 3. Works pretty well. What is more, it now provides FULL case folding according to Unicode 8.0. Class loading is a bit slow now (100ms), but that should not be a problem in the real world. See https://bitbucket.org/mkarg/precis/commits/0f00d2ed2ea872ebc85d311fdf2fbdaa2ddb74c6.
I will send a pull request once I have finished my current work on the module-info.class tests in Babbler (intentionally keeping the fix unmerged until I am really sure that no other JDK 9 related precis fixes are needed).
-
repo owner -
repo owner - changed status to resolved
-
repo owner - changed status to closed
Looking at the implementation, I think it is clear why the test breaks. As various web sites point out, case folding cannot be done by simply chaining upper/lower operations. This works well for Latin, but not for some other scripts. With JDK 9, lowercase Cherokee letters were added, and they have exactly this problem. So the question is, what algorithm do we actually want for such a case?