13F8 is not mapped to 13F0 on JDK 9

Issue #1 closed

Markus KARG created an issue 2017-10-16

On JDK 9, CaseFoldingTests fails at code point 13F8, as it apparently is not mapped to 13F0 (as required by CaseFolding-8.0.0.txt).

I assume that the explicit support for Unicode 8 in Java 9 causes this problem, because those characters probably simply did not exist in Java 8.

Comments (15)

Markus KARG reporter
Looking at the implementation, I think it is clear why the test breaks. As various web sites point out, case folding cannot be done by simply chaining upper/lower operations. This works well for Latin, but not for some other scripts. With JDK 9, Cherokee was added, which has exactly this problem. So the question is, what algorithm do we actually want for such a case?
- 2017-10-17T20:44:12+00:00
Markus KARG reporter
@sco0ter How to proceed?
- 2017-10-18T20:33:27+00:00
Christian Schudt repo owner
I've spent some hours on this issue already, without a good solution.

The test seems to pass for the Cherokee code points if only toUpperCase() is applied to them. This is really weird, because for every other code point, case folding works with toUpper().toLower() (as the test shows).
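
To make the observation concrete, here is a minimal sketch (not the project's code; the class and method names are made up for illustration) that applies the upper-then-lower trick to a single code point. On a JDK with Unicode 8 data, the round trip for U+13F8 should end at the lowercase letter again instead of at U+13F0, which is what CaseFolding.txt expects.

```java
final class UpperLowerDemo {
    // The toLowerCase(toUpperCase(cp)) trick from this thread, for one code point.
    // Hypothetical helper, not part of the precis library.
    static int upperThenLower(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    public static void main(String[] args) {
        int smallYe = 0x13F8; // CHEROKEE SMALL LETTER YE (assigned since Unicode 8.0)
        // The printed values depend on the Unicode version of the running JDK.
        System.out.printf("upper only:       U+%04X%n", Character.toUpperCase(smallYe));
        System.out.printf("upper then lower: U+%04X%n", upperThenLower(smallYe));
        System.out.printf("Latin 'A':        U+%04X%n", upperThenLower('A'));
    }
}
```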

Actually it feels like a bug in Java 9 then.

I've tried many things with the Character class to detect Cherokee code points, but with no success.

We could also simply create a static mapping (by parsing the CaseFolding.txt) and use that mapping for case folding.

At the moment I don't really care about Cherokee codepoints, so I'll leave this issue open.
- 2017-10-18T20:57:02+00:00
Markus KARG reporter
Well, if it is a bug in Java 9 I could report it and we just wait for a fix. On the other hand, I am not a PRECIS guru, so can we be sure that it IS a bug?
- 2017-10-18T21:37:11+00:00
Christian Schudt repo owner
I am also not sure how "case folding" really works.

When I implemented it, I've found this (among others): https://docs.atlassian.com/jira/7.1.6/com/atlassian/jira/util/CaseFolding.html (trick with toLower(toUpper())).

and it indeed did the trick (at least for Java 8).

I am not sure if it's a bug, but it's at least weird that these few new code points behave differently now.
- 2017-10-18T21:53:56+00:00
Markus KARG reporter
I looked at how String.toLowerCase is implemented in Java 9, and it is pretty clear that the solution is error-prone: it contains hard-coded special cases for particular code points. I assume that they forgot to add more special cases for the new Unicode 8 code points. I will check it a bit deeper in the next few days, but it might take a while, as I need to prepare for EclipseCon first.
- 2017-10-18T21:56:47+00:00
Markus KARG reporter
Since, to my understanding, it is clearly a bug in JDK 9, I just reported it to Oracle. Let's see what happens next.
- 2017-10-19T21:24:24+00:00
Markus KARG reporter
Meanwhile Oracle confirmed that it is a bug. While I am confident that they will fix it some day, I do not think it will be any time soon. So I developed a workaround: the test case simply skips the Cherokee Unicode block. Works pretty well.
- 2017-11-04T19:18:38+00:00
Markus KARG reporter
Meanwhile I got an answer from Oracle. There is actually no bug in the JRE, but it is definitely a bug in Precis!

Quote from the Unicode 8.0 specification (page 156)
```
...the addition of lowercase Cherokee letters as of Version 8.0 of the
Unicode Standard, together with the stability guarantees for case folding, require that
Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded
string is not necessarily lowercase.
```
As a result, we have three choices to be able to compile on JDK 9 (or more precisely, on any JDK supporting Unicode 8.0):
- Adopt my existing workaround, i.e. simply skip Cherokee symbols in the test case. This has the side effect that, at least for Cherokee symbols, the PrecisProfile.caseFold method returns an incorrect result.
- Keep the test as it is, but skip Cherokee symbols in the implementation of PrecisProfile.caseFold. Same negative effect, and the implementation will slow down a bit, as we have to check for each character whether it is Cherokee or not.
- Implement a fully correct lookup in the mapping file, as required by the above quote from the spec. This certainly gives a perfect result, but will either eat a lot of CPU cycles (live lookup in the file) or a lot of RAM (preload and cache the map in memory).
So the question is: what would you like me to do? For the sake of correctness and performance, I'd go with option 3 and preload the file into a map at class loading. But it is your project; you have to decide what my PR shall do.
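
For illustration, option 3 could look roughly like the following sketch. It is not the code from the later commit; the class name CaseFoldingTable and its methods are hypothetical. It assumes the usual CaseFolding.txt line layout (`code; status; mapping; # name`) and keeps only the C (common) and F (full) entries, which is one way to get full case folding.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of option 3: preload CaseFolding.txt into a map.
final class CaseFoldingTable {
    // Source code point -> folded replacement (one or more code points).
    private final Map<Integer, String> mappings = new HashMap<>();

    // Expects lines such as "13F8; C; 13F0; # CHEROKEE SMALL LETTER YE".
    CaseFoldingTable(List<String> lines) {
        for (String line : lines) {
            int hash = line.indexOf('#');
            String data = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (data.isEmpty()) continue;            // comment or blank line
            String[] fields = data.split(";");
            if (fields.length < 3) continue;         // not a mapping line
            String status = fields[1].trim();
            // Keep common and full mappings; skip simple (S) and Turkic (T).
            if (!status.equals("C") && !status.equals("F")) continue;
            int source = Integer.parseInt(fields[0].trim(), 16);
            StringBuilder target = new StringBuilder();
            for (String hex : fields[2].trim().split("\\s+")) {
                target.appendCodePoint(Integer.parseInt(hex, 16));
            }
            mappings.put(source, target.toString());
        }
    }

    // Folds by code point, so surrogate pairs are handled as one unit.
    String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            String mapped = mappings.get(cp);
            if (mapped != null) out.append(mapped); else out.appendCodePoint(cp);
        });
        return out.toString();
    }
}
```

The lines could come from something like `Files.readAllLines(...)` on the bundled mapping file; building the table once (for example in a static initializer) trades memory for lookup speed, as described above.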
- 2017-11-30T18:50:30+00:00
Christian Schudt repo owner
Markus, thanks for caring and checking this issue with Oracle. Really weird though.

I'd also go with option 3.

What about option 4? Iterate over each character and check whether there's a Cherokee char in the String. If not, do toUpperCase().toLowerCase() on the String. If yes, manually assemble the case-folded string (treating Cherokee chars with toUpperCase() and every other char with toUpperCase().toLowerCase()). The downside is that we have to create a String for each Character and also consider surrogate pairs, which can probably get tricky.
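
A rough sketch of what option 4 might look like, under two assumptions that are not from this thread: Cherokee is detected via Character.UnicodeScript (whether that is reliable on every JDK has not been verified here), and the per-code-point Character methods are used instead of the String methods. The latter avoids creating a String per character and sidesteps surrogate pairs, but it only performs simple one-to-one mappings, so it is not equivalent to String.toUpperCase().toLowerCase() for characters with multi-character mappings. The class name is hypothetical.

```java
// Hypothetical sketch of option 4: special-case Cherokee while folding.
final class CherokeeAwareFolding {
    // Assumption: script detection is good enough to identify Cherokee letters.
    static boolean isCherokee(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.CHEROKEE;
    }

    static String fold(String input) {
        StringBuilder out = new StringBuilder(input.length());
        // codePoints() yields full code points, so surrogate pairs stay intact.
        input.codePoints().forEach(cp -> {
            int upper = Character.toUpperCase(cp);
            // Cherokee folds to uppercase; everything else to lower(upper(cp)).
            out.appendCodePoint(isCherokee(cp) ? upper : Character.toLowerCase(upper));
        });
        return out.toString();
    }
}
```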
- 2017-11-30T19:09:39+00:00
Markus KARG reporter
Option 4 would work for now with Cherokee, but could break again in the future once Unicode 9.0.0 adopts the next strange set of rules for another ancient language... So I will try to implement option 3 now. Shouldn't be too complex.
- 2017-12-01T07:39:23+00:00
Markus KARG reporter
Implemented option 3. Works pretty well. Even more, it now provides FULL case folding according to Unicode 8.0. Class loading is a bit slow now (100 ms), but that should not be a problem in the real world. See https://bitbucket.org/mkarg/precis/commits/0f00d2ed2ea872ebc85d311fdf2fbdaa2ddb74c6.

I will send a pull request once I have finished my current work on the module-info.class tests in Babbler (intentionally keeping the fix unmerged until I am really sure that no other JDK 9 related precis fixes are needed).
- 2017-12-02T16:50:10+00:00
Christian Schudt repo owner
Case folding was used in RFC 7564, but RFC 8264 no longer uses it.

Instead of case folding, toLowerCase() is now used.
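
As a rough illustration of why the Cherokee problem goes away with this change (a sketch only; the referenced commit may do it differently, and the use of Locale.ROOT is an assumption): a lowercase Cherokee letter is simply left unchanged by lowercasing, so nothing expects U+13F8 to become U+13F0 anymore.

```java
import java.util.Locale;

// Hypothetical helper, not the library's actual method.
final class LowerCaseMapping {
    static String lower(String input) {
        // Locale.ROOT avoids locale-specific rules such as the Turkish dotless i.
        return input.toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(lower("Juliet"));
        // U+13F8 is already lowercase (or unassigned on older JDKs), so it is unchanged.
        System.out.println(lower("\u13F8").equals("\u13F8"));
    }
}
```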

See 850d123a

Since then it compiles with JDK 9, too.

I am closing this issue.
- 2019-07-01T08:22:15+00:00
Christian Schudt repo owner
- changed status to resolved
- 2019-07-01T08:22:26+00:00
Christian Schudt repo owner
- changed status to closed
- 2019-07-01T08:22:34+00:00

Assignee: Christian Schudt

Type: bug

Priority: major

Status: closed

Votes: 0

Watchers: 2