13F8 is not mapped to 13F0 on JDK 9

Issue #1 closed
Markus KARG created an issue

On JDK 9, CaseFoldingTests fails at code point 13F8, which apparently is not mapped to 13F0 (as required by CaseFolding-8.0.0.txt).

I assume that the explicit support for Unicode 8 in Java 9 causes this problem, because these characters were probably simply not contained in Java 8.

Comments (15)

  1. Markus KARG reporter

    Looking at the implementation, I think it is clear why the test breaks. As described on various web sites, case folding cannot be done by simply chaining upper/lower operations. This works well for Latin, but not for some other scripts. With JDK 9, Cherokee was added, which has exactly this problem. So the question is: what algorithm do we actually want for such a case?
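
    To make the failure concrete, here is a minimal sketch of the chained approach. The class and method names are made up for illustration and are not taken from the Precis code base. It assumes a runtime whose Character data includes Unicode 8.0 (for example JDK 9); on an older runtime 13F8 is unassigned and simply passes through unchanged.

```java
// Hypothetical demo class, not part of Precis.
final class ChainedFoldingDemo {

    /** Naive folding: upper-case the code point, then lower-case the result. */
    static int chainedFold(int codePoint) {
        return Character.toLowerCase(Character.toUpperCase(codePoint));
    }

    public static void main(String[] args) {
        int latin = 'A';        // CaseFolding.txt maps 0041 to 0061
        int cherokee = 0x13F8;  // CaseFolding-8.0.0.txt maps 13F8 to 13F0

        System.out.printf("U+%04X -> U+%04X%n", latin, chainedFold(latin));
        // With Unicode 8.0 data, 13F8 is upper-cased to 13F0 and then
        // lower-cased back to 13F8, so the chain never reaches the
        // fold target 13F0.
        System.out.printf("U+%04X -> U+%04X%n", cherokee, chainedFold(cherokee));
    }
}
```

    For Latin the chain happens to land on the fold target; for Cherokee it returns to the lowercase letter, which is why the test fails at 13F8.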

  2. Christian Schudt repo owner

    I've spent some hours on this issue already, without a good solution.

    The test seems to pass for the Cherokee code points if only toUpperCase() is applied to them. This is really weird, because for every other code point, case folding works with toUpperCase().toLowerCase() (as the test shows).

    Actually it feels like a bug in Java 9 then.

    I've tried many things with the Character class to detect Cherokee code points, but with no success.

    We could also simply create a static mapping (by parsing the CaseFolding.txt) and use that mapping for case folding.

    At the moment I don't really care about Cherokee codepoints, so I'll leave this issue open.

  3. Markus KARG reporter

    Well, if it is a bug in Java 9 I could report it and we just wait for a fix. On the other hand, I am not a PRECIS guru, so can we be sure that it IS a bug?

  4. Markus KARG reporter

    I looked at how String.toLowerCase is implemented in Java 9, and it is pretty clear that the solution is error-prone: it contains hard-coded special cases for particular code points. I assume that they forgot to add more special cases for the new Unicode 8 code points. I will look into it more deeply over the next few days, but it might take a while, as I need to prepare for EclipseCon first.

  5. Markus KARG reporter

    Since, as I understand it, this is clearly a bug in JDK 9, I have just reported it to Oracle. Let's see what happens next.

  6. Markus KARG reporter

    Meanwhile, Oracle confirmed that it is a bug. While I am confident that they will fix it some day, I do not think it will be any time soon. So I developed a workaround: the test case simply skips the Cherokee Unicode block. Works pretty well.

  7. Markus KARG reporter

    Meanwhile, I got an answer from Oracle. There is actually no bug in the JRE, but it is definitely a bug in Precis!

    Quote from the Unicode 8.0 specification (page 156)

    ...the addition of lowercase Cherokee letters as of Version 8.0 of the
    Unicode Standard, together with the stability guarantees for case folding, require that
    Cherokee letters be case folded to their uppercase counterparts. As a result, a case folded
    string is not necessarily lowercase.
    

    As a result, we have three choices to be able to compile on JDK 9 (or more precisely, on any JDK supporting Unicode 8.0).

    • Adopt my existing workaround, i.e. simply skip Cherokee symbols in the test case. This has the side effect that, at least for Cherokee symbols, the PrecisProfile.caseFold method returns an incorrect result.
    • Keep the test as it is, but skip Cherokee symbols in the implementation of PrecisProfile.caseFold. Same negative effect, and the implementation will slow down a bit, as we have to check for every single character whether it is Cherokee or not.
    • Implement a fully correct lookup in the mapping file, as required by the above quote from the spec. This certainly gives a perfect result, but will either eat a lot of CPU cycles (live lookup in the file) or a lot of RAM (preload and cache the map in memory).

    So the question is: what would you like me to do? For the sake of correctness and performance, I'd go with option 3 and preload the file into a map at class loading. But it is your project, so you have to decide what my PR shall do.
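
    A rough sketch of what option 3 could look like, assuming the semicolon-separated line format of CaseFolding.txt (code; status; mapping; # name). The class name, the embedded three-line sample and the method names are hypothetical; this is not the code from the commit, and a real implementation would load the complete file as a resource rather than a hard-coded sample.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a preloaded case folding table, not the Precis implementation.
final class CaseFoldingTableSketch {

    // A few sample entries in CaseFolding.txt format, standing in for the full file.
    private static final String SAMPLE =
            "0041; C; 0061; # LATIN CAPITAL LETTER A\n"
          + "00DF; F; 0073 0073; # LATIN SMALL LETTER SHARP S\n"
          + "13F8; C; 13F0; # CHEROKEE SMALL LETTER YE\n";

    // Built once at class loading, as proposed for option 3.
    private static final Map<Integer, int[]> FOLDINGS = parse(SAMPLE);

    /** Parses entries with status C (common) or F (full) into a lookup map. */
    static Map<Integer, int[]> parse(String data) {
        Map<Integer, int[]> map = new HashMap<>();
        for (String line : data.split("\n")) {
            int hash = line.indexOf('#');
            String content = (hash >= 0 ? line.substring(0, hash) : line).trim();
            if (content.isEmpty()) {
                continue; // blank line or pure comment
            }
            String[] fields = content.split(";");
            if (fields.length < 3) {
                continue; // not a mapping line
            }
            String status = fields[1].trim();
            if (!status.equals("C") && !status.equals("F")) {
                continue; // skip simple (S) and Turkic (T) entries
            }
            String[] targets = fields[2].trim().split("\\s+");
            int[] folded = new int[targets.length];
            for (int i = 0; i < targets.length; i++) {
                folded[i] = Integer.parseInt(targets[i], 16);
            }
            map.put(Integer.parseInt(fields[0].trim(), 16), folded);
        }
        return map;
    }

    /** Replaces every code point that has a table entry; all others are kept as-is. */
    static String caseFold(String input) {
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            int[] folded = FOLDINGS.get(cp);
            if (folded == null) {
                sb.appendCodePoint(cp);
            } else {
                for (int f : folded) {
                    sb.appendCodePoint(f);
                }
            }
        });
        return sb.toString();
    }
}
```

    With the sample table, caseFold("A\u13F8") yields "a\u13F0". Because the lookup depends only on the table and not on the runtime's Character data, the result is the same on any JDK.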

  8. Christian Schudt repo owner

    Markus, thanks for caring and checking this issue with Oracle. Really weird though.

    I'd also go with option 3.

    What about an option 4: iterate over each character and check whether there's a Cherokee char in the String? If not, do toUpperCase().toLowerCase() on the String. If yes, manually assemble the case-folded string, treating Cherokee chars with toUpperCase() and every other char with toUpperCase().toLowerCase(). The downside is that we have to create a String for each Character and also consider surrogate pairs, which can probably get tricky.
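
    Option 4 might be sketched roughly as follows. All names are hypothetical, and the Cherokee check via Character.UnicodeScript is just one possible way to detect these code points; it only recognises the Unicode 8.0 additions on a runtime whose character data includes them. Working on code points rather than chars avoids the per-character Strings and the surrogate pair handling. Note that the mixed path uses the one-to-one Character mappings, so it can differ from String.toUpperCase() for characters with multi-character mappings such as ß.

```java
import java.util.Locale;

// Hypothetical sketch of option 4, not the Precis implementation.
final class CherokeeAwareFoldSketch {

    /** True if the runtime's Unicode data assigns the code point to the Cherokee script. */
    static boolean isCherokee(int codePoint) {
        return Character.UnicodeScript.of(codePoint) == Character.UnicodeScript.CHEROKEE;
    }

    static String fold(String input) {
        // Fast path: no Cherokee present, so keep the existing chained approach.
        if (input.codePoints().noneMatch(CherokeeAwareFoldSketch::isCherokee)) {
            return input.toUpperCase(Locale.ROOT).toLowerCase(Locale.ROOT);
        }
        // Mixed path: assemble the result code point by code point.
        StringBuilder sb = new StringBuilder(input.length());
        input.codePoints().forEach(cp -> {
            int upper = Character.toUpperCase(cp);
            // Cherokee folds to upper case; everything else is lower-cased again.
            sb.appendCodePoint(isCherokee(cp) ? upper : Character.toLowerCase(upper));
        });
        return sb.toString();
    }
}
```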

  9. Markus KARG reporter

    Option 4 would work for now with Cherokee, but might break again in the future once Unicode 9.0.0 adopts the next strange set of rules for another ancient language... So I will try to implement option 3 now. Shouldn't be too complex.

  10. Markus KARG reporter

    Implemented option 3. Works pretty well. Even better, it now provides FULL case folding according to Unicode 8.0. Class loading is a bit slow now (100ms), but that should not be a problem in the real world. See https://bitbucket.org/mkarg/precis/commits/0f00d2ed2ea872ebc85d311fdf2fbdaa2ddb74c6.

    I will send a pull request once I have finished my current work on the module-info.class tests in Babbler (intentionally keeping the fix unmerged until I am really sure that no other JDK 9 related precis fixes are needed).

  11. Christian Schudt repo owner

    Case folding was used in RFC 7564, but it is no longer used in RFC 8264.

    Instead of case folding, toLowerCase() is now used.

    See 850d123a

    Since then it compiles with JDK 9, too.

    I am closing this issue.
