Support for normalization forms

Issue #12 resolved
Former user created an issue

Hi. I really like nunicode, it's excellent, and I'm using it in this project: github.com/RedisLabsModules/RediSearch

One thing I'm missing is the ability to do things like remove accents. This of course requires support for normalization forms.

Is this something you have planned?

Comments (5)

  1. Aleksey Tulinov repo owner

    Hi.

    There are no Unicode normalization forms in nunicode so far, but unaccenting would be a nice thing to have in the library, in my opinion. However, I don't think this topic is standardized.

    Do I understand correctly that you're suggesting putting the string into normalization form NFD or NFKD, then removing combining marks to get an unaccented string? My question is: do you need proper NFD/NFKD, or do you just need to remove diacritics from a regular string of precomposed characters?
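    The "decompose, then drop combining marks" approach asked about above can be sketched as follows. This is only an illustration in Python using the standard unicodedata module, not nunicode code; the function name strip_accents is a hypothetical helper introduced for this example.

```python
import unicodedata


def strip_accents(text):
    """Decompose to NFD, then drop every combining character.

    strip_accents is a hypothetical name used only for this sketch.
    """
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() returns a non-zero combining class
    # for combining marks such as U+0301 COMBINING ACUTE ACCENT.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("Ångström"))
```

    Under this sketch, strip_accents("Ångström") gives "Angstrom". Letters that have no canonical decomposition, such as ø, pass through unchanged, which is one reason a rules table can behave differently from plain decomposition.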

    If you need just diacritics removed, do you have any preference on the method of removal? For instance, could you take a look at Postgres' unaccenting rules: https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules and tell if this is what you would expect from unaccent() function or not.

    It's somewhat different from NFD/NFKD. For example, the copyright sign © doesn't decompose into ( + combining mark + C + combining mark + ) in Unicode, but Postgres has that and several other exceptions: ®, ×, etc. So it's more like compatibility decomposition with special cases. As far as I understand, it's mostly defined by CLDR's Latin-ASCII transliteration rules, but also by other people adding exceptions (because Latin-ASCII transliteration doesn't include Cyrillic-ASCII transliteration, for example).

    Is that good, or would you expect just decomposition (NFD) and removal of combining marks from the source string?
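    To make the difference between the two options concrete, here is a rough Python sketch of "decomposition plus special cases". The SPECIAL_CASES table is a tiny illustrative assumption, not a copy of the Postgres rules file, and unaccent_with_exceptions is a hypothetical name; it is not how nunicode or Postgres is implemented.

```python
import unicodedata

# Illustrative exceptions only: a hypothetical table for this sketch,
# not copied from Postgres' unaccent.rules.
SPECIAL_CASES = {"©": "(C)", "®": "(R)", "×": "*"}


def unaccent_with_exceptions(text):
    """Apply table exceptions first, otherwise NFD plus mark removal."""
    out = []
    for ch in text:
        if ch in SPECIAL_CASES:
            out.append(SPECIAL_CASES[ch])
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


if __name__ == "__main__":
    # Neither NFD nor NFKD changes the copyright sign on its own.
    print(unicodedata.normalize("NFKD", "©") == "©")
    print(unaccent_with_exceptions("© café × 2"))
```

    With plain normalization the © and × characters survive untouched, while the table-driven variant rewrites them, so the two approaches give visibly different results on the same input.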

  2. Aleksey Tulinov repo owner

    You could try fd7dd75 on master, although please note that master is Unicode 10 beta. Master introduces nu_tounaccent(), which supposedly does unaccenting. It is not compliant with anything in Unicode, but presumably it should work more or less as expected for European languages. Some details on it are here: https://bitbucket.org/alekseyt/nunicode/overview#markdown-header-unaccenting

    It's planned for release with nunicode 1.8 sometime this summer (presumably June, following the release of Unicode 10).
