Support for normalization forms
Hi. I really like nunicode, it's excellent, and I'm using it in this project: github.com/RedisLabsModules/RediSearch
One thing I'm missing is the ability to do things like remove accents. This of course requires support for normalization forms.
Is this something you have planned?
Comments (5)
-
repo owner Hi.
There are no Unicode normalization forms in nunicode so far, but unaccenting would be a nice thing to have in the library, in my opinion, although I don't think this topic is standardized.
Do I understand correctly that you're suggesting putting the string into normalization form NFD or NFKD and then removing combining marks to get an unaccented string? My question is: do you need proper NFD/NFKD, or do you just need to remove diacritics from a regular string of precomposed characters?
If you just need diacritics removed, do you have any preference on the method of removal? For instance, could you take a look at Postgres' unaccenting rules: https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules and tell me whether this is what you would expect from an unaccent() function or not.
It's somewhat different from NFD/NFKD. For example, the copyright sign © doesn't decompose into ( + combining mark + C + combining mark + ) in Unicode, but Postgres has a rule for it and for several other exceptions: ®, ×, etc. So it's more like compatibility decomposition with special cases. As far as I understand, it's mostly defined by CLDR's Latin-ASCII transliteration rules, but also by other people adding exceptions (because Latin-ASCII transliteration doesn't include Cyrillic-ASCII transliteration, for example).
Is that good, or would you expect just decomposition (NFD) and removal of combining marks from the source string?
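To make the two options concrete, here is a minimal sketch in Python (standard library only, not nunicode). It assumes "removing combining marks" means dropping characters of general category Mn after NFD. The names strip_marks_nfd, unaccent_with_rules and EXTRA_RULES are hypothetical, and the small exception table is only illustrative of the kind of special cases discussed above, not a copy of any real rules file.

```python
import unicodedata

# Hypothetical exception table for characters that NFD does not decompose.
# Illustrative only; a real rules file would be much larger.
EXTRA_RULES = {
    "©": "(C)",
    "®": "(R)",
    "×": "*",
}


def strip_marks_nfd(text: str) -> str:
    """Option 1: canonical decomposition (NFD), then drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" covers nonspacing marks such as the combining acute accent.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def unaccent_with_rules(text: str, rules=EXTRA_RULES) -> str:
    """Option 2: look up special cases first, fall back to option 1."""
    return "".join(
        rules[ch] if ch in rules else strip_marks_nfd(ch)
        for ch in text
    )


if __name__ == "__main__":
    print(strip_marks_nfd("café ©"))
    print(unaccent_with_rules("café ©"))
```

With plain decomposition, © passes through unchanged because it has no canonical decomposition; the same is true of letters such as ł. That gap is what a rule table with exceptions is meant to cover.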
-
repo owner You could try fd7dd75 on master, although please note that master is Unicode 10 beta. Master introduces nu_tounaccent(), which is meant to do unaccenting. It does not conform to anything in Unicode, but it should presumably work more or less as expected for European languages. Some details on it are here: https://bitbucket.org/alekseyt/nunicode/overview#markdown-header-unaccenting . It's planned for release with nunicode 1.8 sometime this summer (presumably June, following the release of Unicode 10).
-
repo owner - changed status to resolved
Please reopen this issue or open a new one if you have any problems with unaccenting.
-
Thanks, I forgot about this. Will try it soon and report back. Very cool.
-
Oh, oops. forgot to log in before opening the issue :)