Support for normalization forms

Dvir Volk

Oh, oops. Forgot to log in before opening the issue :)

2017-01-16T12:06:17+00:00

Aleksey Tulinov repo owner

Hi.

There are no Unicode normalization forms in nunicode so far, but unaccenting would be a nice thing to have in the library, in my opinion. That said, I don't think unaccenting is standardized anywhere.

Do I understand correctly that you're suggesting putting the string into normalization form NFD or NFKD, then removing combining marks to get an unaccented string? My question is: do you need proper NFD/NFKD, or do you just need to remove diacritics from a regular string of precomposed characters?
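For reference, the decompose-then-strip approach can be sketched in Python with the standard `unicodedata` module. This is only an illustration of the idea, not nunicode code; the helper name `strip_marks` is made up for this example.

```python
import unicodedata


def strip_marks(text: str, form: str = "NFD") -> str:
    """Decompose text into the given normalization form, then drop combining marks.

    Hypothetical helper for illustration; `form` should be "NFD" or "NFKD".
    """
    decomposed = unicodedata.normalize(form, text)
    # unicodedata.combining() returns a non-zero combining class for combining marks
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_marks("café"))
print(strip_marks("Ångström"))
```

With NFD only canonical decompositions are applied, so a compatibility character such as the ligature ﬁ (U+FB01) is left alone; passing `"NFKD"` expands it as well.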

If you just need diacritics removed, do you have any preference on the method of removal? For instance, could you take a look at Postgres' unaccenting rules: https://github.com/postgres/postgres/blob/master/contrib/unaccent/unaccent.rules and tell me whether this is what you would expect from an unaccent() function?

It's somewhat different from NFD/NFKD. For example, the copyright sign © doesn't decompose into ( + combining mark + C + combining mark + ) in Unicode, but Postgres has that and several other exceptions: ®, ×, etc. So it's more like compatibility decomposition with special cases. As far as I understand, it's mostly defined by CLDR's Latin-ASCII transliteration rules, plus exceptions added by other people (because Latin-ASCII transliteration doesn't include Cyrillic-ASCII transliteration, for example).

Is that good, or would you expect just decomposition (NFD) and removal of combining marks from the source string?
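To make the difference between the two options concrete, here is a rough Python sketch of "compatibility decomposition with special cases". The `EXCEPTIONS` table and the `unaccent` function are hypothetical and only illustrative; the entries are not copied from Postgres' unaccent.rules or from CLDR.

```python
import unicodedata

# Hypothetical exception table for characters that have no Unicode decomposition.
# Entries are illustrative assumptions, not taken from any real rules file.
EXCEPTIONS = {"©": "(C)", "®": "(R)", "×": "*"}


def unaccent(text: str) -> str:
    """Apply the exception table first, otherwise NFKD plus combining-mark removal."""
    out = []
    for ch in text:
        if ch in EXCEPTIONS:
            out.append(EXCEPTIONS[ch])
            continue
        decomposed = unicodedata.normalize("NFKD", ch)
        out.append("".join(c for c in decomposed if not unicodedata.combining(c)))
    return "".join(out)


# Plain NFKD leaves the copyright sign untouched: it has no decomposition mapping.
print(unicodedata.normalize("NFKD", "©") == "©")
print(unaccent("© café"))
```

Plain decomposition never touches characters like ©, so anything beyond removing diacritics needs a table of this kind, and the contents of that table are a matter of choice rather than something Unicode defines.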

2017-01-16T17:57:33+00:00

Aleksey Tulinov repo owner

You could try fd7dd75 on master, although please note that master is Unicode 10 beta. Master introduces nu_tounaccent(), which is supposed to do unaccenting. This is not compliant with anything in Unicode, but it should presumably work more or less as expected for European languages. Some details are here: https://bitbucket.org/alekseyt/nunicode/overview#markdown-header-unaccenting .

It's planned for release with nunicode 1.8 sometime this summer (presumably June, following release of Unicode 10).

2017-04-04T14:14:45+00:00

Aleksey Tulinov repo owner

changed status to resolved

Please reopen this issue or open a new one in case of any problem with unaccenting.

2017-04-04T14:15:56+00:00

Dvir Volk

Thanks, I forgot about this. Will try it soon and report back. Very cool.

2017-04-30T12:46:03+00:00
