option for accent-insensitive matching

Issue #191 wontfix
Felix König created an issue

With full Unicode case folding, this match succeeds:

regex.match(r"strasse", "straße", regex.V1 | regex.IGNORECASE)

However, these do not match:

regex.match(r"gehor", "gehör", regex.V1 | regex.IGNORECASE)
regex.match(r"francois", "françois", regex.V1 | regex.IGNORECASE)

So I would really appreciate a flag, probably IGNOREACCENTS, that solves this problem. I also described it here: StackOverflow: Regex - match a character and all its diacritic variations (aka accent-insensitive)

Sorry if this is already possible and I just missed how.

Comments (9)

  1. Matthew Barnett repo owner

    Two codepoints would be considered the same if they are converted to the same codepoint(s) by some algorithm.

    The problem is: what is that algorithm?

    You could try decomposing the codepoint with unicodedata.normalize('NFKD', codepoint) or unicodedata.normalize('NFD', codepoint) and then removing all those codepoints where unicodedata.category(codepoint) returns 'Mn' (are there any other categories that should be removed?).
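
A minimal sketch of the decompose-and-strip idea described above, using only the standard library. The helper name strip_marks is made up for illustration; it is not part of regex or unicodedata. Only category 'Mn' is removed, as suggested; whether 'Mc' or 'Me' should also go is left open here.

```python
import unicodedata

def strip_marks(text, form="NFKD"):
    """Decompose text, then drop nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize(form, text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Escapes are used so the input is guaranteed to be the precomposed form.
print(strip_marks("geh\u00f6r"))     # gehor
print(strip_marks("fran\u00e7ois"))  # francois
```

Passing form="NFD" instead keeps compatibility characters (for example ligatures) intact while still splitting off the accents.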

    Then there are codepoints such as 'ø' that don't decompose; you might think it should match 'o', like 'ö' does, but it doesn't. Without a defined series of steps, you'd have to define the equivalences (or exceptions) by hand.
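
The difference between 'ö' and 'ø' can be checked with unicodedata.decomposition. The table below is a hypothetical, deliberately tiny example of the hand-written equivalences this would require; it is nowhere near complete.

```python
import unicodedata

# 'ö' has a canonical decomposition (o + combining diaeresis); 'ø' has none.
print(repr(unicodedata.decomposition("\u00f6")))  # '006F 0308'
print(repr(unicodedata.decomposition("\u00f8")))  # ''

# Hypothetical hand-maintained exceptions for letters that do not decompose.
HAND_WRITTEN_FOLDS = str.maketrans({
    "\u00f8": "o",  # ø
    "\u00d8": "O",  # Ø
    "\u0142": "l",  # ł
    "\u0141": "L",  # Ł
})

print("sm\u00f8rrebr\u00f8d".translate(HAND_WRITTEN_FOLDS))  # smorrebrod
```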

    There's also the question of whether you should be doing it.

    Is it really OK to ignore the accents?

    What about Latin 'A' vs Cyrillic 'А' vs Greek 'Α'? As you can see, they look the same.
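
For reference, a short check showing that the three look-alike letters are distinct codepoints and that normalization does not unify them. The variable name lookalikes is just for this example.

```python
import unicodedata

lookalikes = ["\u0041", "\u0410", "\u0391"]  # Latin A, Cyrillic А, Greek Α

for ch in lookalikes:
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

# NFKD leaves each letter unchanged, so three distinct strings remain.
normalized = {unicodedata.normalize("NFKD", ch) for ch in lookalikes}
print(len(normalized))  # 3
```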

    If you can come up with a solution, please let me know!

  2. Felix König reporter

    For my case it didn't really matter that much, but the general idea was to ignore (Latin) diacritics, which are well-defined. It looks like converting the string to NFKD and removing Mn-category codepoints does exactly that for me. I also managed to reach my goal with the unidecode module, which basically normalizes all those characters into ASCII.
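
A sketch of how the normalize-then-match workaround might be wired up, written against the standard-library re module rather than regex so that it runs without extra installs. fold_with_offsets and search_ignoring_accents are hypothetical helper names. The offset list is there because a match on the folded string otherwise reports positions that do not line up with the original text.

```python
import re
import unicodedata

def fold_with_offsets(text):
    """Return (folded, offsets); offsets[i] is the index in text of the
    character that produced folded[i]."""
    folded, offsets = [], []
    for index, ch in enumerate(text):
        for part in unicodedata.normalize("NFKD", ch):
            if unicodedata.category(part) != "Mn":
                folded.append(part)
                offsets.append(index)
    return "".join(folded), offsets

def search_ignoring_accents(pattern, text, flags=0):
    """Search the folded text; return the matching slice of the original."""
    folded, offsets = fold_with_offsets(text)
    found = re.search(pattern, folded, flags)
    if found is None:
        return None
    if found.start() == found.end():
        return ""
    start = offsets[found.start()]
    end = offsets[found.end() - 1] + 1
    # Keep combining marks that trail the last matched character.
    while end < len(text) and unicodedata.category(text[end]) == "Mn":
        end += 1
    return text[start:end]

print(search_ignoring_accents("francois", "Bonjour fran\u00e7ois!"))  # françois
```

The pattern itself is assumed to be written without accents; it is not folded here, since normalizing regex syntax could change its meaning.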

    I don't know enough about Unicode and all its quirks to come up with a clean and general solution, or a definition of what it should do. That's why I hoped there was some well-defined flag or something that does this. With your previous comment I was able to find out that, for example, C# has a flag called IgnoreNonSpace to do exactly what you described: https://msdn.microsoft.com/en-us/library/system.globalization.compareoptions(v=vs.110).aspx Would copying some of those (Unicode-related) flags for regex be a good idea, or just ballast for the library?

  3. Matthew Barnett repo owner

    Those flags are for a simple, straightforward string-comparison method, i.e. does this string match that string.

    In my case, it's a regex, which has a whole lot of other stuff going on too, and it's already a lot of code!

  4. animalize

    I have an idea. It's a bit complex, so let me explain.

    First step

    We need a better unicodedata module; the current one is quite limited. It could have these features:

    1. A powerful string comparer/finder, like the C# one mentioned above.

    More than that, it could ignore differences between similar-looking characters, like Latin A (\u0041) vs Cyrillic А (\u0410) vs Greek Α (\u0391).

    It could also provide other useful options or extension mechanisms.

    2. A better normalization algorithm.

    NFC, NFD, NFKC, and NFKD are limited by character order. For example, the two-character sequence E WITH ACUTE plus COMBINING MACRON (\u00E9\u0304) cannot be converted to the single character E WITH MACRON AND ACUTE (\u1E17) by the current options, although the two are linguistically the same. A new algorithm could fix this.

    FYI, E WITH MACRON plus COMBINING ACUTE (\u0113\u0301) can already be converted to E WITH MACRON AND ACUTE (\u1E17).
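
The order dependence can be reproduced with the current unicodedata module. This only illustrates today's NFC behaviour; it is not the proposed new algorithm.

```python
import unicodedata

acute_then_macron = "\u00e9\u0304"  # é followed by COMBINING MACRON
macron_then_acute = "\u0113\u0301"  # ē followed by COMBINING ACUTE
precomposed = "\u1e17"              # LATIN SMALL LETTER E WITH MACRON AND ACUTE

# Both marks share combining class 230, so normalization never reorders them.
print(unicodedata.combining("\u0301"), unicodedata.combining("\u0304"))  # 230 230

print(unicodedata.normalize("NFC", macron_then_acute) == precomposed)  # True
print(unicodedata.normalize("NFC", acute_then_macron) == precomposed)  # False
```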

    3. Misc features.

    Emoji processing, like https://pypi.python.org/pypi/emoji
    Line splitting on Unicode line breaks.
    Etc.

    Second step

    Let the new Unicode module expose an internal API to regex that just provides a "Unicode string comparer". I suppose the API would be pretty similar to regex's named lists.

    That way, users could integrate the Unicode comparer with regex.

    Then the problem from the original post could be solved like this, where Ua means "ignore accents":

    single string

    regex.match(r"(?Ua:gehor)", "gehör")
    

    or a list

    regex.match(r"\Ua<lst>", "françois", lst=["gehor", "francois"])
    

    or integrate with regex

    regex.match(r"\Ua<lst> (says?|said)", "françois says", lst=["gehor", "francois"])
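
The (?Ua:...) and \Ua<lst> forms above are proposed syntax and do not exist in regex. A rough approximation with the standard-library re module, assuming it is acceptable to fold the subject string before matching, might look like this. fold_accents and accent_insensitive_alternation are hypothetical names.

```python
import re
import unicodedata

def fold_accents(text):
    """NFKD-decompose, then drop nonspacing marks (category 'Mn')."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

def accent_insensitive_alternation(words):
    """Build a non-capturing alternation from a word list, longest first."""
    folded = sorted({fold_accents(w) for w in words}, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(w) for w in folded) + ")"

names = accent_insensitive_alternation(["gehor", "francois"])
pattern = re.compile(names + r" (says?|said)")

match = pattern.match(fold_accents("fran\u00e7ois says"))
print(match.group(0))  # francois says
```

Note that the match object refers to the folded string, so its text and offsets are those of "francois says", not of the original input.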
    

    I suppose this would cover most requirements of RE+Unicode. If someone has special requirements beyond this, they can write some Python code to finish the job with the help of regex and the new Unicode module.

    IMHO, the .FULLCASE flag is not very useful in real cases, see #178. It could be moved to the new Unicode module and implemented there without regrets like those in #178. Meanwhile, regex's LOC would shrink a lot.

  5. Matthew Barnett repo owner

    This is an open source project. You're free to fork it and add whatever enhancements you want yourself.
