= for fuzzy matches

Issue #41 resolved
Former user created an issue

An "=" operator could be pretty handy for fuzzy matches, for finding only erroneous text. For example, in a list of Hotmail email accounts, you could search for misspellings with '@(hotmail\.com){e=1}'. This would save the user an extra "grep -v" to filter the correct emails out of the list of matches.

Comments (11)

  1. Former user Account Deleted

    or even something like this would be more powerful:

    '@(hotmail\.com){0<e<3}'

    matching only texts with 1 or 2 errors

  2. Former user Account Deleted

    A regex such as:

    @(hotmail\.com){e=1}

    can be written as:

    @(?>(hotmail\.com){e<=1})(?<!@hotmail\.com)

    although that's not quite as convenient, I admit! :-)

    Note that the fuzzy part needs to be in an atomic group in order to stop it backtracking to find a worse match. For example, without the atomic group and given the string "@hotmail.comb", the fuzzy part will first match "@hotmail.com" with 0 errors, then the negative look-behind will reject it, so the fuzzy part will backtrack and match "@hotmail.comb" with 1 error.

    I'm not sure how easy it'll be to add a lower limit; such a problem could still occur.

  3. Former user Account Deleted

    I think I've figured out how to do it, but how much demand is there for it? You gave an example, but is that a real use case?

  4. Former user Account Deleted

    I am fixing tags for 25k+ text documents for a web site, so I do have a real (different) use case. That was just an example. But I think it would be a really nice feature for the regex module.

  5. Former user Account Deleted

    Here is a real example, translated into English:

    3 servic detection
    1 service detect
    5 service detecti
    46 service detection
    1 in service detection

    The site has manually entered tags and their frequencies, taken from 25k+ (non-English) text documents. Most of the time the correct tag has a high frequency, and anything that is close enough to a correct one (except the correct tag itself) should probably get fixed.

  6. Former user Account Deleted

    What fuzzy regex would you use to match the incorrect strings in your example? Would it be this:

    (?:^\d+ service detection$){1<=e<=3}

  7. Former user Account Deleted

    No, the first part is the frequency of a tag, not part of the tag itself. I would search for a match with:

    from regex import compile

    r = compile(r'(?:service detection){0<e<5}')
    m = r.match(str)
    if m:
        ...

  8. Former user Account Deleted

    Added in regex 0.1.20120119.

    Note that it supports only constraints of the form e<=3 or 1<=e<=3 ("<" is also allowed), but not "=".
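
As a rough illustration of the lower-limit idea discussed in this thread, here is a minimal sketch in plain Python that does not use the regex module at all. It swaps in an ordinary Levenshtein edit distance over whole strings for the module's fuzzy matching, so it only approximates what a constraint such as 1<=e<=3 does inside a pattern. The names edit_distance and near_misses, the default limits, and the sample domains are assumptions made up for this example; the tag counts are the ones quoted in comment 5.

```python
def edit_distance(a, b):
    """Levenshtein distance: each insertion, deletion or substitution costs 1."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # drop ca from a
                curr[j - 1] + 1,           # insert cb into a
                prev[j - 1] + (ca != cb),  # substitute, or keep if equal
            ))
        prev = curr
    return prev[-1]


def near_misses(candidates, correct, min_errors=1, max_errors=3):
    """Keep candidates whose distance to `correct` lies in [min_errors, max_errors].

    min_errors=1 plays the role of the lower limit: exact copies are skipped.
    """
    return [c for c in candidates
            if min_errors <= edit_distance(c, correct) <= max_errors]


# Tag frequencies from comment 5; the most frequent tag is assumed correct.
tag_counts = {
    "servic detection": 3,
    "service detect": 1,
    "service detecti": 5,
    "service detection": 46,
    "in service detection": 1,
}
correct_tag = max(tag_counts, key=tag_counts.get)
bad_tags = near_misses(tag_counts, correct_tag)

# The e=1 case from the original request, applied to whole domains.
domains = ["hotmail.com", "hotmal.com", "hotmail.comb", "gmail.com"]
bad_domains = near_misses(domains, "hotmail.com", max_errors=1)
```

Comparing whole strings sidesteps the backtracking problem described in comment 2, at the cost of not being able to search inside longer text the way a fuzzy pattern can.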
