allow lookarounds in conditionals

Issue #163 resolved
Former user created an issue

It would be really helpful to allow lookarounds, in addition to a group name/id, in conditional expressions, as PCRE does, to allow a regex like this:
regex.findall(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you but I don't hate her either. You and her are so different")

Comments (9)

  1. Matthew Barnett repo owner

    I don't see the point; as far as I can see, it doesn't add anything.

    The purpose of a conditional expression is to test whether a capture group has matched anything.

    What would it do that a bare lookaround doesn't already do?

    Wouldn't r'(?(?<=love\s)you|(?<=hate\s)her)' just give the same results as r'((?<=love\s)you|(?<=hate\s)her)'?

  2. Francesco Cabrio

    Hello, I posted the proposal as anonymous by accident.
    Yes, it is indeed the same. I posted it as an example of the behavior, not as a motivation. Sorry, poor explanation, I reckon.

    The real reason it interests me is to make the whole expression more general for metaprogramming or dynamic generation of regexes.
    Sometimes I want a conditional in a template to respond to the existence of a previously captured group or to content around the current position in the string to be matched. The specific behavior, the regex actually crafted, depends on what is going on within the main program calling the regex facility. To do this dynamically at runtime I have to treat both cases separately. If lookarounds were recognized it wouldn't be the case. It would make the template work cleaner.

    In short, the PCRE behavior doesn't add or detract anything from a manually crafted pattern but it would simplify some interesting dynamic techniques, especially in a language like Python that has great metaprogramming capabilities.
    Maybe there is a way to create a neat general template that does the same thing without resorting to lookarounds in conditionals but I tried to do that unsuccessfully.

    Regards

  3. Matthew Barnett repo owner

    I've realised that they're not the same.

    With a bare lookaround, if it chooses the first branch and subsequently fails, it'll backtrack and try the second branch.

    With a lookaround in a conditional expression, if it chooses the first branch and subsequently fails, it'll backtrack but won't try the second branch.

    For example, on the string "123abc", ^(?:(?=\d)\d+\b|\w+) will match but ^(?(?=\d)\d+\b|\w+) won't match.
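
The difference described above can be checked with the standard-library `re` module, as a rough sketch. `re` does not accept a lookaround as a conditional test, so the conditional is emulated here by pairing the test with its negation. That emulation is an assumption made for illustration, not how the `regex` module implements the feature.

```python
import re

text = "123abc"

# Bare lookaround inside an alternation: when \d+\b fails (there is no word
# boundary between "3" and "a"), the engine backtracks and tries \w+.
alternation = re.compile(r'^(?:(?=\d)\d+\b|\w+)')

# Emulated conditional (assumption: test paired with its negation): the
# second branch is guarded by (?!\d), so it cannot be used once the test
# (?=\d) has succeeded.
emulated_conditional = re.compile(r'^(?:(?=\d)\d+\b|(?!\d)\w+)')

alt_match = alternation.match(text)
cond_match = emulated_conditional.match(text)

print(alt_match.group() if alt_match else None)    # 123abc
print(cond_match.group() if cond_match else None)  # None
```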

  4. Francesco Cabrio

    Oh, I feel dumb now. It makes sense that the whole conditional has to be skipped, whereas an alternation backtracks into the other alternative. It implements mutual exclusion, not alternation. It turned out that, in my code, the 'then' and 'else' subexpressions were simple and mutually exclusive, so I got lucky with my ignorance and it worked (close call!).
    It doesn't work as the general tool I thought it would be if the 'then' subexpression triggers a backtrack, as in your example. Otherwise it still works, but it's risky business and definitely not a good reason to ask you to implement that behavior.
    Unfortunate, but that behavior is still really interesting, and I can see how it can be useful.

    Any expression of the form (? (test) then | else )
    should be refactored as (test) then | (complement-of test) else

    This (unsightly) example:

    regex.sub(r'(?(?<=(?:[^3](?=..a)))(\d\D)|.)', r'\1-', '23dac83a6bc93ad')
    

    should be refactored to:

    regex.sub(r'(?<=(?:[^3](?=..a)))(\d\D)|(?<!(?:[^3](?=..a))).', r'\1-', '23dac83a6bc93ad')
    

    both returning '-3d---8-----9---'

    Besides making the pattern ugly, this can make the match significantly slower, since it has to check (complement-of test) every time it backtracks, especially with variable-length or complex lookbehinds. For these two reasons I think it is still a valuable enhancement to implement, actually more so than for my original motivation, since this would be applicable more frequently than some dynamic generation/metaprogramming scenario.
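
For the dynamic-generation use case, the refactoring rule above could be wrapped in a small helper. The following is only a sketch using the standard-library `re` module; `conditional_as_alternation` and its parameters are hypothetical names introduced for illustration, and the test is assumed to be a lookaround body that `re` accepts (fixed width for lookbehinds).

```python
import re


def conditional_as_alternation(test, then, otherwise, behind=False):
    """Build (test)then|(complement-of test)otherwise as a pattern string.

    Hypothetical helper: `test` is the body of the lookaround, and `behind`
    selects a lookbehind instead of a lookahead.
    """
    positive, negative = ('(?<=', '(?<!') if behind else ('(?=', '(?!')
    return (f'(?:{positive}{test}){then}'
            f'|{negative}{test}){otherwise})')


# The pattern from the original report, rewritten without a conditional.
pattern = conditional_as_alternation(r'love\s', 'you', r'(?<=hate\s)her',
                                     behind=True)
text = "I love you but I don't hate her either. You and her are so different"

matches = re.findall(pattern, text)
print(pattern)  # (?:(?<=love\s)you|(?<!love\s)(?<=hate\s)her)
print(matches)  # ['you', 'her']
```

A branch that itself contains a top-level `|` would need to be wrapped in a non-capturing group before being passed in, since the helper does not add one.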

  5. Matthew Barnett repo owner

    It's not only the 'then' part that could trigger backtracking. It could match the 'then' part, progress into the remainder of the pattern, fail, backtrack through the 'then' part, then try the 'else' part.

    Anyway, it's now on my todo list.

  6. Francesco Cabrio

    Great!
    But I don't understand the previous remark.
    How, in:

    regex.findall(r'(?(?=\w).{3}|.+)b', 'a123bc')
    

    could the 'else' part be tried after backtracking through the 'then' subexpression? Isn't the whole conditional skipped, leaving the pattern position pointer after ...\w+) ?
    A backtrack could be triggered in the 'else', sure, but how can the 'else' be reached after the 'then' has been traversed, regardless of its success?

    I get 123b, not a123b, at https://regex101.com/#pcre

    By the way, thank you for your replies, and congratulations on the rest of the work you've done on this very good regex package.
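
The two results mentioned in this comment (123b and a123b) can be reproduced side by side with the standard-library `re` module. This is a sketch only: since `re` has no lookaround conditionals, the conditional is emulated by guarding the second branch with the negated test, which is an assumption for illustration. Here the failure happens in the remainder of the pattern (the trailing `b`), not inside the 'then' branch.

```python
import re

text = 'a123bc'

# Plain alternation: '.{3}' matches 'a12', the trailing 'b' fails, and the
# engine backtracks into the second branch '.+'.
with_alternation = re.findall(r'(?:(?=\w).{3}|.+)b', text)

# Emulated conditional: the second branch is guarded by (?!\w), so it is
# unavailable at position 0 and the search moves on to position 1.
with_emulated_conditional = re.findall(r'(?:(?=\w).{3}|(?!\w).+)b', text)

print(with_alternation)           # ['a123b']
print(with_emulated_conditional)  # ['123b']
```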

  7. 王珺

    When I test it, the original example fails:

    regex.search(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you")
    

    yields None, while 'you' is expected, and Python crashes while executing

    regex.findall(r'(?(?<=love\s)you|(?<=hate\s)her)', "I love you but I don't hate her either")
    

    Regards
