Have an option for POSIX-compatible longest match of alternates

Issue #150 resolved
Former user created an issue

Hello there,

Currently both re and regexp short-circuit the first match for alternate matches. For example, (A|AA)$ matches only the last character in AA.

On the other hand, POSIX regex (C, C++, Boost, Ruby) would demand that the longest leftmost match is returned, i.e AA. Most modern engines seem to reject this on the basis that it makes the engine terribly slow (because it cannot match alternates eagerly).

However, the leftmost longest overall match behavior can be quite useful in some situations, where otherwise workarounds are needed and it looks like there is currently no engine for Python which supports this behaviour.

It would be nice to have the POSIX behaviour of the longest submatch as an option when compiling a regular expression.

Comments (10)

  1. Former user Account Deleted

    It is kind of hard to come up with examples that seem really convincing. There is always a workaround. Leftmost longest can just be convenient, but also very, very slow and memory consuming. However, it is the behavior that I usually would expect. I actually was surprised that there was short-circuit instead.

    Nonetheless: One example might be identifier matching, where the identifiers have a common prefix but are otherwise configurable: 'getItem|getItemValue|setItem|setItemValue'. Here, Python won't ever match 'getItemValue' and 'setItemValue'. The workaround is to sort the alternate elements by length, but that only works reliable if the subexpressions are simple strings.

    r'\d+(\w(\d*)?|[eE]([+-]\d+))': Matches either a number with an exponent or an integer with a base suffix (e.g. 'b', 'x', 'b12', where 'b12' means to base 12. This is simplified, because \d is too strict, here). Short-circuit treats the exponent as an 'e' suffix and stops. Workaround is to explicitly name the possible suffixes, introduce explicit boundaries (r'\b'), or exclude 'e' or --again-- re-ordering. Another one: some (German) umlauts can decomposed into character pairs. These may sometimes need to be count as a single character. A very naive version of umlaut-aware character counting could use r'(\w|ae|oe|ue|ss)', which does not work in Python. The workaround again is to have the r'\w' last. Also: (Mr|Mrs) (from the perl manual).

    Also related to this (but different): Matching 'one(self)?(selfsufficient)?' against 'oneselfsufficient' regex (and re, and perl) will match 'oneself' instead of the full string (longest match). POSIX would require the full match.

    Glenn Fowler has some examples where he analyzes the POSIX behavior.

    Okui and Suzuki claim to have an algorithm which avoids the worst case exponential explosion. That might be of interest.

    For reference: POSIX, ruby, and PCRE (in my tests) do leftmost longest for alternates.

    Python, Perl, Java, and JavaScript short-circuit for alternates. I was not able to find an alternative engine for Python that implements POSIX behavior.

    Go has both (Compile and CompilePOSIX).

    Engines also differ with regard to optional subexpressions (see above). This is actually what worries me a little, but I would need to do a survey to table up the various engines' behavior in this case.

  2. Matthew Barnett repo owner

    I've used some online regex testers.

    As far as I can see, PCRE (PHP), JavaScript and Ruby all use first-match, not longest match.

    I'm not surpised that PCRE uses first-match because the "PC" part stands for "Perl-Compatible", and Perl uses first-match.

  3. Former user Account Deleted

    Hmm, that might have been an error in my examples, I think I tried (a|aa)$ which matches aa on all engines because of the leftmost. With ruby (I have not re-tested that, yet) and PCRE gone, as it looks like, there isn't any proper POSIX engine with unicode support around anymore. It's probably not too bad if Python doesn't have one, too (it also didn't have one before regex).

  4. Matthew Barnett repo owner

    It appears that there's a bit more to it than simply the "leftmost longest match"; there's also the question of the capture groups.

    I found the rules a little hard to understand the way they were written, so I've rephrased them as follows:

    Have we already found a match?
    Yes:
        How long is it compared to the existing one?
        Shorter:
            Reject the new match.
        Longer:
            Accept the new match.
        Same length:
            For each group:
                Where did it match?
                Earlier:
                    Accept the new match.
                Later:
                    Reject the new match.
                Same position:
                    How long is it compared to the existing one?
                    Shorter:
                        Reject the new match.
                    Longer:
                        Accept the new match.
                    Same length:
                        Do nothing.
            Reject the new match.
    No:
        Accept the new match.
    
  5. Former user Account Deleted

    Hmm. SingleUnix literally says

    • The search for a matching sequence starts at the beginning of a string and stops when the first sequence matching the expression is found, where first is defined to mean "begins earliest in the string".
    • Consistent with the whole match being the longest of the leftmost matches, each subpattern, from left to right, matches the longest possible string
    • as well as It is possible to determine what strings correspond to subexpressions by recursively applying the leftmost longest rule to each subexpression, but only with the proviso that the overall match is leftmost longest. For example, matching \(ac*\)c*d[ac]*\1 against acdacaaa matches acdacaaa (with \1=a); Note the syntax difference for BREs which need escapes for the controls.

    If I understand it correctly, the proviso part is the one that seems to get dropped for efficiency.

    In your algorithm sketch, does accept mean accept and terminate or just mark this as the currently accepted match?

    Because, If I understand you correctly, searching a(a)?(ab)? in aab should return the full aab (with groups None and ab) not aa (with groups a and None), because that is the longer overall match. Instead all but the POSIX-compliant engines and none of those that support Unicode seem to apply recursive subgroups match left-to-right, longest local match, ignoring external length

    I think we might have some really nice examples for the documentation sections at the very least ;-).

  6. Matthew Barnett repo owner

    The most important part is that it should be the longest overall match; the rest of it is about how to choose between possible matches that contain capture groups.

    In the algorithm code, "accept" and "reject" are about whether to accept or reject the new match as being the best one found so far at this position.

    Incidentally, the POSIX standard doesn't have backreferences (or so I've read), although some implementations have added them.

  7. Former user Account Deleted

    I'm not sure, but I think that depends on the version of POSIX you want to refer to.

    SingleUnix (which currently is IEEE Std. 1003.1-2013) has backrefs:

    The back-reference expression \n matches the same (possibly empty) string of characters as was matched by a subexpression enclosed between ( and ) preceding the \n.

  8. Log in to comment