When using recursive regex, groups are not assigned well

Issue #363 invalid
Emil Bode created an issue

I believe a reprex explains it best:

In: import regex as re
In: myregex = r'\(([^()]*|(?R))*\)' # find set with matching parentheses (raw string, so the backslashes reach the regex engine unchanged)
# Note that the first capturing group is [^()]*|(?R), that is everything between matching parentheses.
In: teststring="This is (a) test (really (yes, really))"
# End-goal: extract sets of matching parentheses, i.e. ["(a)", "(really (yes, really))"]

In: re.findall(myregex, teststring)
Out: ['', '']

Expected behaviour:

A return of ['a', 'really (yes, really)']

With related functions, I’ve found that the overall matches are correct, but the reported groups are not:

In: re.search(myregex, teststring)
Out: <regex.Match object; span=(8, 11), match='(a)'>
In: re.search(myregex, teststring, pos=9)
Out: <regex.Match object; span=(17, 39), match='(really (yes, really))'>
In: re.search(myregex, teststring).group(1)
Out: ''

In: [x for x in re.finditer(myregex, teststring)]
Out: [<regex.Match object; span=(8, 11), match='(a)'>,
 <regex.Match object; span=(17, 39), match='(really (yes, really))'>]
In: [x.groups() for x in re.finditer(myregex, teststring)]
Out: [('',), ('',)]

In: re.sub(myregex, ' ~~ \\1 ~~', teststring)
Out: 'This is  ~~  ~~ test  ~~  ~~'

Note that manually making another group kind of works:

In: re.findall('('+myregex+')', teststring)
Out: [('(a)', ''), ('(really (yes, really))', '')]

To me it looks like it can’t decide on what to number the group, as it is defined recursively. Maybe it uses the last match for the group (which is empty)?

Environment

Tested in 2 environments:

  • Python 3.7.1 (64-bit) under Spyder 3.3.2, Windows 10, with regex-version 2.5.31
  • Python 3.7.3 (64-bit) under Spyder 3.3.4, Windows 10, with regex-version 2.5.74

I’ve also noted that in this case it makes no difference whether I specify V1=True or not, even though the native re module throws an error on ?R.

Edit (added):

Issue #78 works as expected for me.

Using captures() does capture the results, though it’s not entirely clear to me from the docs what captures() is meant to do, or how it is supposed to differ from groups().
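
For readers with the same question about captures() versus groups(), here is a minimal sketch. It assumes the third-party regex module is installed and skips that part if it is not. The pattern and input string are illustrative, not taken from this report. group(1) and groups() report only the last iteration of a repeated group, whereas captures(1) lists every iteration.

```python
import re

try:
    import regex  # third-party module; optional for this sketch
except ImportError:
    regex = None

pattern = r"(\w{3})+"   # illustrative pattern: group 1 repeats three times
text = "123456789"      # illustrative input

# Both re and regex report only the LAST iteration of a repeated group.
last_only = re.search(pattern, text).group(1)

# regex additionally keeps every iteration, available through captures().
all_captures = None
if regex is not None:
    all_captures = regex.search(pattern, text).captures(1)

print(last_only)      # 789
print(all_captures)   # ['123', '456', '789'] when regex is installed
```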

Comments (5)

  1. Matthew Barnett repo owner

    Given the regex:

    \(([^()]*|(?R))*\)
    

    and the text:

    (a)
    

    Here's what happens:

    1. Match '('.

    2. Begin capture.

    3. Match 'a'.

    4. End capture. We captured 'a'.

    5. Repeat? Yes, because we matched 'a'.

    6. Begin capture.

    7. Match ''.

    8. End capture. We captured ''.

    9. Repeat? Not this time, because we matched ''.

    The last capture of group 1 was ''.

    So, to me it looks like it's not a bug.
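
The "last capture wins" behaviour traced above can be reproduced with a simpler, non-recursive pattern in the standard re module. This is only a sketch: the pattern and sample string are illustrative assumptions, not part of the issue. It also shows one possible workaround, which is to wrap the whole repetition in an outer group so that everything the repeat consumed is reported.

```python
import re

# Illustrative pattern: group 2 sits inside a repeat, group 1 wraps the repeat.
pattern = r"((?:(\w),?)+)"
m = re.match(pattern, "a,b,c")

whole_repeat = m.group(1)    # everything the repetition consumed
last_iteration = m.group(2)  # only the final iteration of the inner group

print(whole_repeat)    # a,b,c
print(last_iteration)  # c
```

The inner group was entered three times ('a', 'b', 'c'), but only its final value survives in group(2).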

  2. Emil Bode reporter

    Okay, I get why it works this way.

    But shouldn’t there be a way to extract what was matched?

    I was mostly confused by findall just returning empty strings.

    For my use case, I don’t really care about what group something belongs in, or whether or not the outer parentheses are reported, but it felt really ugly to have to resort to [x.group(0) for x in re.finditer(myregex, teststring)], just to find out which substrings matched my regex.

  3. Matthew Barnett repo owner

    findall returns what the groups matched if there are groups in the regex; it returns the entire matched portion of the string only if there are no groups. The regex module is written to be backwards compatible with the re module, so that behaviour is not going to change.
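
The findall rule described above can be sketched with the standard re module, whose behaviour regex mirrors here. The sample text and patterns are illustrative assumptions. The return shape depends on how many capturing groups the pattern has, and a non-capturing group (?:...) is one way to get whole matches back without switching to finditer.

```python
import re

text = "10-20 30-40"  # illustrative input

no_groups = re.findall(r"\d+-\d+", text)          # whole matches
one_group = re.findall(r"(\d+)-\d+", text)        # contents of group 1 only
two_groups = re.findall(r"(\d+)-(\d+)", text)     # one tuple per match
non_capturing = re.findall(r"(?:\d+)-\d+", text)  # (?:...) does not count as a group

# finditer gives whole matches regardless of how many groups exist.
whole_matches = [m.group(0) for m in re.finditer(r"(\d+)-\d+", text)]

print(no_groups)      # ['10-20', '30-40']
print(one_group)      # ['10', '30']
print(two_groups)     # [('10', '20'), ('30', '40')]
print(non_capturing)  # ['10-20', '30-40']
print(whole_matches)  # ['10-20', '30-40']
```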
