Are atomic groups taking up more memory than necessary?

Create issue
Issue #309 resolved
Former user created an issue

I've noticed that the following regex: r'<!--(?>[^-]++|-(?!->))*+-->' (which is supposed to do the same job as the old good comment regex: r'<!--[\s\S]*?-->') ends up with a MemoryError when there is a very long comment in the string.

regex.findall(r'<!--(?>[^-]++|-(?!->))*-->', paste())
Traceback (most recent call last):
  File "<input>", line 1, in <module>
  File "...\Python37\lib\site-packages\regex.py", line 333, in findall
    overlapped, concurrent)
MemoryError

Why is that happening? Since all the groups in the regex are atomic ones, the engine only needs to remember the last successful matching position before each atomic group and proceed towards the end as soon as each group fails. That shouldn't require that much memory, should it?

Now I know that what I just said is a gross simplification of what is actually happening under the hood and I might be wrong about it, but I thought maybe there is something that can be improved so I decided to create this issue.

Here is a sample text you can download: https://test.wikipedia.org/wiki/User:Dalba/comment-regex-test?action=raw&ctype=text/x-wiki

Comments (4)

  1. Matthew Barnett repo owner

    You have 2 slightly different regexes there. In the first paragraph you have (?>...)*+ (possessive), whereas in the trackback you have (?>...)* (greedy).

    The memory problem is more to do with the repeat than the atomic group.

    This:

    (...)*+
    

    is basically the same as this:

    (?>(...)*)
    

    It behaves like a greedy repeat, remembering the state for each iteration, until it can match the repeated part no more. Only then does it leave the outer atomic group and discard the accumulated backtracking information.

    You can see this more clearly with:

    >>> print(regex.search(r'(aa|a){2}+', 'aa'))
    <regex.Match object; span=(0, 2), match='aa'>
    >>> regex.search(r'(aa|a){2}+', 'aa').groups()
    ('a',)
    >>> print(regex.search(r'(?>(aa|a)){2}+', 'aa'))
    >>> None
    

    I'll try to look for ways of reducing the memory usage.

  2. 40a

    Thanks, Matthew! It worked. Yet I came up with a similar issue with possessive groups:

    >>> regex.search(r'<!--([^-]*(-(?!->))*)*+-->', paste())  # successful match
    ...
    >>> # Change the two greedy quantifiers into possessive ones:
    >>> regex.search(r'<!--([^-]*+(-(?!->))*+)*+-->', paste())
    Traceback (most recent call last):
      File "<input>", line 1, in <module>
      File "...\Python\Python37\lib\site-packages\regex.py", line 266, in search
        concurrent, partial)
    MemoryError
    
  3. Log in to comment