Memory Error - regex.findall

Issue #383 resolved
Tyler Whitcomb created an issue

I’m trying to use regex.findall to parse directives and their arguments from code comments, but with the regex module I get a MemoryError at line 337 of regex.py. The same pattern with the re module returns matches without errors. Here’s the relevant code:

  import regex

  compiled = regex.compile(
      r"(?P<prefix>(?:\s*)?(?:#|//|--|/\*|<!--)*(?:\s*))*?(?P<directive>@directive)"
      r"(?:(?P<preopenbracket>\s*)(?P<openbracket>\[)(?:(?P<postopenbrakcet>(?:\s*)?(?:#|//|--|/\*|<!--)*(?:\s*))*?"
      r"(?P<metadatakey>.*?)(?P<metadatacolon>:)(?P<metadataspacing>\s*)(?P<metadatavalue>.*))*(?P<preclosebracket>(?:\s*)?"
      r"(?:#|//|--|/\*|<!--)*(?:\s*))*?(?P<closebracket>\])(?P<postclosebracket>(?:\s*)?(?:#|//|--|\*/|-->)*(?:\s*))*?)?$",
      regex.MULTILINE
  )

  matches = regex.findall(compiled, string)

This is the string I was testing (I’ve tried other similar strings with similar results):

@directive

# @directive [
#    Name:  GitHub
#    Link:  https://www.github.com/
# ]

/* @directive [
    Name:  Google     
    Link:  https://www.google.com/
] */

@directive
[
    Name:  Bitbucket
    Link: https://www.bitbucket.org/
]

I’m using 64-bit Python on a little-endian operating system.

Comments (3)

  1. Matthew Barnett repo owner

    The regex itself contains some features that are not recommended, such as repeated items that can themselves match zero characters. When you do that you're relying on the regex engine to not get stuck in a loop.

    The part matching the prefix is like that, and it also includes (?:\s*)?, which can be reduced to \s*.
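A quick check of the reduction mentioned above, as a minimal sketch using the standard re module. The sample strings are made up for illustration; the point is only that wrapping \s* in an optional group does not change what is matched.

```python
import re

# Both patterns consume an optional run of whitespace before '#'.
redundant = re.compile(r"(?:\s*)?#")
simplified = re.compile(r"\s*#")

# Illustrative inputs only; they are not taken from the issue.
samples = ["#", "   #", "\t \t#", "x#"]

for s in samples:
    a = redundant.match(s)
    b = simplified.match(s)
    # Either both patterns fail, or both match exactly the same span.
    assert (a is None) == (b is None)
    if a is not None:
        assert a.span() == b.span()
```

Since \s* can already match zero characters, the outer ? adds nothing except one more way for the engine to match the empty string.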

    The problem seems to lie with the lazy repeats that contain repeats. Using a greedy repeat instead doesn't show the problem.
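One possible way to follow this advice is sketched below for the prefix part only. It is not the fix applied in the regex module and not a rewrite of the full pattern from the report. It assumes that the prefix is optional whitespace followed by zero or more comment openers, that matching only horizontal whitespace (spaces and tabs) is acceptable, and that each directive starts on its own line. The names COMMENT_OPEN and prefix_and_directive are hypothetical. The repeated group is greedy and always consumes at least one comment opener, so it can never match zero characters.

```python
import re

# Comment openers taken from the original pattern.
COMMENT_OPEN = r"(?:#|//|--|/\*|<!--)"

# Greedy repeat whose body must consume a comment opener, so no
# iteration of the loop can match the empty string.
prefix_and_directive = re.compile(
    r"^(?P<prefix>[ \t]*(?:" + COMMENT_OPEN + r"[ \t]*)*)(?P<directive>@directive)",
    re.MULTILINE,
)

# Shortened version of the test string from the report.
text = (
    "@directive\n"
    "\n"
    "# @directive [\n"
    "#    Name:  GitHub\n"
    "# ]\n"
    "\n"
    "/* @directive [\n"
    "    Name:  Google\n"
    "] */\n"
)

prefixes = [m.group("prefix") for m in prefix_and_directive.finditer(text)]
print(prefixes)
```

The same idea (make each repeated group consume at least one character, and prefer greedy repeats) would need to be applied to the bracket and metadata parts as well before the whole pattern could be considered safe.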
