Certain wildcard regexes cause pyPEG to go into a memory-consumption loop

Issue #17 invalid
xiongchiamiov
created an issue

Take this simple script:

import pypeg2

class Dialog(str):
    grammar = pypeg2.maybe_some(pypeg2.restline)

text = '''
"Lorem ipsum."
'''
pypeg2.parse(text, Dialog)

On my Macbook Pro, running this just keeps consuming more and more memory - the longest I've let it run, it got up to 900 MB of real memory and 4.5 minutes of cputime.

Some small amount of experimentation shows that this occurs any time a regex includes * (star).

[$]> uname -a
Darwin geror.local 11.4.2 Darwin Kernel Version 11.4.2: Thu Aug 23 16:25:48 PDT 2012; root:xnu-1699.32.7~1/RELEASE_X86_64 x86_64
[$]> python --version
Python 3.3.2
[$]> pip freeze
invoke==0.5.1
nose==1.3.0
pyPEG2==2.14.0

Comments (5)

  1. Volker Birk repo owner

    This is not a bug. The grammar is senseless. restline matches the rest of the line and never fails. If there is no rest, restline matches the empty string and does not fail.

    So maybe_some(restline) can never terminate.

  2. xiongchiamiov reporter

    I don't quite understand why this grammar isn't sane; it's saying "possibly match n number of things, where the thing is anything". So you should end up with one thing that is the whole of the text you pass in, no?

    Perhaps an example of what I'm trying to do would help.

    In the text I'm parsing, there is the concept of a Dialog. A dialog consists of some number of quoted strings on a line, eg:

    "Lorem ipsum?"
    "Dolor!" "Sit amet!"
    

    I modeled this like so:

    class Dialog(str):
        grammar = pypeg2.maybe_some(string), pypeg2.endl
    

    A string, then, is a quote, followed by some stuff, then a matching closing quote. I used this regex to represent it:

    string = re.compile(r'".*"')
    

    (I believe at some point in time I also tried a more sophisticated matcher that matched only non-quotes in the middle, since I'm not sure how Python handles greediness.)

    This, however, fails in the way I've described.

    Now, while what I've done seems perfectly reasonable to me, I know extremely little of parsing theory, so I don't doubt I'm doing something obviously incorrect. What is the approach I should be taking to model a Dialog?

  3. Volker Birk repo owner

    Your regex will not match any string. It will fail. A possible solution coud be to add a question mark after the asterisk:

    string = re.compile(r'".*?"')
    

    As an alternative, you can use:

    string = re.compile(r'"[^"]*"')
    

    Your sample code has a second problem. You shouldn't parse with a complex grammar into a class derived by str. A complex grammar will give a list of objects. Because it cannot be stored otherwise, you will end up with a text representation of a list as the value of an object of the type Dialog(str)

  4. Volker Birk repo owner
    >>> from pypeg2 import *
    >>> string = re.compile(r'".*?"')
    >>> class Dialog(List):
    ...     grammar = maybe_some(string), endl
    ...     
    >>> parse('"hello" "world"', Dialog)
    Dialog(['"hello"', '"world"'])
    
  5. Log in to comment