1. Juancarlo Añez
  2. grako
  3. Pull requests

Pull requests

#13 Merged

stateful parsing for grako

  1. Marcus Brinkmann


let's see if you like this or not :) It's a pure feature request, and I know the README says that grako is feature complete. However, in the parser I am working on (mediawiki-like syntax), I have several issues that would benefit a lot from stateful PEG parsing:

  • the syntax incorporates several HTML tags, which need proper nesting. In addition, wrong nesting must not prevent parsing. Instead, the wrong elements are rendered literally using a catch-all default rule. That default rule must be stopped from matching a proper close tag by a context-depending negative look-ahead assertion. I currently solve this by duplicating two rules for every valid tag.
  • the syntax uses bullet-style lists of various types, that can be nested arbitrarily. For example, a list element could be "**#*" followed by a "*#" item, which would mean that three sublist have to be closed and one new sublist opened. I currently collect all such list items in a flat list independent on the nesting level and implement nesting in a post-processing step, as stateless PEG parsing just can't do this, because of lack of locality.
  • the syntax allows a paragraph to be indented by a space, which wraps it in a pre-element. But this whitespace must be stripped from the other inline elements that are allowed in a paragraph (when they span over a newline). To work around this, I would have to copy all inline element rules to change the meaning of "\n" to "\n " (note the additional space character), causing dozens of rules to be duplicated.

The last item (content-dependending whitespace in inline elements) is what pushed me over the edge to implement stateful parsing in grako. I did consider several implementations. One of my favorite ideas was to extend the grammar to allow context-specific rule replacement and state-transitions on the rhs of a rule, like this (to continue the example above):

pre = paragraph/prestate ;
newline = "\n" ;
newline/prestate = "\n " ;

This is nice, because it makes the grammar readable (also, it can be implemented as a grammar transformation without special support), but it is rather limited. It can not handle the nested-lists example above, and for XML-style tags it still requires some repetition.

So, I decided I would ignore the grammar side and push the semantics layer a bit. The result is this pull request. It adds what I think is the absolute minimum amount of fuzz to Grako to support stateful memoization and semantics that can access the parser context and thus get and set the state (currently by directly accessing _state).

This implementation does not change the grako grammar, and it is fully backwards compatible. Its performance impact is minimal for the problem (I could not measure a difference, but my test cases are rather small).

Semantics are now much more powerful, as they can access the parser context. This allows dynamic rule generation (of course, the created rule must only depend on the position, rule and state, to not break memoization). States can be any hashable object, so lists and dicts need to be converted to tuples and frozensets (recursively). It's a good idea to internalize these state objects so only one copy is in memory, but this can be implemented at the semantics layer without special support, by keeping a dictionary of state objects:

The test/example shows how to implement wiki-style lists without any post-processing. I added some comments below for your convenience.

Comments (5)

  1. Juancarlo Añez repo owner

    I'm all for this. It is a major change, so it would require a 2.1.0 or a 3.0.0 version. I favor 2.1 as the change doesn't seem to break anything already working. How about working on a branch for a while? I can see obvious simplifications like self.goto(p, state), and keeping the branch up to date with changes on 2.0 should be fairly easy. My preference is to develop new features in named branches; this one could be called something like stateful_rules?

    Oh! I don't see the changes to the bootstrap parser or the grako.ebnf grammar, but you already mentioned that.

  2. Marcus Brinkmann author

    If you create a new branch, I'll edit the pull request to be for that branch (I think that's how this works, although I am not sure).

    As I said, I was contemplating the grammar change, but then didn't go for it. It's not very powerful, more complex to implement, but of course also more convenient to use. Is the grammar change something you would like to see?

  3. Juancarlo Añez repo owner

    I created a stateful_parsing branch and accepted the pull request to it. I can do the changes to the grammar and the bootstrap parser once you formalize the syntax and the semantics.

    On a related note (not for this branch), it might be time to switch the bootstrap parser to the one generated by the bootstrap tests.

  4. Marcus Brinkmann author

    I think that's fine. My mediawiki parser shows that even without syntactical support, it is very useful, so why not first get some more experience with stateful parsing in general before adding shortcuts.