[patch] lexer grammars for parser generators: ANTLR, Ragel

Issue #345 resolved
ananelson created an issue

I have attached a new parsers.py file which contains pygments lexers for ANTLR and Ragel parser generators.

I have tested this on a large number of example files (http://ananelson.com/tmp/). If you would like to add examples for this to your test directory, let me know what license you require them to be under and I will look for something suitable.

I have not changed the copyright notice on whichever lexer file I copied as a template.

I'm sure there are a few edge cases that will need tweaking, so you might want to put my contact info in this file for people to get in touch with me directly if you decide to add it to pygments.

Comments (14)

  1. ananelson reporter

    This is great, thanks. :-)

    ANTLR has an explicit language statement. Grammars target Java by default, and any other target language must be declared explicitly, e.g. `language=Python;` (whitespace is allowed around the `=`).

    Here are the statements for the available languages I found on the ANTLR website's documentation:

    `language=Python;` `language=Ruby;` `language=ActionScript;` `language=JavaScript;` `language=CSharp2;` `language = Perl5;` `language = C;`


    As with Ragel, this will only be reliable if you happen to be parsing a full file since a fragment might not contain this.
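    A heuristic along these lines could be sketched as follows. This is only an illustration, not the actual parsers.py code: the function name and the alias map are assumptions.

```python
import re

# Hypothetical sketch: detect the ANTLR target language from the
# "language=...;" option. Whitespace is allowed around the '='.
LANGUAGE_OPTION = re.compile(r'language\s*=\s*(\w+)\s*;')

# Illustrative mapping from option values to assumed lexer aliases.
TARGET_ALIASES = {
    'Python': 'antlr-python',
    'Ruby': 'antlr-ruby',
    'ActionScript': 'antlr-actionscript',
    'JavaScript': 'antlr-javascript',
    'CSharp2': 'antlr-csharp',
    'Perl5': 'antlr-perl',
    'C': 'antlr-c',
}

def guess_antlr_target(text):
    """Return an assumed lexer alias for the grammar's target language.

    Falls back to Java, the ANTLR default, when no language option is
    present (e.g. when lexing a fragment rather than a full file).
    """
    match = LANGUAGE_OPTION.search(text)
    if match:
        return TARGET_ALIASES.get(match.group(1), 'antlr-java')
    return 'antlr-java'
```

    As noted above, this is only reliable on full files; a fragment without the option would silently fall back to the Java default.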

  2. thatch
    • changed milestone to 1.1

    This is mostly merged in my branch (http://code.timhatch.com/hg/pygments-tim) now. I found two locations of roundtrip errors, which are now fixed. The example file for Ragel is now explicitly using the ragel-cpp lexer (the examplefiles test can use `<lexer>_<filename>` to work around things like this).

    When this ticket was opened, `analyse_text` was only used when `pygmentize -g` was specified. Since #355, it is now used whenever there are multiple lexers for the same extension as well.

    Ana, can you see any good way to infer the file type for the source inside Antlr files? Anything else you would like addressed?

  3. ananelson reporter

    Okay, one example each for ANTLR and Ragel have been uploaded. Authors of both are happy for them to be distributed under BSD with the rest of Pygments.

    Although the Ragel example works for me when I invoke it directly with:

    pygmentize -l ragel-cpp tests/examplefiles/rlscan.rl

    it's not cooperating too well with "make test".

    I know that the last few lines of code, 281-289, cause a problem with the C++ parser. I left them in, just in case the C++ maintainer wants to take a look at them. I'm happy for these lines to be deleted, though, since I don't think the syntax in question is that common or likely to be syntax highlighted.

    I have deleted lines 281-289 from my copy of tests/examplefiles/rlscan.rl.

    When I ran "make test" after adding this file, I got an error saying that an error token had been generated. I eventually guessed this was due to the wrong lexer being used, so I changed line 46 of test_examples.py to help troubleshoot:

    self.failIf(type == Error, 'lexer ' + lx.__class__.__name__ + ' generated error token for ' + absfn)

    This confirmed that Pygments was trying to use the RagelJavaLexer instead of the RagelCppLexer. I worked around this by making RagelCppLexer last in the list of lexers; the last one seems to be the one Pygments chooses when there are multiple lexers that handle a given filename extension.
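    The ordering behaviour described here can be modelled with a small sketch. This is a toy model of the selection rule as observed in this thread, not Pygments internals: when analyse_text scores tie, the lexer registered last wins, which is why moving RagelCppLexer to the end made it the one chosen.

```python
class StubLexer:
    """Toy stand-in for a Pygments lexer with a fixed analyse_text score."""
    def __init__(self, name, score):
        self.name = name
        self._score = score

    def analyse_text(self, text):
        return self._score

def pick_lexer(candidates, text):
    """Pick the highest-scoring lexer; on ties, the later entry wins."""
    best, best_score = None, -1.0
    for lexer in candidates:
        score = lexer.analyse_text(text)
        if score >= best_score:  # '>=' lets later registrations win ties
            best, best_score = lexer, score
    return best
```

    With two lexers that both score 0.0 for a file, `pick_lexer([java, cpp], text)` returns the second one, matching the workaround of listing RagelCppLexer last.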

    This fixed the error token, but now I am getting a round trip error. (sigh) I'm not sure why. I will take a look when I get time.

    I was a bit confused in development as to whether analyse_text() is automatically invoked in the situation when you have multiple lexers claiming to handle, e.g. .rl files, as in this case. There is an informal ragel convention to put something like @lang = c++ in a comment at the top of your file, and I could have analyse_text look for something like this and use that lexer if it finds it. Otherwise I think it's fine to have to specify the -l option. But, is analyse_text used in this situation?
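    An analyse_text hook for that convention could look something like this sketch; the exact comment format accepted and the alias map are assumptions for illustration:

```python
import re

# Assumed form of the informal annotation, e.g. "// @lang = c++".
LANG_HINT = re.compile(r'@lang\s*=\s*([\w+#-]+)', re.IGNORECASE)

# Illustrative mapping from hint values to assumed Ragel lexer aliases.
HOST_ALIASES = {
    'c': 'ragel-c',
    'c++': 'ragel-cpp',
    'cpp': 'ragel-cpp',
    'java': 'ragel-java',
    'ruby': 'ragel-ruby',
}

def ragel_host_hint(text):
    """Return (alias, confidence) from an @lang hint, or (None, 0.0)."""
    match = LANG_HINT.search(text)
    if match:
        alias = HOST_ALIASES.get(match.group(1).lower())
        if alias:
            return alias, 1.0
    return None, 0.0
```

    A file without the annotation would return no hint, leaving the user to specify -l as before.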

    I will try to figure out what's happening with the round trip error. It might be a few days before I get a chance though.

  4. ananelson reporter

    I have made the suggested changes to parsers.py and formatted it so now "make check" doesn't report any errors.

    I am working on obtaining suitably licensed examples for testing.

  5. ananelson reporter

    Re item no. 4, I intermittently get the lexer roundtrip error, but other times the tests run perfectly. I find that a "make mapfiles" does something which seems to fix the error.

    I tried removing AntlrActionScriptLexer and then I started getting the roundtrip error on the previous lexer in the file, AntlrJavaLexer, so whatever the issue might be I don't think it's related to that particular lexer.

  6. ananelson reporter

    My apologies for not replying sooner! I wasn't subscribed to the feed or email notifications. I have done so now.

    1. Yes, absolutely right. Not sure how this worked without the comma.
    3. How many example files do you want to include? I suggest just one each for ANTLR and Ragel.
    4. I will take a look at this and see if I can find anything. We can delete this lexer combo in the meantime if the problem can't be identified.
    5. I started to do this at one point. Can you suggest an easy way to test analyse_text?
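    One self-contained way to test an analyse_text heuristic is a stub-based unit test. StubRagelCppLexer and its heuristic below are illustrative assumptions, not the real lexer:

```python
import unittest

class StubRagelCppLexer:
    """Stand-in for the real lexer; the heuristic is an assumption."""
    @staticmethod
    def analyse_text(text):
        # Look for the informal "@lang = c++" comment convention.
        return 0.6 if '@lang = c++' in text.lower() else 0.0

class AnalyseTextTest(unittest.TestCase):
    def test_hint_present(self):
        self.assertGreater(
            StubRagelCppLexer.analyse_text('// @LANG = C++\n'), 0)

    def test_hint_absent(self):
        self.assertEqual(StubRagelCppLexer.analyse_text('int x;'), 0.0)

if __name__ == '__main__':
    unittest.main()
```

    The same shape works against the real lexer class once its heuristic is settled: feed it a snippet with and without the hint and assert on the score.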

    I will post updated file and examples for testing shortly.

  7. thatch

    Hi Ana,

    Looks great. I'm working on integrating this into the Pygments codebase this week. Here are my thoughts having spent about an hour looking at the code:

    1. Line 149 of parsers.py doesn't have a comma, but it seems it should. Can you confirm?
    2. Yes, contact info would be helpful for the copyright line and the AUTHORS file.
    3. Yes, we'll need example files for the automated tests to run on. The rest of Pygments is BSD-licensed.
    4. I haven't tracked the issue down yet (it might be the other lexer), but the AntlrActionScriptLexer doesn't roundtrip properly and is failing the automated tests.
    5. Be thinking about how `analyse_text` can be implemented to quickly detect whether it's a match for your lexers (at least the lexers likely to be used in the wild).

    Could you go ahead and reformat the file to fit the coding conventions enforced by the 'make check' target (basically max line length around 80 chars, and it's `Cpp` instead of `CPP`)?
