An attempt to translate a language defined in ANTLR's meta-language into a parser expressed through PyParsing's object-oriented interface.


Computer languages are used throughout computing to specify anything from the structure of data (e.g. JSON, XML, YAML), to behaviour (e.g. Go, Java, C, Python, Squeak), to the appearance of data (e.g. HTML, LaTeX, PostScript).

For a computer to make sense of the symbols, words and phrases that make up such a language, the rules of the language have to be expressed in a compact form that allows systematic processing.

In the case of context-free languages, this compact form has traditionally been the Backus-Naur Form (BNF), and the systematic processing of such languages is handled by a parser, a common kind being the recursive descent parser.

Recursive descent parsers can be crafted by hand or, more generally, generated by programs called compiler-compilers.
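To make "crafted by hand" concrete, here is a minimal sketch of a hand-written recursive descent parser. The two-rule grammar in the comment, the class name Parser and the helper names are all made up for illustration; they are not part of this project. Each grammar rule becomes one method, and rules that refer to each other become methods that call each other.

```python
import re

# Grammar (hypothetical, for illustration only):
#   expr : term ('+' term)* ;
#   term : NUM | '(' expr ')' ;
TOKEN = re.compile(r"\s*([0-9]+|[+()])")

def tokenize(text):
    """Split the input into number, '+', '(' and ')' tokens."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens

class Parser:
    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        # Look at the next token without consuming it.
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def expect(self, token):
        if self.peek() != token:
            raise SyntaxError(f"expected {token!r}, found {self.peek()!r}")
        self.pos += 1

    def expr(self):
        # expr : term ('+' term)*
        value = self.term()
        while self.peek() == "+":
            self.pos += 1
            value += self.term()
        return value

    def term(self):
        # term : NUM | '(' expr ')'
        token = self.peek()
        if token is not None and token.isdigit():
            self.pos += 1
            return int(token)
        self.expect("(")
        value = self.expr()
        self.expect(")")
        return value

def evaluate(text):
    """Parse the text with the grammar above and return the sum it denotes."""
    parser = Parser(tokenize(text))
    value = parser.expr()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected trailing token {parser.peek()!r}")
    return value

print(evaluate("1 + (2 + 3) + 4"))
```

A compiler-compiler automates exactly this kind of rule-to-function translation, which becomes tedious and error-prone once a grammar grows beyond a handful of rules.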

ANTLR is such a compiler-compiler, written in Java. The ANTLR project provides, amongst other things, a language in which to express other languages (a meta-language similar to BNF but with more capabilities) and a compiler that generates a parser for the described language in one of a number of programming languages, called "targets". For example, ANTLR could accept the description of a simple CSV file format and generate code to read the contents of such a file in Java, C or Python.

PyParsing, on the other hand, is a Python module dedicated to the description of parsers. It provides an object-oriented application programming interface to the elementary objects required to describe a language, and allows a programmer to piece them together to express a parser in Python.
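As a small illustration of that interface, the sketch below describes a single row of comma-separated integers. It assumes the pyparsing package is installed; the names integer, comma and csv_row are chosen for this example only.

```python
from pyparsing import Suppress, Word, ZeroOrMore, nums

# Elementary objects: an integer token, and a comma that is matched but dropped.
integer = Word(nums)
comma = Suppress(",")

# Piecing them together: one integer followed by any number of ", integer".
csv_row = integer + ZeroOrMore(comma + integer)

tokens = csv_row.parseString("10, 20, 30", parseAll=True).asList()
print(tokens)
```

The parser is an ordinary Python object built out of other Python objects, which is what makes it a plausible output format for a translator.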

What is this all about then?

At the time this piece of software was written, ANTLR v4 lacked the ability to produce parsers in Python. (This was well within the capabilities of the software, but other target languages seemed to be taking priority over Python.)

Therefore, it was only natural to ask: "Would it be possible to write a program that understands ANTLR's meta-language and produces a Python program in which PyParsing is used to express the parser?"

The answer, or at least an attempt at an answer, is found in the source of this project.

How is it done?

This translation between representations occurs in two steps. The first step parses the ANTLR meta-language; the second step produces the PyParsing output according to a set of translation rules.
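The two steps could be sketched as follows, restricted to the simplest kind of lexer rule (a single character class with an optional repetition suffix). The function names parse_rule and emit_pyparsing, the dict used as the intermediate form and the regular expression are assumptions made for illustration, not this project's code; the project's first step uses a PyParsing grammar for the ANTLR meta-language rather than a regular expression.

```python
import re

# Step 1 (hypothetical, heavily simplified): recognise rules of the form
#   NAME : [chars] suffix ;     where suffix is '', '+', '*' or '?'
RULE = re.compile(r"\s*([A-Z][A-Za-z0-9_]*)\s*:\s*(\[[^\]]+\])\s*([+*?]?)\s*;?\s*$")

def parse_rule(text):
    """Parse one simple ANTLR lexer rule into an intermediate dict."""
    match = RULE.match(text)
    if match is None:
        raise ValueError(f"unsupported rule: {text!r}")
    name, charset, suffix = match.groups()
    return {"name": name, "charset": charset, "suffix": suffix}

def emit_pyparsing(rule):
    """Step 2: apply one translation rule and emit PyParsing source text.

    Escaping of quotes and backslashes in the pattern is ignored in this sketch.
    """
    pattern = rule["charset"] + rule["suffix"]
    return f'{rule["name"]} = Regex("{pattern}")'

print(emit_pyparsing(parse_rule("NUM:[0-9]+")))
```

Keeping an explicit intermediate form between the two steps means the translation rules can be reviewed or changed without touching the code that reads the grammar.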

As far as parsing ANTLR's meta-language is concerned, an ANTLR (v4) description of it was already available. This was translated manually, almost clause for clause, into a PyParsing representation.

Translating a grammar written in the meta-language is then a matter of finding suitable mappings between the two representations. For example:

NUM:[0-9]+ --maps to--> NUM = Regex("[0-9]+")

or a more complex example:

HEX:'0x'[0-9A-Fa-f]+;
INT:[+-]? [0-9]+;
numeric:HEX|INT;

would map to:

HEX = Forward()
INT = Forward()
numeric = Forward()

numeric << (HEX ^ INT)
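For reference, one way the three rules could be completed into a runnable PyParsing program is sketched below. The expressions assigned to HEX and INT are assumptions; the text above only shows their Forward() declarations and the final numeric rule. The sketch assumes the pyparsing package is installed.

```python
from pyparsing import Combine, Forward, Literal, Regex

# Declarations, as in the mapping shown above.
HEX = Forward()
INT = Forward()
numeric = Forward()

# Assumed bodies for the two lexer rules (not given in the original text).
# Combine keeps '0x' and the digits adjacent and returns them as one token.
HEX << Combine(Literal("0x") + Regex("[0-9A-Fa-f]+"))
INT << Regex("[+-]?[0-9]+")

# '^' builds an Or expression, which picks the longest matching alternative.
numeric << (HEX ^ INT)

for sample in ("0x1F", "-42"):
    print(sample, numeric.parseString(sample, parseAll=True).asList())
```

The longest-match behaviour of '^' matters here: for the input 0x1F, INT on its own would match just the leading 0, so a first-match alternative with INT listed first would fail to consume the whole input.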

These are rather simplistic examples, but they do demonstrate one way to map between representations. Unfortunately, more than one mapping can produce the same end result, with varying degrees of efficiency; choosing between them is perhaps the object of a third step that refines or optimises the generated code.
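As a small illustration of two mappings with the same end result, the ANTLR rule NUM:[0-9]+ could be expressed with Regex or with Word. Picking Word(nums) as the alternative is an assumption of this sketch, and no claim is made here about which of the two is faster.

```python
from pyparsing import Regex, Word, nums

# Two candidate mappings for the ANTLR rule  NUM:[0-9]+
# (Word(nums) as the alternative is an assumption of this sketch).
num_as_regex = Regex("[0-9]+")
num_as_word = Word(nums)

sample = "2024"
regex_tokens = num_as_regex.parseString(sample, parseAll=True).asList()
word_tokens = num_as_word.parseString(sample, parseAll=True).asList()

# Both mappings accept the same text and yield the same single token.
print(regex_tokens == word_tokens)
```

A refinement step would need some criterion, such as measured parse time on representative input, to choose between equivalent candidates like these.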

What is the current status and where do I go from here?

At the moment, ANTLR2pyparsing has a PyParsing representation of the ANTLR meta-language that should, in theory, be able to parse any ANTLR grammar file.

It also has a text file that describes, to some extent, the mapping between elements of ANTLR's meta-language and PyParsing, and some work has already been undertaken towards applying the transformation and generating PyParsing files.

These mappings were derived manually, either by observing the syntax trees produced by various existing ANTLR grammars or by constructing examples that lead to syntax trees with a particular structure.

For this reason, it would be best for someone with very detailed knowledge of ANTLR to review the existing transformation rules for all the elements that make up the language, rather than continuing to derive them by example.