Syntax, Grammars and Trees

Grammars

A grammar describes the syntactic aspect of a language. Formally, a grammar is a tuple \(G = \langle T,N,P,S \rangle\), where \(T\) is the set of terminals (the alphabet), \(N\) is the set of non-terminals, \(P\) is the set of grammar productions over \(T\) and \(N\), and \(S \in N\) is the start symbol.

Syntactically, a language \(L\) is the set of strings over the alphabet \(T\) that conform to the grammar \(G\). \(L\) is a subset of \(T^*\) – the (infinite) set of all strings over the alphabet \(T\). Alternatively, we can define a language as the set of strings generated by the grammar \(G\) (by starting at the start symbol and applying productions in every possible way).

In a regular grammar, the productions are of the form (with \(a\) being a terminal and \(A\) and \(B\) being non-terminals; productions \(A \rightarrow a\) and \(A \rightarrow \epsilon\) are also allowed, so that derivations can terminate):

\begin{equation*} \label{eq:prod-rea} A \rightarrow aB \end{equation*}

or

\begin{equation*} \label{eq:prod-reb} A \rightarrow Ba \end{equation*}

A linear grammar allows both forms within the same grammar, but a regular grammar must use one or the other.

However, instead of writing productions as above, regular grammars are often written using regular expressions (abbreviated re, regex or regexp). These are commonly used to define the structure of the basic words in a programming language.
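For example (a small illustrative grammar, not part of the definitions above), identifiers consisting of a letter followed by letters or digits can be described by the right-linear productions

\begin{equation*} I \rightarrow aJ \qquad J \rightarrow aJ \mid dJ \mid \epsilon \end{equation*}

(with \(a\) ranging over letters and \(d\) over digits), or much more compactly by the regular expression [a-z][a-z0-9]*.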

For any regular grammar, we can define a finite state machine (or finite automaton) which recognises sentences in the regular language. Commonly, such automata are used to divide a program text into individual words or tokens.
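As a minimal sketch (in Java; the method name and state encoding are made up for illustration), the identifier grammar above corresponds to a two-state automaton:

// A two-state finite automaton recognising identifiers of the form
// letter (letter | digit)* – a sketch, not production lexer code.
static boolean isIdentifier(String s) {
    int state = 0;                                // 0 = start, 1 = in identifier
    for (char c : s.toCharArray()) {
        if (state == 0 && Character.isLetter(c))
            state = 1;                            // first character: a letter
        else if (state == 1 && Character.isLetterOrDigit(c))
            state = 1;                            // rest: letters or digits
        else
            return false;                         // no transition: reject
    }
    return state == 1;                            // accept iff in the final state
}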

In a context-free grammar, the productions are of the form:

\begin{equation*} \label{eq:prod-cf} A \rightarrow (a|B)* \end{equation*}

Context-free grammars are used to define the structure of the sentences of a programming language.
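For example (illustrative), arithmetic expressions with nested parentheses can be described by

\begin{equation*} E \rightarrow E + E \mid E * E \mid (E) \mid \mathit{id} \end{equation*}

where the right-hand sides freely mix terminals and the non-terminal \(E\) – something no regular grammar can do, since matching arbitrarily deeply nested parentheses requires unbounded memory.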

Concrete Syntax

Concrete syntax is described by a grammar; typically a context-free grammar (with the lexical sub-parts possibly described by a regular grammar). It is the syntax used when you write programs or text in a language.

A grammar is a particular implementation of a syntax – a formal set of rules that describe the syntax. The same syntax can be described in many different ways: there are many C++ grammars (and nearly all of them are wrong), but just one standard syntax.

In designing a concrete syntax, important considerations are:

  • Ease of use for programmers (making it feel "natural")
  • Familiarity / similarity to other languages (easier learning)
  • Robustness (small typos shouldn’t drastically change the meaning of a program without warning; syntax errors should be easy to identify and recover from)

For example, Java is designed to have a fairly easy-to-use syntax (e.g., "class A implements B" is easy to read, understand and remember) and to be familiar to C and C++ programmers. The C familiarity has a negative impact on robustness because of C's terse syntax – parse error recovery is made difficult by the low level of redundancy, and some simple mistakes can have big consequences:

int i;
for (i = 0; i < 10; i++);   // ← whoops! the stray semicolon is the entire loop body
  System.out.println(i);    // indented, but not part of the loop: runs once, printing 10

Syntax design is a language design issue. Grammar design is a language implementation issue (though the language specification will typically also come with a grammar, specifying the syntax). Important grammar design considerations are:

  • Technology. If the grammar is to be used by a parser generator, it is typically limited by what the parser generator supports. These limitations can be quite severe (for example, left-recursive grammars (i.e., with \(A \rightarrow A B\)) can't be used directly with LL parsers – see the example after this list).
  • Readability. Various tricks to avoid technological deficiencies can make a grammar really hard to read (e.g., avoiding left recursion); using features such as priorities and the *, ?, +, | and {} operators can make a grammar easier to read.
  • Structure. Some grammars will tend to produce deeply nested trees on parsing; this can be annoying depending on how your compiler is implemented.
  • Simplicity and compactness. Should every syntactical pattern that repeats itself be factored out into its own production? Or should one try to minimise the number of productions at the cost of bigger productions? This will typically impact both readability and structure.
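For instance (a standard transformation, with illustrative non-terminal names), the left-recursive grammar

\begin{equation*} E \rightarrow E + T \mid T \end{equation*}

can be rewritten for LL parsing as

\begin{equation*} E \rightarrow T E' \qquad E' \rightarrow + T E' \mid \epsilon \end{equation*}

This accepts the same language, but is arguably harder to read and produces a differently shaped parse tree.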

Parse Trees

Parsing is the act of taking a text (e.g., a linear string of characters) and recovering its syntactic structure according to a grammar. A recogniser merely checks that a text conforms to a grammar; a parser also (at least conceptually) builds a parse tree that represents the structure of the text.

Parse trees are sometimes also known as concrete syntax trees. The root of a parse tree is the start symbol, the internal nodes correspond to grammar productions, and the leaves are terminals. Translating a parse tree back to a text is called unparsing. This is easily done by traversing the parse tree left-to-right, and printing the leaves.
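As a small sketch (hypothetical Java classes, assuming leaves carry their original text):

// Hypothetical parse tree: a leaf holds a token's text, an internal
// node has one child per symbol on the production's right-hand side.
interface ParseTree {}
record Leaf(String text) implements ParseTree {}
record Node(String production, java.util.List<ParseTree> children) implements ParseTree {}

// Unparse by traversing left-to-right and printing the leaves.
static void unparse(ParseTree t, StringBuilder out) {
    if (t instanceof Leaf leaf)
        out.append(leaf.text());
    else if (t instanceof Node node)
        for (ParseTree child : node.children())
            unparse(child, out);
}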

The structure of a parse tree follows directly from the grammar and the parsing process: for every parse tree node, there is a corresponding production in the grammar that was used to parse the text for that node, and each parse tree node has one child for every symbol (terminal or non-terminal) on the right-hand side of its production.

The parse tree traces the sequence of productions that was used during the parse – different parsing techniques may produce different trees. With some generalised parsers and ambiguous grammars, we get all possible trees – a parse forest.

A parse tree may also contain extra information, such as the source code locations of all the nodes.

Abstract Syntax

The abstract syntax tree (AST) of a program encodes just the information necessary to preserve the meaning of the original program text. The abstract syntax describes the structure of the abstract syntax tree – it can be defined using a regular tree grammar, or an algebraic data type or term (in Rascal, ML, Prolog, ...), or an object-oriented inheritance hierarchy of node classes (Java, C++, ...), or as an S-expression (in Lisp languages).

The abstract syntax tree can be used as an internal representation in a language processor, but it is not the only possible representation.

An abstract syntax can be generated by a grammar in the following way: For every non-terminal type, there is a corresponding abstract syntax type. Each type has one constructor (or node type) corresponding to each production in the grammar, with one child for every symbol in the production that is not a literal token (e.g., punctuation, keywords or spaces). If a constructor has only one child, of the same type, it can be removed (e.g., this would be the case for a parenthesised expression). This process can be done entirely based on the information contained in a parse tree. Translating a parse tree into a corresponding abstract syntax tree is called imploding the parse tree.
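As a sketch of the result (hypothetical classes, for a grammar with \(E \rightarrow E + E \mid (E) \mid \mathit{int}\)):

// Hypothetical abstract syntax as an object-oriented node hierarchy.
abstract class Expr {}
class Add extends Expr {                 // from E -> E + E; the "+" token is dropped
    final Expr left, right;
    Add(Expr left, Expr right) { this.left = left; this.right = right; }
}
class Num extends Expr {                 // from E -> int
    final int value;
    Num(int value) { this.value = value; }
}
// There is no node for E -> ( E ): a parenthesised expression has only
// one non-literal child, of the same type, so it implodes to just E.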

Given an abstract syntax tree, it is possible to reconstruct a parse tree or program text, given the original grammar – though the resulting program may be slightly different in terms of spaces and punctuation. This is called unparsing or pretty-printing (particularly if the output is nicely formatted). Parsing, imploding, pretty-printing and then reparsing may not yield the exact same parse tree as the original tree, but it should still implode to the same abstract syntax tree (otherwise there is a bug in your tool chain!).

Although the abstract syntax may be derived from a grammar, it can actually be useful to define it yourself, to capture the core constructs in the language. Multiple cases in the concrete syntax may be folded into the same constructor in the abstract syntax, and the abstract syntax may distinguish between cases that aren’t syntactically distinct in the concrete syntax (e.g., in the case of overloaded operators).

It may be entirely sensible to design a language around the abstract syntax first, and then later on add a concrete syntax – this was done in the case of Lisp (where they basically ended up using a representation of the abstract syntax as a concrete syntax).

Various phases in a language processor may change the abstract syntax tree, or use slightly different versions of the abstract syntax (e.g., after type checking, the nodes for variables include the type of the variable) – it is also possible to decorate or annotate the AST as processing proceeds. This adds extra information to the nodes in the AST, without impacting the structure of the abstract syntax (this can also be done with a parse tree – in fact, the location information in Rascal is an annotation on the parse tree node).
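One simple way to do this (a sketch reusing the hypothetical Expr classes above – not Rascal's actual annotation mechanism) is a side table mapping nodes to annotations, which adds information without touching the node classes:

// Decorate AST nodes with computed types, keyed by node identity.
java.util.Map<Expr, String> types = new java.util.IdentityHashMap<>();
Add sum = new Add(new Num(1), new Num(2));
types.put(sum, "int");        // recorded by the type checker
types.put(sum.left, "int");   // each node can carry its own annotation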

Important abstract syntax design considerations are:

  • Simplicity. Generally, your compiler tools will do a lot of work on the AST, and the fewer different cases you have to worry about, the better. For example, if the processing of overloaded functions and operators is basically the same (which it is, to some degree, in C++), you may want to have only one AST node type to cover both. Having a lot of unnecessary nodes in the tree can be annoying as well, and may make processing slower.
  • Good correspondence with the constructs of the language.
  • Availability of information during processing. Some information that can be computed from the tree (such as type information) might be encoded directly in the tree (at least at later stages) for easy processing.
  • Being end-user friendly or familiar to most programmers isn’t an important consideration – the abstract syntax may be radically different from the surface concrete syntax if that helps the compiler writer.
