Wiki

Clone wiki

inf225 / 2017 / ex / Exercise 2

Note: Feel free to skip stuff you don't understand yet; you can catch up with it later. And do also feel free to study the manuals and experiment on your own if you have the time. Please let Anya know if you find any mistakes (yes, something being unclear or incomprehensible is a mistake in the exercise, not in you).

2.1 Introduction

Finish 1.3 and 1.4 of Exercise 1 if you haven't yet done so.

2.2. Simple Logic

  • Create a grammar for simple logic expressions without variables (e.g, "true or false", "true", "true xor true", "!true").
    • You need a single non-terminal LogicExpr, with multiple productions.
  • Add variables to the grammar. You need a lexical non-terminal for identifiers (non-empty sequence of letters for example). You should now be able to type something like "a or b", "!b". (Note that the variables don't really have any meaning yet – we can say "!b" but we won't know if "b" is true or false.)
  • Add other logic operators (e.g, implication "=>", "forall")

Put the grammar in a Rascal module LogicExpressions. Import it from the Console (import LogicExpressions;), and try a few simple expressions:

(LogicExpr)`true`
(LogicExpr)`!b or c`

2.3. Logic with precedence

Classic logic operators have a defined order of precedence where AND binds tighter than OR; i.e., a OR b AND c = a OR (b AND c).

In Rascal, you can express precedence by using < between alternative productions rather than |. For example:

syntax E = E * E
         > E + E   // * has higher priority than +

To enforce left associativity (i.e, that a AND b AND c is interpreted as ((a AND b) AND c) rather than (a AND (b AND c))), put left in front of the production rule:

syntax E = left E * E
         > left E + E
  • Add precedence and associativity to your Rascal grammar. Try it with expressions like a or b and c, a or b or c, and so on.

Encoding with Extra Non-Terminals

You can also encode this by using additional non-terminals. If AND has higher priority than OR, you know that an AND expression can't have OR expressions as operands – that would imply that (a OR b AND c or d) = (a OR b) AND (c OR d). You will need one non-terminal per precedence level, plus one for atoms (such as true/false/identifiers):

  • AtomExpr – has no arguments
  • AndExpr – can be an AtomExpr, or an AND operator with other AndExpr (but not OrExpr) as arguments
  • OrExpr – can an AndExpr or an OR operator with other OrExpr as arguments

To account for associativity, with further restrict the operands. E.g, if we want a AND b AND c to mean ((a AND b) and c), we need to make sure that and expressions are only allowed to have other and expression on the left side (for left associativity). This means that an AndExpr would look something like this:

syntax AndExpr = AtomExpr
               | AndExpr "AND" AtomExpr;  // where AtomExpr is the next lower level in the precedence order

(See page 44–45 in Programming Languages Concepts and Constructs)

  • Make a copy of your grammar (in the same file), rename the non-terminals and encode precedence and associativity in the way described above. Try it with expressions like a or b and c, a or b or c, and so on.

Testing the parser

  • Make a test function for this, that takes a string and parses it using both grammars and checks that both parses either succeed or fail. For this to work, you need the following bits of Rascal-fu:

    • To parse a string, add import ParseTree; at the top of the file, and call the parser like this: pt = parse(#NonTerminal, string), where pt will be the parse tree result, NonTerminal is the name of the non-terminal in the grammar, and string is a string (you can also use a location value in order to parse a file).

    • This function tries to parse a string and returns true if it succeeds and false otherwise: bool parses(str s) = parse(#Expr, s)?; (something that returns just true or false rather than a parse tree is typically called a recognizer)

    • A function declaration looks like this: bool grammarTest(str s) { ... }

    • if and logic operators in Rascal are similar to in Java

Seeing the structure of the tree

TODO: look for even better ways of doing this

When you look at the parse tree, Rascal prints it back to you just like the original string – so you can't really tell if the associativity and such is really correct. You can check this in different ways:

  • (Easy, but hard work) Look at the raw parse tree: just wrap you value in a list or something, and Rascal will drop its fancy formatting: [parse(#Expr, "a or b")]

  • (Decent solution) Make a small function that adds parentheses to the tree. This would look like this (replace Expr with the name of your non-terminal – you'll also need to add parentheses to the syntax definition to make this work):

Expr addParens(Expr t) = bottom-up visit(t) {
    case (Expr)`<Expr e1> + <Expr e2>` => (Expr)`(<Expr e1> + <Expr e2>)`
    case (Expr)`<Expr e1> * <Expr e2>` => (Expr)`(<Expr e1> * <Expr e2>)`
};
  • (Another decent solution, but extra work) Make a pretty-printer (see later exercises).

  • (Needs more skills) Evaluate the expression and see if the result is correct.

  • Any other ideas? (You can compare trees recursively, for example)

If you have a way of showing the tree structure (e.g., by adding parentheses), you can compare the structure of trees parse with your two different grammars (they should be equal). For example: unparse(addParens(parse(#LogicExpr, "1+2+2+3+4"))) == unparse(addParens(parse(#LogicExpr2, "1+2+2+3+4")))

2.4 List of Things and Regular expressions vs Context-Free Grammars

Regular expressions are a way to express regular grammars, and regular grammars are a subset of context-free grammars. So, all regular expressions can be rewritten in the style of a context-free grammar.

Rascal's context-free grammars include many of the same operators that are used in regular expressions, which makes it a bit easier to write productions for lists of things. For example, the regular expression a*b* matches zero or more a's followed by zero or more b's. We can write a similar production in Rascal:

syntax Foo = A* B*;

Without such operators, you'd have to express lists in this way: "A list of A's is either empty, or it is an A followed by a list of A's:

syntax Foo = ListOfA ListOfB;
syntax ListOfA = // empty
               | A ListOfA
               ;
syntax ListOfB = // empty
               | B ListOfB
               ;

Optionals (e.g., a?) is handled in a similar way: syntax OptionalA = /* empty */ | A;.

Lists with separators are similar. In Rascal, there is a short-hand for this: {A ","}* is a list of zero or more A's, separated by ,. Without the shorthand syntax, we'd have to write:

syntax CommaSeparatedListOfZeroOrMoreAs =
    /* empty */
  | CommaSeparatedListOfOneOrMoreAs
  ;
syntax CommaSeparatedListOfOneOrMoreAs =
    A
  | A "," CommaSeparatedListOfOneOrMoreAs
  ;

Exercise

  • Write grammars for the following, with and without using Rascal's shorthand notation for lists:
    • The sheep language; "bæ", "bæææ", "bææææææ", ...
    • A list of comma-separated function arguments: (), (a), (a, b, c), ....
    • The regular expressions ((xy*x)|(yx*y))?
    • Numbers, consisting of one or more digits

2.5 More grammar exercises

Write grammars for:

  • Palindromes over the alphabet {a, b}; e.g., aa, abba, babbab, ...
  • A sequence of zero-or-more as followed by zero-or-more bs
  • A sequence of zero-or-more as followed by fewer bs
  • Matching parentheses, e.g., ((()))()()
  • Matching parentheses and brackets, e.g., (([[]])[][])

2.6 Advanced

(From Appel's Modern Compiler Implementation in ML)

  • Write a grammar for English sentences using the words time, arrow, banana, flies, like, a, an, the, fruit and the semicolon. Be sure to include all the senses (noun, verb, etc. – e.g. flies can be both a verb and a noun (plural of fly)). Then show that this grammar is ambiguous by exhibiting more than one parse tree for "time flies like an arrow; fruit flies like a banana".

2.7 The Simple Language

We'll now experiment a bit with grammars and syntax in the Rascal meta-programming language.

  • Note: Whenever you edit a Rascal program in Eclipse and save it, it will be automatically reloaded in the console. (Unless it contains errors. Who knows what will happen then.)

  • Note: Rascal may sometimes pop up an annoying (syntax) error message; in most cases you can just click OK, and continue editing your program, and eventually it'll stop appearing...

Consider again the grammar of the Simple language from Exercise 1:

start syntax Program = Expr;

syntax Expr
    = ID                  // variables
    | NUM                 // integers
    | Expr "+" Expr       // addition
    | Expr "*" Expr       // multiplication
    | ID "(" Expr ")"     // function call
    | "(" Expr ")"        // parentheses
    ;

// identifiers
//    y !<< x means 'x' must not be preceeded by  'y'
//    x !>> y means 'x' must not by followed by 'y'
// so, this means that an identifier is a sequence of one
// or more letters or underscores, with no additional
// letters or underscores before or after
lexical ID = [a-zA-Z_] !<< [a-zA-Z_]+ !>> [a-zA-Z_];

// numbers
lexical NUM = [0-9] !<< [0-9]+ !>> [0-9];

Edit the Simple.rsc module again, and add the following code:

public str pretty(Program p) {
    if(Expr e := p)
        return pretty(e);
    else
        return "[unknown program: \"<e>\"]";
}

public default str pretty(Expr e) {
    return "[unknown expr: \"<e>\"]";
}


public str pretty((Expr)`<ID i>`) {
    return "<i>";
}

public str pretty((Expr)`<NUM n>`) {
    return "<n>";
}

public str pretty((Expr)`<Expr e1>+<Expr e2>`) {
    return "(<pretty(e1)>+<pretty(e2)>)";
}

public str pretty((Expr)`<Expr e1>*<Expr e2>`) {
    return "(<pretty(e1)>*<pretty(e2)>)";
}

public str pretty((Expr)`(<Expr e>)`) {
    return "(<pretty(e)>)";
}

public str pretty((Expr)`<ID f>(<Expr e>)`) {
    return "<f>(<pretty(e)>)";
}

This defines a simple pretty-printer for Simple programs. Important things to note:

  • The functions are overloaded with pattern matching on the arguments. In most of the cases, we use concrete syntax patterns.
  • The pattern (Expr)`<Expr e1>*<Expr e2>` denotes something which is an Expr; the concrete syntax of what we're matching is contained in the back-quotes. The sub-expressions <Expr e1> will match any Expr and bind it to a meta-variable e1. The * is the multiplication symbol from our Simple grammar.
  • Inside the strings we return, we use string interpolation. Anything within <...> in a string is a Rascal expression which is evaluated and converted to a string. Converting a parse tree (as in ID i) yields its program text; for the other cases we do a recursive call to pretty.
  • You can try out a concrete syntax pattern using the pattern matching operator := in Rascal:
rascal>(Expr)`<Expr e1>*<Expr e2>` := `x*y`
bool: true

rascal>if((Expr)`<Expr e1>*<Expr e2>` := `(a+b)*y`) println(e1);
(a+b)
ok

2.7.1 Fixing Ambiguities

First, try the pretty function on some programs. Does it always work, or are there things for which you get an error? In particular, try it with expressions such as `a+b*c`.

The errors you (hopefully) get is due to some expressions being ambiguous. You can explore the ambiguities with Dr. Ambiguity:

rascal>import Ambiguity;
ok

rascal>diagnose(`a+b+c`)
list[Message]: [
  info("Ambiguity cluster with 2 alternatives",|stdin:///|(10,5,<1,10>,<1,15>)),
  info("The alternatives use the same productions",|stdin:///|(10,5,<1,10>,<1,15>)),
  info("The alternatives have the same production at the top: Expr =  Expr  \"+\"  Expr ",|stdin:///|(10,5,<1,10>,<1,15>)),
  error("This rule [Expr =  Expr  \"+\"  Expr ] may be missing an associativity declaration  (left, right, non-assoc)",|stdin:///|(10,5,<1,10>,<1,15>))
]

  • a) Try it yourself with a few expressions. The problems are due to operator precedence and associativity, as we've discussed in the lectures.

The grammar should have different priorities for addition and multiplication. Priority is expressed by using > instead of | when separating alternatives in the grammar. This gives multiplication higher priority (or precedence as it is also called):

    ...
    | Expr "*" Expr       // multiplication
    > Expr "+" Expr       // addition
    ...

Consult the Rascal priority rules for more information.

  • b) Introduce priorities into the grammar, and try the ambiguity diagnosis again on a few expressions. Is the situation better? Does `x+y*z` work? What about `x+y+z`?

Associativity is expressed using the left, right or non-assoc keywords. You can use it either on a single alternative in a production:

    | left Expr "*" Expr

or multiple ones, if you want operators to be mutually associative:

    | left (Expr "+" Expr | Expr "-" Expr)

  • c) Make multiplication and addition associative.

  • d) Use pretty to check the nesting of expressions; it will put parentheses around each expression, so you can easily see which ones belong together. Change the associativity to right and/or non-assoc and study the effect.

  • e) Add alternatives for subtraction and division to the grammar, and make the mutually associative with addition and multiplication.

  • f) To make pretty work for subtraction and division, you'll have to add two more overloaded cases of the pretty function.

2.7.2 Spaces and layout

You may have noticed that you can't write programs with spaces in them. That's because you haven't defined syntax rules for spaces. You do this with the layout construct in the Rascal syntax definition.

For example:

// layout is lists of whitespace characters
layout MyLayout = [\t\n\ \r\f]*;
Each terminal or non-terminal in the right-hand side of a production may be separated by layout. This rule says that layout is zero or more tabs, newlines, normal spaces, carriage returns or form feeds. This is typical in modern programming languages.

Comments are typically also considered layout.

We will look into this more in the future.

2.8 Grammar ambiguities

The following grammar is motivated by declarations in C:

Declaration ::= Type Declarator ";"
       Type ::= "int" | "char"
 Declarator ::= "*" Declarator
              | Declarator "[" number "]"
              | Declarator "(" Type ")"
              | "(" Declarator ")"
              | name
1. Prove the syntactic ambiguity of this grammar by finding a string that has more than one parse tree. Draw the parse trees.

  1. The constructs [ number ] and ( Type ) can be thought of as being postfix operators with a declarator as an operand. Suppose that * has lower precedence than these operators. White an unambiguous grammar that generates the same strings as this grammar and makes it easy to identify the components of a declarator.

  2. Suppose that the first production for Declarator is changed to make * a postfix operator. Why is the resulting grammar unambiguous?

Updated