Clone wiki

inf225public / Exercise 4

Going Static and Imperative

We started out defining what is more or less a functional language (except that we're lacking anonymous functions). We'll now more in the direction of imperative languages like C, Pascal, Java, etc.

Code Base

The code is in the public repository, in the inf225ex4 project. You'll recognise most of the code from the lectures; but I've completed and debugged the code we made in the lectures, so that it is hopefully a reasonable basis for further work. The files are:

First Steps

  • Check out the project
  • Run the tests
  • Make a few more example programs and add them to the tests. In particular, complex interactions with multiple functions and deeply nested lets haven't been tested well, so you might find some bugs there.

The Exercise

Start with any one of the improvements below (they are not presented in any particular order), and implement it. Then, If you're having fun, or if you're having trouble, or if you are in any way uncertain about how to do the other ones, implement the other ones too. Make sure you study all of them carefully, even if you don't do the actual implementation. This stuff is on the curriculum, and you could get asked about it on the exam...

All of the improvement projects are designed to start from the same basis. To avoid confusion, you should, before starting the implementation work:

  • Create a new Rascal Project for each project, and copy the base files there, so you have a clean slate to start from; or

  • Keep everything in one project, and attempt to integrate the features and make the ultimate programming language.

Improvement A: Java-like Syntax

Our users on the interwebs are bored with the academic-looking syntax of our little language. Let's improve it, to be more Java-like and ‘easy’.

We should introduce a split between expressions and statements. Statements deal with control flow (sequential composition of actions (;); choices (if), loops etc.), while expressions deal with values. We'll also allow variable declarations as statements.

Syntax

  • First, we'll need to change our outdated function definition style. Let's make it look like this instead:

    int f(int a, int b) {
      stat1;
      stat2;
    }
    
  • Change the Decl syntax. It should now start with the return type, followed by the name, the parameter list (as before), then a list of statements (Stat*) between braces. You may also add global variables if you want (e.g., int pi = 3).

  • Introducing a new non-terminal Stat in the grammar.

  • Make if a statement instead of an expression. You can keep roughly the same syntax if you want (allow a list of statements in the then and else branches), or use the C style:

    if(c)
        stat1;
    else
        stat2;
    
  • If you used C-style ifs, you need a “compound statement” – a way to treat a list of statements as a single statement. The syntax for this is usually "{" Stat* "}".

  • Drop the sequence expression e1; e2. We'll do this using lists of statements instead.
  • We'll keep assignment as an expression, like it is in C. But we'll need a statement for executing expressions: e; It's best to put the semicolon as a terminator on the statement. You typically don't need a semicolon on control flow statements, since they are either terminated by a keyword (Pascal/Algol-style), or by a statement (C/C++/Java style).
  • Add a loop statement (while, perhaps?)
  • Add a return statement, which should take an expression.
  • Add a variable declaration statement. It should look like this: int x = e;. We'll drop the let-expression, instead, each variable declaration will have scope until the end of the statement list it occurs in.
  • Let's also change programs, so that they only consist of a Decl* list. The entry point for the evaluator could then be the function named main (or we could specify which function we want).

An example of a complete program:

int fac1(int n) {
  if n < 2 then
    return 1;
  else
    return n*fac1(n-1);
  end
}

int fac2(int n) {
  int r = 1;
  while n > 0 do
    r = r * n;
    n = n - 1;
  end
  return r;
}

int main() {
  int x = fac1(5);
  int y = fac2(5);

  return x == y;
}

(As you might imagine, it's difficult to write a correct program without typechecking. I've fixed a few bugs in the above program after running a typechecker on it. ;) )

Important Note: We now run the risk of having another ambiguity in the syntax: return (a+b); could be interpreted both as a return statement, and as a call to a function named return. The best solution to this is probably to make the keywords reserved, so they can't be used as identifiers. First, we need to make a list of keywords:

    keyword Keywords = "return" | "if" | "then" | "else" | "end" | "while" | "do" | "int" | "str";

Then we can say that an identifier can not be anything that matches Keywords:

    lexical ID = ([a-zA-Z_] !<< [a-zA-Z_] [a-zA-Z_0-9]* !>> [a-zA-Z_0-9]) \ Keywords;

The backslash operator eliminates whatever is to the right of it from the set of possible sentences allowed by the grammar fragment to the left.

Static Semantics

Now we have to update the type checker.

  • First of all, update the TestTypeChecker tests to use the new syntax. Perhaps also add a few new tests.

  • Then, change the checking of programs and top-level declarations. Note that processing of top level things is split into two: first we declare everything so that all the functions are in the environment, then we check the bodies. This allows calls between functions regardless of the order they're mentioned in.

  • You need a new set of check functions for statements:

    public tuple[Stat, Env[Type]] check(stat:(Stat)`...`, Env[Type] env) { ... }
    

Statements don't have values, so they don't have a type. But some statements can be variable declarations, so they might update the environment, hence we return an environment. We also return the typechecked statement, for reasons that will hopefully become clear at some point (note: reconstructing a Stat* from a list[Stat] turns out to be a bit tricky; better just return the original statement (stat) for now).

  • I recommend that you also add a check function for lists of statements (unless you have C-like if(c) { ... } syntax, where statement lists only occur in one place):

    public tuple[Stat*, Env[Type]] check(Stat* stats, Env[Type] env) { ... }
    

    or

    public tuple[list[Stat], Env[Type]] check(list[Stat] stats, Env[Type] env) { ... }
    

When processing a list of statements, loop over the statements and update the environment on each step (check the typechecking of programs if you need inspiration for the loop). *(Note: Stat* is the parsetree of a list of statements, and is distinct from list[Stat].)

  • Add typechecking rules for all the different statements. Usually, the body of a loop and the branches of an if are separate scopes, so you need to do enterScope/exitScope on the environment.

  • Typechecking return e; is a bit tricky, since you're supposed to check that the returned expression matches the declared return type. A possible solution: declare a fake variable __RETURN__ with before you check the function body, give it the proper return type, and when you hit the return statement, you can check the type of the expression against the fake variable.

Dynamic Semantics

  • Update the syntax in the tests.

  • The necessary changes to the evaluator follow the same pattern as for the typechecker. Note that both expressions and statements can update variables, so both must be able to update the store. Since statements can declare variables, they must also return an environment:

    public tuple[Env[Location], Store] eval((Stat)`...`, Env[Location] env, Store store) { ... }
    
  • Again, we need one eval function for each statement, and a loop or a utility function for processing statement lists.

  • Think carefully about scoping. Any time you enter or exit a new statement list, you enter or exit a scope.

  • Also think carefully about the store. The store should always be propagated along the flow of control – from one executed statement/expression to the next.

  • Return is even more tricky than for type checking. Not only do we have to transplant the return value from (potentially) a deeply nested statement, but we also have to avoid continuing execution with the next statement. The easiest fix is to use exceptions. Declare a new data type data Result = Result(Value val, Store store);. Surround the code that evaluate a function body with:

    try {
       ...
    }
    catch Result(rVal, rStore): // be careful, if the names in the pattern 
                                // are already bound, this will never match
      // do something with the value here
    

    then let the return evaluator simply do throw Result(val, store);. (yes, we need to propagate the store; the environment will be discarded anyway, so we can drop that)

    Once we make a compiler, there will be a more elegant solution for this.

Improvement B: Anonymous Functions and Closures

I lied. We'll also do one functional programming feature.

  • Check the Wikipedia page on anonymous functions, and the function semantics (§3.4) in the Notes on Evaluators and Dynamic Semantics.

  • Add syntax for anonymous function expressions (for example, "fun x,y,z => e", or "fn x,y,z => e" (ML style), or "lambda x,y,z: e" (Python style)). You probably also need to add type declarations to the parameters ("fun int x, str, y, int z => e"); but you can figure out the return type from the expression.

  • Update 12/10: You should probably also change the function call syntax to accept an expression as a function. E.g. (fun x => x*x)(2). Remember to update the corresponding check() and eval() functions.

  • Add an evaluation function for function expressions (should return a Fun(paramList, body, env) value; see the define function in the evaluator for how it's done with top-level functions.

  • Add a typechecking function for function expressions. It should recursively check the body and return the type of the function (see define and check for Decl in the typechecker).

  • Test, test, test!

  • Read the blue box on "The Power of the Simpl-Exp Language" near the end of the the Notes on Evaluators and Dynamic Semantics. Cool, eh? Try it. (Since this evaluator has a store, you should also be able to make mutable structures.)

  • Note: The fact that the anonymous function captures its surrounding environment, giving it access to the variables in the surrounding scope (even after that scope has been exited), makes it a closure – a closed package of everything you need to evaluate the function. This won't work if we combine it with stack-allocated variables, so it's less common in non-functional languages.

Improvement C: Simple Garbage Collector

Our current evaluator is horribly wasteful with the store. It'll allocate a new storage location for every variable in every new scope and never free them. As long as we don't have references in our language, adding a little garbage collector isn't too difficult.

The working principle is:

  • For any location in the store, if there is a variable somewhere which has that location allocated, we should keep it. Otherwise, we can free the location.

  • Any variables in the current environment, are of course active variables we need to consider.

  • There are also possible a bunch of inactive environments – from functions that have been called, and have called another function which have not returned yet. (Remember that the evaluation of a call happens in a different environment, disconnected from the environment of the caller.)

  • We need to get access to all the environments, so we can find all the variables.

What to do:

  • Env should no longer be just a list of scopes (alias Env[&T] = list[Scope[&T]];), but should also include a link to the next environment (the environment of the caller). For example:

    data Env[&T] = Env(list[Scope[&T]] scopes, Env[&T] chain) | Env(type[&T]);
    

    Note (14/10): There's a bug in Rascal's type system which will give you errors like Expected map[str, void], but got map[str, Type] when you wrap the environment in the Env(...) constructor. Storing the type of the environment in the tail element of the chain will fix this. newEnv() already takes the type as an argument, you can stick this in the chain tail when you construct the environment. E.g., public Env[&T] newEnv(type[&T] typ) { return Env([()], Env(typ)); }

  • You'll have to change all places in Environment.rsc that uses env[...] to env.scopes[...]. For enterScope/exitScope/etc, just to env.scopes = [...]; return env;. When you create a new environment, fill in the chain with Env(typ) – this terminates the chain.

  • Next, add a function Env[&T] link(Env[&T] env, Env[&T] chain) which puts links the second argument to the first. Note: this doesn't work, due to type issues in Rascal. Just to funEnv.chain = env; to make the link instead.

  • Then, when you do a function call, link the current environment to the funEnv used to evaluate the body.

  • Finally, write your garbage collector. It should take a store and an Env[Location], follow the chain of environments and the list of scopes in all environments, and collect a set[Location] of locations that are in use. You can than proceed through the store, set all unused locations to Int(0) (or add a special Void() value for this purpose), and return the garbage collected store.

  • Remember to also look inside the environments store in functions in active locations in the store.

  • Set it up so that the garbage collector is called at regular intervals, perhaps for every few variables allocated.

  • Test, and see that the evaluator still behaves correctly.

We still don't really save any space, since the store is an array which keeps growing. A better strategy is perhaps to make the store into a map[Location, Value] from which you can delete unused locations.

Reusing unused locations would perhaps make sense. This is a bit more complicated (you could make a list that keeps track of freed locations). Moving variables to new locations is sometimes done in real garbage collectors, but won't work with our environments.

Improvement D: Local Variables on the Stack

Another trick to save memory space, without needing a garbage collector, is to put local variables on the stack. Whenever a new scope is entered, the variables are pushed to the stack; on exit, the variables are popped, immediately deallocating the memory.

To make this work, we need to split the store into a memory and a stack:

data Store = Store(list[Value] mem, list[Value] stack);

The existing functions in Store.rsc should mostly use store.mem instead. Then, you need to add a push and a pop function. Push should probably work like allocate, and return a new location, which you then need to set using put. You can push to a list using [*store.stack, Int(0)] and pop using store.stack[..-1] (or just store the current top of the stack as an int).

public tuple[Location, store] push(Store store);
public store pop(Store store);

It would perhaps also make sense to allocate/deallocate multiple stack locations in one go (e.g., by adding an int argument to push/pop).

Then, you need to update function calls and variable declarations to use push to allocate, and pop to deallocate after the body has been evaluated. Top-level declarations are still handled using the normal allocate function.

Note (14/10): To avoid having separate get/set functions for the stack, and keeping track on which variables are where, you could adopt a convention like "negative locations are on the stack".

If we add references or pointers to the language, we may get into trouble if we pass around references to something from the stack, because the values will be lost when the stack is popped. This is a not entirely uncommon problem in C and C++.

Updated