Clone wiki

inf225public / Overview

What is a language?

A language is a form of communication, with structure (syntax) and meaning (semantics).

For our purposes:

  • A formal, human-made language
  • Has a grammar
  • Can be parsed
  • Can be processed by a computer

Formal language definition

  • It's described by a grammar: a tuple <N,T,P,S>

    • Non-terminals – kinds of phrases
    • Terminals – tokens, words
    • Start symbol
    • Productions – depending on kind of grammar, for example
    • Context-free N := (N|T)*
    • Regular: N := T N or N := N T
  • It's all the strings over the terminals which are valid according to the grammar

Syntax

Languages and Grammars

Discussed in: Lession 1, Lession 2, Exercise 1

  • Context-free and regular grammars – used in defining the syntax of formal languages

  • Grammars describe the concrete syntax of a language

  • Parsing determines if a text is legal with respect to the grammar and builds a tree/structure for further processing

  • Context-free grammars are typically written in (some variant of) BNF – Backus-Naur Form; regular grammars are typically written as regular expressions.

Classes of languages

  • Regular – can be parsed by a finite-state machine; extremely efficient

    • Can't specify that things should be nested. E.g., count parentheses – this would require a stack.
  • Context-Free – can be parsed by a push-down automaton (FSM + a stack); very efficient

    • Can't specify that something should be declared before use (incl. type checking), or support user-define delimiters (e.g., as in MIME multipart email messages).

    • Solution: do it anyway, and check after parsing.

  • Context-Sensitive – Linear-bounded non-deterministic Turing machine; complicated

  • Recursively enumerable – turing machine; very complicated

Context-Free Languages

  • Used in programming languages – most programming langs more or less context-free

    • C++ is slighty outside

    • Many languages are in subclass LALR

    • Grammars in standards often don't correspond to those in the implementations, and might be internally inconsistent

    • In practice, parsers for CFGs in realistic languages are often hand-implemented. This is due to the need for good error reporting.

  • A context-free language has a context-free grammar. In practice, the language is context-sensitive, but this is dealt with after syntax processing.

  • The syntax of a CFG defines the structure; says nothing about the semantics.

Regular Languages

  • Can't defined languages with nested constructs.

Parsing

Discussed in: Exercise 1

  • Scanner/lexer – for regular grammars. Can be implemented using a finite-state machine (FSM)
  • Parser – for context-free grammars. Can be implemented using a push-down automaton (PDA – an FSM with a stack)
  • Scannerless parsing – used in Rascal and SDF2/SGLR. Useful in language composition and language extension, and for some language like Markdown, TeX, various Wikis etc.
  • Parser generators – create an executable parser automatically from a grammar
  • Generalised parsers – ([GLR][], GLL) can parse any context-free grammar. Rascal, SDF2/SGLR and Bison are systems with generalised parsing support. Useful in language composition and language extension.
  • Classical parsers ([LALR][], [LL][]) can only parse a subset of context-free grammars – and combining or extending grammars may require extensive rewrites. Some example systems are Yacc and Bison (both LALR) and ANTLR (LL(k) / LL(*)).

Parse Trees

Discussed in: Lession 2

  • A structured representation of a text, with respect to a grammar (and the parsing process)

  • Contains details of the concrete syntax

Abstract Syntax Trees

  • A structured representation of a program, without
    • details related to concrete syntax (which is needed to unambiguously encode the program as text),
    • details related to the parsing scheme or the parsing process.
  • May be derived from the grammar (concrete syntax);
  • Or, may be designed as a program representation unrelated to the concrete syntax.

Generalised parsing

  • Parses the entire class of context-free grammars

  • Produces a parse forest

  • Can be used for context-sensitive languages:

    • Make a context-free grammar with ambiguities

    • Resolve ambiguities later

  • Reasons why you'd use it:

    • Grammars can be a lot nicer, useful in a standard or a textbook

    • You can combine different languages, for example in a meta-programming system

    • Tradeoffs: can be slower (n³), parse forest generation can take time

    • Grammars are prettier, might be easier to write and easier to modify

    • Non-generalised grammars might need a lot of changes to do a small modification

    • CFG are closed under composition. If A, B are CFGs, then A + B is a CFG. Not true for restricted classes like LALR.

    • But: parse results may be ambiguous, which is normally undesirable. Might be hard to debug

Ambiguities

Discussed in: Exercise 2

  • An ambiguous grammar has more than one valid parse tree for a valid input. For example, a + b + c – should this be (a + b) + c or a + (b + c)?
  • If a grammar is ambiguous, generalised parsing produces a parse forest instead of a tree – all the possible trees that can be obtained.
  • Simpler parsing schemes typically don't allow ambiguous grammars.
  • The union of two unambiguous grammars can be an ambiguous one.
  • Ambiguous grammars are difficult to detect statically (undecidable in general).
  • In Rascal and with SDF2/SGLR, you'll find amb nodes in the parse tree if there is an ambiguity.
  • Expression ambiguities, like with a + b + c are typically resolved with operator precedence rules and associativity rules.
  • Other cases may be more tricky, such as with the dangling else problem.

Precedence / Priorities

Discussed in: Exercise 2

Scanners vs. Scannerless

Discussed in: Lesson 6

Spaces and Layout

Discussed in: Lesson 6, Exercise 2

Domain-Specific Languages (DSLs)

Discussed in: Lesson 5

  • What – targeted at and restricted to domain, has a language interface.
  • Why – easier programming, more efficient or secure, possibly better error reports.
  • Why not – lots of implementation work, language fragmentation, learning/training issues, less tooling, troublesome interoperability.
  • How – external DSL (separate programming language), internal DSL (language-like interface as a library).

Semantics

  • Syntax gives structure, semantics gives meaning

  • Can be specified formally, for example using (big-step) operational semantics. Can also be done informally, e.g., using English text, or via a reference implementation.

  • The nicer, more structured your implementation is, the better it specifies the semantics.

Types

  • Used to validate programs in static semantics

  • Used to specify data layout (and for safety checks) in dynamic semantics

  • We've looked at

    • Plain data types; ints, etc.

    • Function types

    • Structure types (records) – a little bit

  • We can have various degrees of strong or weak typing (strong being usual nowadays), and static and dynamic typing (both are popular nowadays).

Dynamic semantics

  • Dynamic semantics gives the meaning of the program at execution time.

  • We've specified it using evaluators

  • Used to define code generation in a compiler

  • Dynamic semantic rules tell you:

    • How things are computed

    • How environment and store are propagated throughout the computation

Static semantics

Includes:

  • Additional constraints on what a legal or meaningful program is

  • Rules for which names refer to what; names are linked to declarations – this is used to work around the context-freeness of the syntax.

...

  • Semantics "executed" at compile time, using types instead of values.

  • Reject meaningless programs

  • We've dealt with static semantics in typecheckers; it "evaluates" the static semantics, gives you yes/no answer to wellformedness / legality, and possibly a annotated program.

  • Only in static(ally typed) languages, not in dynamic languages like Python or Lisp.

  • Important to code generation – used to figure out the memory layout and size of variables, and how they should be allocated.

Environment and Store

  • Environment contains bound variables – variables that correspond to a declaration

  • Variables may be bound to values or locations (if we have a store).

  • If every step in the program execution produces a new environment, we can change the value of variables without having a store. But we'd need a store to have references.

  • Store is an abstraction of the machine memory.

    • Store has locations

    • Each location contains a value

    • Each program step may update the store, resulting in a new store

    • (In a language with a store) Variables are typically bound to store locations.

    • Get value from store on variable use, put value in store when assigning to variable

    • This is used for references. When the value of a variable is a location, you have references in your language.

    • In real life: store is heap memory and stack. Locations are addresses, either relative to stack pointer, or on the heap (main memory). Stack is used for temporary / local variables. Heap is used in dynamic memory allocation (e.g., new in C++ and Java, malloc in C).

    • You could use an array or a map to implement it. (Array would most machine-like)

    • Static vs. dynamic scoping – what is the environment used when evaluating a function?

    • Compilers typically implement the enivornment as a symbol table – either a simple hash table, or a more complex structure with support for nested scopes.

  • Garbage collection is going through the heap (store) in order to find values that can't be accessed anymore (by any available location). Often done by a mark and sweep algorithm.

  • Reasoning about semantics is very difficult if you allow multiple variables to point to the same location. This is called aliasing. Compilers will try to do alias analysis in order to do optimisations. This is not a problem in languages with immutable values (e.g., Haskell), or without aliasing (e.g., Magnolia).

Updated