Lexer

The lexer is built on a tokenizer, which here is the practical implementation of a deterministic finite-state automaton defined in conf/lexer_rules.json.

Each state of the automaton is called a node and has a set of paths P.

A path (or pattern) p is a set of transitions that all lead from the node owning P to one and the same target node.

Each path is represented by a regular expression that is only ever applied to a string x of length 1, where x belongs to X.
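The node and path definitions above can be sketched in Python as follows. This is only an illustration: the `Node` class, its `step` method and the node names "number" and "access" are hypothetical, and the real layout of conf/lexer_rules.json may differ.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """One state of the automaton (hypothetical representation)."""
    name: str
    # Each path maps a one-character regular expression to a target node name.
    paths: dict = field(default_factory=dict)

    def step(self, x):
        """Return the target node name for the single character x, or None."""
        if len(x) != 1:
            raise ValueError("a path is only applied to a string of length 1")
        for pattern, target in self.paths.items():
            if re.fullmatch(pattern, x):
                return target
        return None


# Assumed fragment: from q0, a digit leads to "number" and a dot to "access".
q0 = Node("q0", {r"[0-9]": "number", r"\.": "access"})

print(q0.step("7"))  # number
print(q0.step("."))  # access
print(q0.step("@"))  # None
```

Because the automaton is deterministic, at most one path of a node should match any given character, so the order in which the paths are tried does not matter.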


Know your automata

The tokenizer is the combination of all the following automata, using q0 as the start state.

Notes:

  • Each time a transition takes place, the next character of the input is used as x.
  • Transitions only evaluate one character, so ^ and $ have been omitted from the transitions for clarity.
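A minimal sketch of how such a combined automaton could be driven, one character per transition, starting from q0. The transition table, the rule that every node other than q0 is accepting, and the restart at q0 after each token are assumptions made for this example. They are not taken from conf/lexer_rules.json.

```python
import re

# Hypothetical transition table: node -> list of (one-character pattern, target).
TRANSITIONS = {
    "q0": [
        (r"[a-zA-Z]", "identifier"),
        (r"[0-9]", "number"),
        (r"=", "assignation"),
        (r" ", "whitespace"),
    ],
    "identifier": [(r"[a-zA-Z0-9_]", "identifier")],
    "number": [(r"[0-9]", "number")],
    "assignation": [],
    "whitespace": [],
}


def step(node, x):
    """Follow the path of `node` that matches the single character x, if any."""
    for pattern, target in TRANSITIONS[node]:
        if re.fullmatch(pattern, x):
            return target
    return None


def tokenize(text):
    """Split text into (node name, lexeme) pairs, restarting at q0 after each token."""
    tokens = []
    node, lexeme = "q0", ""
    i = 0
    while i < len(text):
        target = step(node, text[i])
        if target is not None:
            # A transition takes place: consume the character.
            node, lexeme = target, lexeme + text[i]
            i += 1
        elif node == "q0":
            raise ValueError(f"unexpected character {text[i]!r} at position {i}")
        else:
            # No transition from the current node: emit the token, retry from q0.
            tokens.append((node, lexeme))
            node, lexeme = "q0", ""
    if node != "q0":
        tokens.append((node, lexeme))
    return tokens


print(tokenize("x1 = 42"))
```

In this sketch a token ends as soon as the current node has no path for the next character, which gives longest-match behaviour for identifiers and numbers.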

  • Access:

  • regular expression:

    ^\.$

  • automaton:

    Access

  • Assignation:

  • regular expression:

    ^=$

  • automaton:

    Assignator

  • Binary Operator:

  • regular expression:

    ^(\||&|<<|>>|~|\^)$

  • automaton:

    Binary Operator

  • Block:

  • Open

    • regular expression:

    ^{$

    • automaton:

    Block Open

  • Close

    • regular expression:

    ^}$

    • automaton:

    Block Close

  • Brace:

  • Open

    • regular expression:

    ^\($

    • automaton:

    Brace Open

  • Close

    • regular expression:

    ^\)$

    • automaton:

    Brace Close

  • Comparator:

  • regular expression:

    ^([<>][=]?|[!=]=)$

  • automaton:

    Comparator

  • End of Statement:

  • regular expression:

    ^[\n;]$

  • automaton:

    End of Statement

  • Identifier:

  • regular expression:

    ^[_]*[a-zA-Z][a-zA-Z0-9_]*$

  • automaton:

    Identifier

  • Index:

  • Open:

    • regular expression:

    ^\[$

    • automaton:

    Index Open

  • Close:

    • regular expression:

    ^\]$

    • automaton:

    Index Close

  • Hexadecimal:

  • regular expression:

    ^0x[a-fA-F0-9]+$

  • automaton:

    Hexadecimal

  • Negation:

  • regular expression:

    ^!$

  • automaton:

    Negation

  • Number:

  • regular expression:

    ^[0-9]+(\.[0-9]+)?$

  • automaton:

    Number

  • Operator:

  • regular expression:

    ^(\+|\-|(\*[\*]?)|\/|%)$

  • automaton:

    Operator

  • Separator:

  • regular expression:

    ^,$

  • automaton:

    Separator

  • String:

  • regular expression:

    ^(")(?:(?=(\\?))\2.)*?"$

  • automaton:

    String

  • Whitespace:

  • regular expression:

    ^ $

  • automaton:

    Whitespace
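The regular expressions listed above can be checked against whole lexemes with a short script. The sketch below copies a subset of the expressions verbatim. The `PATTERNS` dictionary and the `classify` helper are hypothetical names used only for this example, and Python's `re` module is assumed as the regex engine.

```python
import re

# Expressions copied from the list above (subset).
PATTERNS = {
    "Binary Operator": r"^(\||&|<<|>>|~|\^)$",
    "Comparator": r"^([<>][=]?|[!=]=)$",
    "Identifier": r"^[_]*[a-zA-Z][a-zA-Z0-9_]*$",
    "Hexadecimal": r"^0x[a-fA-F0-9]+$",
    "Number": r"^[0-9]+(\.[0-9]+)?$",
    "Operator": r"^(\+|\-|(\*[\*]?)|\/|%)$",
    "String": r'^(")(?:(?=(\\?))\2.)*?"$',
}


def classify(lexeme):
    """Return the names of all listed token types whose expression matches lexeme."""
    return [name for name, pattern in PATTERNS.items() if re.fullmatch(pattern, lexeme)]


print(classify("_foo1"))   # ['Identifier']
print(classify("_1"))      # [] (a letter is required after the leading underscores)
print(classify("3.14"))    # ['Number']
print(classify("3."))      # [] (a dot must be followed by at least one digit)
print(classify("0xFF"))    # ['Hexadecimal']
print(classify("<="))      # ['Comparator']
print(classify("<<"))      # ['Binary Operator']
print(classify("**"))      # ['Operator']
print(classify('"a\\"b"')) # ['String'] (escaped quote inside the string)
```

Each sample lexeme matches at most one expression, which is what a deterministic tokenizer needs.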
