Refactor the parser

Issue #589 new
Pol Welter created an issue

While the units update was still in the works, we had a discussion on this over on Hadrien's repo.

Let me summarize the discussion.

Pol Welter wrote:

The parser could indeed do with some improvements; just compiling the tokens into opcode is fairly limited. There are quite a few places that would benefit from being able to query the semantics of the expression. Think autocompleter, or the unit parsing you mention.

It would be very nice if the compiler would instead generate a parse tree (either abstract or concrete). For the evaluation we would just call an eval method of the root node, it would then recursively evaluate its children. This way, when a conversion operation is evaluated, the corresponding sub-tree would contain all the information of the unit, including the necessary string (either stored from the scanning step, or constructed on the fly from the nodes).

Such a tree based parser would also enable branched flows (think of the ternary operator ? : in C++), recursive user functions, pretty printing,...

and later ...

In particular the parser should also be able to produce a string (substring of the user input) that corresponds to the subexpression that is currently being evaluated. This is useful for the conversion operation, as here we need to extract the right hand side of the -> operator. For instance in (a+b -> c*d) * e we'd want to get c*d. Currently we only rely on the tokenized input, which is unsatisfactory, as it traverses the tokens twice (once for finding the string, and then once more when actually evaluating the expression), and applies precedence rules each time. Also I can imagine additions to the grammar (e.g. new operators) that will break this scheme. RPN for instance will totally wreck it as well.

A more reliable way would actually consult the semantics of the expression. This can only be done after the compiler has done its job. However, currently the compiler just produces opcodes (which are essentially RPN). The opcodes do not carry any information about the string anymore, so they are useless in that regard. (I had to hackishly include some strings into opcodes to make unit conversion work at all.)

We discussed two ways to tackle this:

  • I suggested rewriting the parser so that it generates a parse tree rather than opcodes. This represents quite a task, but will (probably?) solve the issues above. A tree contains all the information about the original input. RPN would be tricky, but possible.
  • Hadrien suggested that a (reduced) token should carry along [begin/end indexes of the string for the] original tokens it is made of. It is on this matter that I mentioned that RPN would still not work.

We concluded that RPN was not coming to SC anyway, so the latter might be a workable solution.

Comments (10)

  1. Tey'

    The parser_pr branch contains an implementation of Hadrien suggestion to carry information about the original text in each token. This removes the need to do extra parsing for the conversion operator at the end of the lexer. So far, the only difference with current implementation is that the result of 1 meter -> 10 meter is 0.1 10 meter instead of 0.1 (10 meter) (using 0.1 10 meter in an expression will be evaluated as 0.110 meter because of digits groups separators detection). Also, the unit string better matches what the user gives with this branch.

    Concerning branched flows and recursive functions, I managed to implement it using conditional jumps in the opcode stack (the same way it is implemented in programming language compilers), which didn't require so much work actually. Using a syntax tree would be better of course. The code could be more generic so I won't publish it yet. Also, I'm actually not so sure recursive functions will be really useful anyway: the only use case I can think of is sequences, but a spreadsheet application is more suited for that as users generally need to see the values of multiple elements rather than a single sequence element.

    Oh, and looks like we'll jump to the 1.0 version in a near future :o

  2. Pol Welter reporter

    Well, this 0.110 meter thing should be easy enough to work around. When checking if you need to fence the expression with parentheses, just verify whether or not the first non-whitespace character is contained in "0123456789e". (The e is for scientific notation like 1.5e2.)

    This will not fix the case like hex(123) -> 5 though...

    Recursive functions are not that useful indeed. But conditionals are.

  3. Helder Correia repo owner

    Oh, and looks like we'll jump to the 1.0 version in a near future :o - Did I scare you @Teyut? :) I've decided since 0.11 that 0.12 would be like an official beta for 1.0.

  4. Tey'

    just verify whether or not the first non-whitespace character is contained in "0123456789e". (The e is for scientific notation like 1.5e2.)

    It's fixed, thanks.

    This will not fix the case like hex(123) -> 5 though...

    What's the problem with the result?

    Did I scare you Teyut? :)

    Nah, but reaching version 1.0 is generally a big achievement :)

  5. Felix Krull

    The syntax right now is very loose and I don't think we're fully aware of what it does and doesn't allow. With issues like #634, I think the first step should be to create a formal definition of the syntax we want to support so we can hopefully get a better grip on what kind of inputs are valid and figure out whether they should be or not. Also, I like Hadrien's suggestion from that bug of a warning system to flag ambiguous syntax.

  6. Log in to comment