Tokenizer conformance to spec

Issue #582 resolved
Felipe created an issue

Not sure how I hadn't found this before, but the formula syntax is included in ISO/IEC-29500-1 §18.17, which can be found in the publicly available standards section of the ISO website. (Ctrl-F for ISO/IEC 29500-1). Reading through the spec, I noticed a variety of issues with the tokenizer:

  • #GETTING_DATA is not recognized as a valid error code
  • Scientific notation only accepts capitalized E (should accept both capital and lowercase)
  • + is not allowed as a prefix operator (though this one probably causes no harm)
  • , is always treated as a separator (should be an infix operator for reference unions)
  • : is treated as part of a range reference (should be an infix operator for reference ranges)
  • whitespace is not dealt with properly. As an infix operator, it intersects ranges. Otherwise it should be ignored according to:

In a formula, an arbitrary number of space characters (U+0020) can precede the first token or follow the final token. An arbitrary number of space characters can separate two adjacent tokens, except that no space characters shall separate a function-name from the left parenthesis (() that follows it. Such space characters have no effect on the semantics of a formula; however, such spaces shall be distinguished from the space operator

Comments (14)

  1. CharlieC

    Thanks. Any chance of you working on these yourself?

    FWIW ISO 29500 is the same as ECMA 476 which is linked to from the docs.

  2. Felipe reporter

    Yep, absolutely. I was just collecting these as I found them while responding to other issue about named ranges vs cell references. Re: ECMA 476, where is it linked? Hadn't seen that before...

  3. CharlieC

    Thanks. BTW. I do think the tokeniser could do with some profiling though it's already pretty fast. I'm sure that files like the one in #494 could benefit from that. I've just been looking at the code.

    hm, maybe I just thought it was linked. I should probably add a paragraph about the spec to the docs page. Microsoft has done a reasonable job of burying the spec in the various standards bodies! ;-)

  4. Felipe reporter

    Yeah you're probably right. I started to toy with the idea of re-writing the whole thing as a single regex with a post-process pass to ensure proper categorization of tokens in this gist but never got around to fleshing that out/benchmarking performance

  5. CharlieC

    I suspect it won't make too much difference to be honest. When reading large files the biggest overhead is creating all the cells but I suspect translating the formulae might also be a bit of a break. Of course, people shouldn't really be using Excel for such large volumes of data! ;-)

  6. CharlieC

    Also: any reason why parse() can't be called by the init ? Keep forgetting to call it in client code.

  7. CharlieC

    @felipeochoa FWIW I've almost finished moving from the old style handling of definedNames and will rely solely on the Tokenizer for detecting cell (or row / column) ranges. It's already very good at this.

  8. Felipe reporter

    That's great to hear! I've fixed the first two points, which I think are the most important, since they could cause the tokenizer to hang on an otherwise valid formula. The others I'm inclined to leave as wontfix for the time being, since it's merely an issue of categorizing the tokens. It could easily be fixed with an extra pass at the end, but I don't want to add that performance penalty until we need the new functionality. @charlie_x Which branch should I PR to?

  9. CharlieC

    Please use the 2.4 branch. Functionality can be extended in the future: it would for example be nice to check user-created formulae against known ones so that people don't get caught out by the "localisation trap" and only find out that they've entered rubbish when Excel complains about the file. But in the most part we're just keeping what's in the original file.

    PS. make the internal methods private and have parse run at the end of the __init__ unless you have a good reason not to do so. The API should be: create an instance and read the parsed items.

  10. CharlieC

    The following test currently passes but should fail:
    ("B1namedrange", 'RANGE')

    Will your changes fix this?

  11. Felipe reporter

    Sorry I thought I'd responded already. In my Excel (2013) B1namedrange is a valid name for a range. Based on my understanding, the only forbidden names are those that look exactly like a cell reference. In this case, because of thenamedrange suffix, this is correctly parsed as a single reference.

  12. Log in to comment