Regex character class does not allow unescaped hyphen at final position

Issue #52 resolved
Jesper Lindholm created an issue

The Hime grammar language reference says that regular expressions are supported and defers to an IEEE standard. The 7th bullet point in section 9.3.5 RE Bracket Expression, which defines character classes, says the following:

"The hyphen character shall be treated as itself if it occurs first (after an initial '^', if any) or last in the list, or as an ending range point in a range expression. As examples, the expressions "[-ac]" and "[ac-]" are equivalent and match any of the characters 'a', 'c', or '-'[..]"

In other words, including a hyphen without escaping it is permissible if done at the end. This is also congruent with every standard regex library I've used in C#, Ruby, Perl, PHP, etc.
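
As a quick check of the conventional behavior described above, here is a minimal sketch using Python's standard `re` module. It only illustrates what a common regex library does and says nothing about how Hime's lexer is implemented; the names `TRAILING`, `ESCAPED`, `LEADING` and `matches_whole` are made up for this example.

```python
import re

# Three spellings of the same character class. In Python's re module,
# a hyphen that is first, last, or escaped inside [...] is a literal hyphen.
TRAILING = re.compile(r"[A-Za-z0-9-]+")
ESCAPED = re.compile(r"[A-Za-z0-9\-]+")
LEADING = re.compile(r"[-A-Za-z0-9]+")


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire input string."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for pattern in (TRAILING, ESCAPED, LEADING):
        print(pattern.pattern, matches_whole(pattern, "abc-xyz"))
```

All three patterns accept 'abc-xyz' in full, which is the behavior the issue expects from the unescaped trailing hyphen.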

But when I define a terminal like so:

VARIABLE_NAME -> [A-Za-z0-9-]+;

it does not match 'abc-xyz'. Whereas if I escape the hyphen:

VARIABLE_NAME -> [A-Za-z0-9\-]+;

it matches correctly.

It is not a huge bug, but as a silent discrepancy from both the spec and common implementations, it would be good not to have to look out for it.

Comments (6)

  1. Jesper Lindholm reporter

    It seems as if escaping the hyphen does not work either and instead causes some sort of himecc error at generation time. I had to remove it for the time being.

  2. Laurent Wouters

    Hello! Thank you for the feedback, this is a nice find. I should be able to fix this in a few days. In the meantime, it looks like you can work around the issue. Thanks!

  3. Jesper Lindholm reporter

    I got this to work in 3.3.2.

    For anyone who might be interested and maybe found this from searching... I got weird results because there was an opportunity for ambiguity in my grammar:

    VARIABLE_NAME -> [A-Za-z0-9-]+;

    ...meant that a hyphen could exist in the first position, which caused problems with other parts of my grammar allowing for unary minus. I solved this by disabling the hyphen in the first position:

    VARIABLE_NAME -> [A-Za-z0-9] [A-Za-z0-9-]*;

    Thanks for a great product.
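
The difference between the two terminal definitions in the comment above can be sketched with Python's standard `re` module. This is only an illustration of the two patterns, not of Hime itself: it assumes the space between the two bracket expressions in the Hime rule means concatenation, and the names `LOOSE`, `STRICT` and `is_name` are hypothetical.

```python
import re

# LOOSE mirrors the original terminal: a hyphen may appear anywhere,
# including the first position.
LOOSE = re.compile(r"[A-Za-z0-9-]+")

# STRICT mirrors the revised terminal: the first character must be
# alphanumeric, and hyphens are allowed only after it.
STRICT = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def is_name(pattern, text):
    """Return True if the whole string is accepted by the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ("abc-xyz", "-x", "x"):
        print(text, is_name(LOOSE, text), is_name(STRICT, text))
```

With the first pattern, '-x' is a valid name, so an input like that can be read either as a name or as a minus followed by 'x'. The second pattern rejects a leading hyphen while still accepting 'abc-xyz'.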
