Regex character class does not allow unescaped hyphen at final position

Issue #52 resolved
Jesper Lindholm
created an issue

The Hime grammar language reference says that regular expressions are supported and defers to an IEEE standard. The 7th bullet point in section 9.3.5, RE Bracket Expression, which defines character classes, says the following:

"The hyphen character shall be treated as itself if it occurs first (after an initial '^', if any) or last in the list, or as an ending range point in a range expression. As examples, the expressions "[-ac]" and "[ac-]" are equivalent and match any of the characters 'a', 'c', or '-'[..]"

In other words, an unescaped hyphen is permissible when it is the last character in the list. This is also consistent with every standard regex library I've used in C#, Ruby, Perl, PHP, etc.

But when I define a terminal like so:

VARIABLE_NAME -> [A-Za-z0-9-]+;

it does not match 'abc-xyz'. Whereas if I escape the hyphen:

VARIABLE_NAME -> [A-Za-z0-9\-]+;

it matches correctly.
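
For comparison, here is a minimal sketch of the behaviour the quoted passage describes, using Python's standard `re` module rather than Hime. The names `TRAILING_HYPHEN`, `ESCAPED_HYPHEN`, and `matches_whole` are made up for illustration; the two patterns are the character classes from the terminals above.

```python
import re

# The two character classes from the report, written as Python patterns.
# (Illustrative names; this exercises Python's regex engine, not Hime's.)
TRAILING_HYPHEN = re.compile(r"[A-Za-z0-9-]+")   # unescaped hyphen in last position
ESCAPED_HYPHEN = re.compile(r"[A-Za-z0-9\-]+")   # explicitly escaped hyphen


def matches_whole(pattern, text):
    """Return True if the pattern matches the entire text."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for label, pattern in [("trailing", TRAILING_HYPHEN), ("escaped", ESCAPED_HYPHEN)]:
        print(label, matches_whole(pattern, "abc-xyz"))
```

In Python both forms accept 'abc-xyz', which is the equivalence the report expects from the terminal definition.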

It is not a huge bug, but it is a silent discrepancy from both the spec and common implementations, and it would be good not to have to look out for it.

Comments (6)

  1. Jesper Lindholm reporter

    I got this to work in 3.3.2.

    For anyone who might be interested and maybe found this from searching... I got weird results because there was an opportunity for ambiguity in my grammar:

    VARIABLE_NAME -> [A-Za-z0-9-]+;

    ...meant that a hyphen could appear in the first position, which caused problems with other parts of my grammar that allow unary minus. I solved this by disallowing the hyphen in the first position:

    VARIABLE_NAME -> [A-Za-z0-9] [A-Za-z0-9-]*;

    Thanks for a great product.
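
The effect of the revised terminal can be sketched outside Hime with Python's standard `re` module. This is only an illustration of the two character-class patterns, not of how Hime resolves the unary-minus conflict; `LOOSE_NAME`, `STRICT_NAME`, and `is_name` are hypothetical names.

```python
import re

# Original terminal: a hyphen may appear anywhere, including the first position.
LOOSE_NAME = re.compile(r"[A-Za-z0-9-]+")
# Revised terminal: the first character must be a letter or digit.
STRICT_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9-]*")


def is_name(pattern, text):
    """Return True if the whole text is accepted as a name by the pattern."""
    return pattern.fullmatch(text) is not None


if __name__ == "__main__":
    for text in ["abc-xyz", "-x", "a"]:
        print(text, is_name(LOOSE_NAME, text), is_name(STRICT_NAME, text))
```

With the revised pattern, an input such as '-x' is no longer accepted as a single name, so a leading minus is left for the rest of the grammar to interpret.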
