Issue #474 new

Ruby: Non-ASCII Method Names Not Recognised

Anonymous created an issue

Ruby 1.9 allows method names to include non-ASCII characters with the following caveats:

  • The characters must be valid in the file's source encoding.

  • A legal method name that does not end with '!', '?', or '=' may have one of these characters appended.

  • The ASCII punctuation characters that make up operator methods (e.g. {{{[*%&^`~+-/\[<>=]}}}) must not appear anywhere else in a method name, except as the suffix described above.

Pygments does not recognise such method names, lexing the first non-ASCII character as an error. Examples of unrecognised method names are given in http://pygments.org/demo/3147/ .
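
The caveats above can be approximated with a single pattern. This is only a rough sketch, not the pattern Pygments ships: `METHOD_NAME` and `is_method_name` are hypothetical names, the pattern assumes that any non-ASCII character counts as an identifier character, and it checks neither source-encoding validity nor the operator method names themselves.

```python
import re

# Hypothetical pattern for illustration; not taken from Pygments.
# Start: ASCII letter, underscore, or any non-ASCII character.
# Continuation: the same set plus ASCII digits.
# Suffix: at most one of '!', '?', '='.
METHOD_NAME = re.compile(
    r'(?:[a-zA-Z_]|[^\x00-\x7F])'
    r'(?:[a-zA-Z0-9_]|[^\x00-\x7F])*'
    r'[!?=]?'
)


def is_method_name(name):
    """Return True if the whole string fits the sketch of the rules above."""
    return METHOD_NAME.fullmatch(name) is not None
```

Under these assumptions, `is_method_name("größe")` and `is_method_name("日本語?")` are accepted, while `is_method_name("a+b")` and `is_method_name("foo!!")` are rejected because operator punctuation appears outside the single-suffix position.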

Reported by guest

Comments (3)

  1. thatch

    Do you have any reference to those rules, or perhaps the grammar itself? I checked the existing RubyLexer's rules and they're super-complicated:

                (r'(?:([a-zA-Z_][a-zA-Z0-9_]*)(\.))?'
                 r'([a-zA-Z_][\w_]*[\!\?]?|\*\*?|[-+]@?|'
                 r'[/%&|^`~]|\[\]=?|<<|>>|<=?>|>=?|===?)',
                 bygroups(Name.Class, Operator, Name.Function), '#pop'),
    
  2. thatch

    I did some digging. I still can't find a formal announcement, but local Rubyists confirm that such support was "rumored."

    Checking the source (ruby 1.9 snapshot, `parse.y`) I see some code for this.

    #define is_identchar(p,e,enc) (rb_enc_isalnum(*p,enc) || (*p) == '_' || !ISASCII(*p))
    #define parser_is_identchar() (!parser->eofp && is_identchar((lex_p-1),lex_pend,parser->enc))
    ...
    
        mb = ENC_CODERANGE_7BIT;
        do {
            if (!ISASCII(c)) mb = ENC_CODERANGE_UNKNOWN;
            if (tokadd_mbchar(c) == -1) return 0;
            c = nextc();
        } while (parser_is_identchar());
        switch (tok()[0]) {
          case '@': case '$':
            pushback(c);
            break;
          default:
            if ((c == '!' || c == '?') && !peek('=')) {
                tokadd(c);
            }
            else {
                pushback(c);
            }
        }
        tokfix();
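
    The fragment above could be mirrored in Python roughly as follows. This is a sketch under assumptions, not a port of the parser: `is_identchar` and `scan_identifier` are hypothetical helpers, the code works on Python code points rather than encoded bytes, and it assumes the caller has already decided that `text[start]` begins an identifier.

```python
def is_identchar(ch):
    # Mirrors the is_identchar macro: alphanumeric, '_', or any non-ASCII character.
    return ch == '_' or ord(ch) > 0x7F or ch.isalnum()


def scan_identifier(text, start=0):
    """Return the index just past the identifier that begins at text[start]."""
    # The C code uses do/while, so the first character is consumed unconditionally.
    pos = start + 1
    while pos < len(text) and is_identchar(text[pos]):
        pos += 1
    # Names starting with '@' or '$' never take a suffix.
    if text[start] in '@$':
        return pos
    # Accept one trailing '!' or '?' unless the next character is '='.
    if pos < len(text) and text[pos] in '!?' and text[pos + 1:pos + 2] != '=':
        pos += 1
    return pos
```

    With this sketch, scanning `"größe? x"` stops after the `?`, while scanning `"foo!= 1"` stops before the `!`, matching the `!peek('=')` check in the C code.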
    