Support for Unicode character categories in regexps

Issue #1516 resolved
Clément Pit-Claudel created an issue

Hi all,

(Cross posted from IRC)

I'm improving the Pygments lexer for Coq. The language defines its identifiers in terms of Unicode general categories: an identifier starts with a character in category Lu, Ll, Lt, Lo, or Lm, followed by zero or more characters in those categories or in Nd, Nl, or No. How do I write a Pygments lexer for this? The newer third-party regex module can match these properties (using e.g. \p{Lu}), but Python's built-in re module has no similar feature.

Thanks!

Comments (3)

  1. Georg Brandl repo owner

    Pygments has the pygments.unistring helper module for that. Using it is not as pretty as regex, but switching Pygments from re to regex would be a pretty huge task, since compatibility would have to be ensured for the existing lexers.

    (It may be possible to have a lexer subclass that automatically pre-processes the \p{...} escapes in regexes before passing them on to re…)
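As a rough illustration of both suggestions above, here is a minimal sketch. It assumes that pygments.unistring exposes one range string per general category (uni.Lu, uni.Nd, and so on), each meant to be placed inside a character class, plus a combine() helper that concatenates them. The names expand_props and IdentLexer are hypothetical, and the identifier rule follows only the description in the question, not the actual Coq lexer.

```python
import re

from pygments import unistring as uni
from pygments.lexer import RegexLexer
from pygments.token import Name, Text

# Matches escapes such as \p{Lu}, which Python's re module does not understand.
_PROP = re.compile(r'\\p\{([A-Z][a-z])\}')


def expand_props(pattern):
    # Hypothetical helper: replace each \p{Xx} with the corresponding
    # pygments.unistring range string. The result is only valid when the
    # escape appears inside a character class, e.g. r'[\p{Lu}\p{Ll}]'.
    return _PROP.sub(lambda m: getattr(uni, m.group(1)), pattern)


# Identifier rule as described in the question (an assumption, not Coq's grammar).
START = r'[\p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{Lm}]'
CONT = r'[\p{Lu}\p{Ll}\p{Lt}\p{Lo}\p{Lm}\p{Nd}\p{Nl}\p{No}]'
IDENT = expand_props(START + CONT + '*')


class IdentLexer(RegexLexer):
    # Hypothetical toy lexer: identifiers become Name, everything else Text.
    name = 'IdentSketch'
    tokens = {
        'root': [
            (IDENT, Name),
            (r'\s+', Text),
            (r'.', Text),
        ],
    }


if __name__ == '__main__':
    for token, value in IdentLexer().get_tokens('αβ1 + 1abc'):
        print(token, repr(value))
```

Without the pre-processing step, the same start class can be written directly as '[' + uni.combine('Lu', 'Ll', 'Lt', 'Lo', 'Lm') + ']'. The helper only makes the pattern source look like the regex module's syntax.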
