Issue #477 open

[patch] Refactor RegexLexerMeta for extensibility

Anonymous created an issue

I'm working on a project where I sadly can't use Pygments directly, but I am nonetheless trying to take advantage of its lexers, at least the ones that are purely regex-based (without overriding get_tokens_unprocessed). To do this, I refactored RegexLexerMeta so that there are separate methods to translate each component of a token tuple. My code then overrides those methods to, for instance, keep the regular expressions as strings instead of compiling them into matcher functions.
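
To make that concrete, here is a simplified sketch of the shape I have in mind; the method names and bodies below are illustrative only, not the exact code in the patch:

    import re

    class RegexLexerMeta(type):
        """Sketch: preprocess token definitions with one overridable
        hook per component of a (regex, token, new_state) tuple."""

        def _process_regex(cls, regex, rflags):
            # Default: compile the pattern into a matcher function.
            # An override could return the pattern string unchanged.
            return re.compile(regex, rflags).match

        def _process_token(cls, token):
            # Default: pass the token type (or callback) through.
            return token

        def _process_new_state(cls, new_state, unprocessed, processed):
            # Default: resolve '#pop', '#push' and named-state transitions.
            return new_state

My code then overrides hooks like _process_regex instead of duplicating the whole token-definition processing loop.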

I think that the refactoring is generally useful and a good idea in its own right. Would you consider accepting it as a patch?

Reported by ArthurDenture

Comments (5)

  1. gbrandl

    First, I'm sorry that I hadn't done more than glance at the patch before making false claims :)

    I think that increase in startup overhead is acceptable, since the lexer token definitions aren't processed until a lexer class is actually instantiated (roughly as sketched below). Taking a little longer there shouldn't be noticeable as long as one doesn't instantiate all lexers at once.

    I'll revisit this before the next release.
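
    For reference, the deferral works roughly like this (a paraphrase of the metaclass, not the literal code):

        class RegexLexerMeta(type):
            def __call__(cls, *args, **kwds):
                # The expensive preprocessing of the token definitions happens
                # here, once per lexer class, and only if the class is
                # actually instantiated.
                if '_tokens' not in cls.__dict__:
                    cls._tokens = cls.process_tokendef('', cls.get_tokendefs())
                return type.__call__(cls, *args, **kwds)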

  2. Anonymous

    Yikes, I thought I was on the CC list, but I didn't get an email when you followed up. Sorry for the delay.

    This patch does not affect the RegexLexer matching loop; it only affects RegexLexerMeta. Any overhead is incurred at instantiation time, not during the actual highlighting.

    Anyhow, I measured that overhead with timeit using the following:

        import timeit

        # delete the cached _tokens after each instantiation so every run
        # re-runs the token-definition processing
        timeit.Timer(stmt='PythonLexer(); del PythonLexer._tokens',
                     setup='from pygments.lexers.agile import PythonLexer').repeat(5, 10000)

    The fastest run on stock Pygments was 2.80 seconds, and the fastest run with my patched version was 3.21 seconds, so the patch adds about 15% overhead. That's a significant percentage, but keep in mind that in absolute terms, 3.21 seconds * 1000 (ms/s) / 10000 runs is about 0.32 ms instead of 0.28 ms to instantiate a single lexer. Is this increase in startup overhead acceptable?
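
    For completeness, here is the same measurement as a self-contained script that reports the per-instantiation time directly (the import path matches the Pygments version I tested against; adjust as needed):

        import timeit

        RUNS = 10000
        timer = timeit.Timer(
            stmt='PythonLexer(); del PythonLexer._tokens',
            setup='from pygments.lexers.agile import PythonLexer')
        best = min(timer.repeat(5, RUNS))  # best total time for RUNS instantiations
        print('%.3f ms per instantiation' % (best * 1000.0 / RUNS))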

  3. gbrandl

    The reason the RegexLexer loop is written as one big block is performance.

    For example, in the matching part of the loop, I suspect that an additional method call for every regex that is tried would cause a noticeable slowdown. It would be good if you could measure that.
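
    Something along these lines would answer it: a hypothetical micro-benchmark (not Pygments code) comparing a direct pattern.match call with the same call routed through an extra function, the way each tried regex would be:

        import re
        import timeit

        rex = re.compile(r'[a-z]+')

        def indirect(text, pos):
            # the extra per-regex call the refactored loop would make
            return rex.match(text, pos)

        n = 1000000
        t_direct = min(timeit.Timer('rex.match("sample text", 0)',
                                    'from __main__ import rex').repeat(3, n))
        t_wrapped = min(timeit.Timer('indirect("sample text", 0)',
                                     'from __main__ import indirect').repeat(3, n))
        print('direct: %.3fs  wrapped: %.3fs' % (t_direct, t_wrapped))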
