pygments is too slow to use on large files in a heavily-loaded production environment

Issue #508 resolved
Former user created an issue

On my local machine (Core2 Duo 2.66GHz), here are the times to run pygmentize using pygments 1.3.1 on a Perl file at various sizes:

{{{
131K file: 1 second
261K: 2 seconds
391K: 2.8 seconds
521K: 3.8 seconds
}}}

In other words, about 1 second per 128K.

For a python file, I get these times:

{{{
115K: 0.697s
230K: 1.190s
345K: 1.773s
}}}

It continues to be linear--about 0.6 seconds for each 128K or so.

And that's on my local machine that's not really doing anything else.
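
A rough sketch of how timings like these can be reproduced against the Pygments API directly (the input path below is just a placeholder, not the file measured above) is:

{{{
# Illustrative timing sketch only; 'big.pl' is a placeholder path.
import time

from pygments import highlight
from pygments.lexers import PerlLexer
from pygments.formatters import HtmlFormatter

code = open('big.pl').read()

start = time.time()
highlight(code, PerlLexer(), HtmlFormatter())
print('%.3f seconds for %d bytes' % (time.time() - start, len(code)))
}}}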

The problem is that if another system already takes several seconds to process and display a file, and that file is large, then I can't use pygments on it, because highlighting would make the whole display process take too long.

If we add multiple threads running pygments at the same time (with the server generally doing other things as well), then pygments is too slow to use on large files at all.

Here's the bug where I originally found the problem:

https://bugs.launchpad.net/loggerhead/+bug/513044

Reported by mkanat

Comments (5)

  1. thatch

    (If your bug is about the relative performance of highlighting perl vs python, please correct me -- I'll assume it's a comment on the general slowness of pygments given the reference to pyrex in the loggerhead bug).

    I instrumented pygmentize running on a 345KB Python file that I cobbled together. Mine takes 2.050s on a 2.4GHz Core 2 Duo running OS X, so the timings are not that different. If you'd like me to take a look at your source file, please attach a copy.

    The most expensive calls are:

       ncalls  tottime  percall  cumtime  percall filename:lineno(function)
       885313    0.545    0.000    0.545    0.000 {built-in method match}
        87827    0.502    0.000    1.120    0.000 lexer.py:467(get_tokens_unprocessed)
        11105    0.445    0.000    2.564    0.000 html.py:609(_format_lines)
       439132    0.212    0.000    0.212    0.000 {method 'replace' of 'unicode' objects}
       177299    0.175    0.000    0.207    0.000 token.py:43(__hash__)
        87826    0.156    0.000    0.512    0.000 html.py:377(_get_css_class)
        87826    0.156    0.000    0.367    0.000 html.py:24(escape_html)
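
    (For anyone trying to reproduce numbers like these: a hedged sketch of the kind of profiling involved, with the input file and lexer as placeholders rather than the ones actually used here, is below.)

        # Illustrative profiling sketch only; 'sample.py' is a placeholder input.
        import cProfile
        import pstats

        from pygments import highlight
        from pygments.lexers import PythonLexer
        from pygments.formatters import HtmlFormatter

        code = open('sample.py').read()

        # Profile a single highlight() call and print the hottest functions.
        profiler = cProfile.Profile()
        profiler.runcall(highlight, code, PythonLexer(), HtmlFormatter())
        pstats.Stats(profiler).sort_stats('tottime').print_stats(10)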
    

    I'm able to shave a few tenths of a second off _TokenType.hash and escape_html, but that's about it. Checking the regexes themselves, these are the slowest ones (timings are roughly 2x the numbers above because of profiler overhead):

    regex                                          time-to-match      time-to-fail     total-attempts  total-time
    '!=|==|<<|>>|[-~+/*%=<>&^|.]'                  0.00948691368103   0.400342702866   30961           0.409829616547
    "\\\\\\\\\\\\\\\\|\\\\\\\\'|\\\\\\\\\\\\n"     1.09672546387e-05  0.185655593872   2272            0.185666561127
    '#.*$'                                         0.00154423713684   0.162843942642   48962           0.164388179779
    '(def)((?:\\\\s|\\\\\\\\\\\\s)+)'              0.00110173225403   0.157556533813   20968           0.158658266068
    '\\\\\\\\'                                     0                  0.120788097382   31573           0.120788097382
    '\\\\n'                                        0.0163886547089    0.100791454315   78414           0.117180109024
    "(?:[rR]|[uU][rR]|[rR][uU])'''"                0                  0.0863089561462  17168           0.0863089561462
    '^(\\\\s*)("""(?:.|\\\\n)*?""")'               0.00609183311462   0.0763375759125  68061           0.0824294090271
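
    (These per-regex numbers came from profiling; a rough stand-alone sketch of timing one pattern against a source file, with the pattern, file, and sampling stride as arbitrary placeholders, looks something like this.)

        # Illustrative only: accumulate match vs. fail time for one regex
        # sampled at many positions in a file.
        import re
        import time

        pattern = re.compile(r'#.*$', re.MULTILINE)   # placeholder pattern
        text = open('sample.py').read()               # placeholder input

        matched = failed = 0.0
        for pos in range(0, len(text), 50):           # arbitrary stride
            start = time.time()
            hit = pattern.match(text, pos)
            elapsed = time.time() - start
            if hit:
                matched += elapsed
            else:
                failed += elapsed
        print('time-to-match %.6f  time-to-fail %.6f' % (matched, failed))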
    

    I did some quick tests with other highlighters, and Pygments is actually middle of the pack.

    highlighter   time
    enscript      0.203s
    pygments      2.050s
    silvercity    2.235s
    vim           21s

    Can you clarify what you'd like to see done differently?

  2. thatch

    As a follow-up, most systems I can think of (Trac and Github, for example) put a hard limit on the size of files they'll try to render as pretty code, and you could add caching of the rendered content if you're getting repeated requests for the same file.
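
    (A hedged sketch of that combination, a size cap plus a cache, with the limit, cache, and function names purely illustrative rather than anything loggerhead actually does, might look like this.)

        # Illustrative only: skip very large files and cache rendered output
        # keyed by the content hash; all names and limits here are made up.
        import hashlib

        from pygments import highlight
        from pygments.formatters import HtmlFormatter
        from pygments.lexers import guess_lexer_for_filename
        from pygments.util import ClassNotFound

        MAX_HIGHLIGHT_BYTES = 512 * 1024    # arbitrary cutoff
        _cache = {}

        def render(filename, code):
            """Return highlighted HTML, or None to fall back to plain text."""
            if len(code) > MAX_HIGHLIGHT_BYTES:
                return None
            key = hashlib.sha1(code.encode('utf-8')).hexdigest()
            if key not in _cache:
                try:
                    lexer = guess_lexer_for_filename(filename, code)
                except ClassNotFound:
                    return None
                _cache[key] = highlight(code, lexer, HtmlFormatter())
            return _cache[key]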

  3. gbrandl

    I'm afraid nothing can be done here in the absence of specific suggestions about where to improve performance. (For example, the patch from #523, which improves HTML output by about 10% in the areas Tim identified, has recently been applied.)
