Linter to check for single-character output in examplefiles

Issue #1164 new
Tim Hatch created an issue

We've had performance issues in the past of the form

'string': [
    ('"', String, '#pop'),
    ('\\n', String),
    ('.', String),
],

where the single . (rather than something like [^"\\]+) causes many more tokens to be output than necessary. This should be fairly easy to check against the existing example files (flag, say, ~4 or more consecutive single-character tokens of the same type), assuming adequate coverage.
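
A rough sketch of what such a check could look like (the helper name, the threshold, and the example path below are placeholders, not an agreed design): lex each example file and report runs of same-typed single-character tokens.

from pygments.lexers import get_lexer_for_filename

def single_char_runs(lexer, code, threshold=4):
    # Yield (token_type, run_length) for every run of `threshold` or more
    # consecutive single-character tokens of the same type.
    run_type, run_len = None, 0
    for ttype, value in lexer.get_tokens(code):
        if len(value) == 1 and ttype == run_type:
            run_len += 1
            continue
        if run_len >= threshold:
            yield run_type, run_len
        if len(value) == 1:
            run_type, run_len = ttype, 1
        else:
            run_type, run_len = None, 0
    if run_len >= threshold:
        yield run_type, run_len

# hypothetical usage against a single example file (the path is made up)
with open('tests/examplefiles/example.c', 'rb') as f:
    code = f.read().decode('utf-8', 'replace')
lexer = get_lexer_for_filename('example.c', code=code)
for ttype, length in single_char_runs(lexer, code):
    print('suspicious run: %d consecutive 1-char %s tokens' % (length, ttype))

Whitespace-only tokens (single newlines and the like) would probably need to be excluded to keep the false-positive rate down.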

Comments (4)

  1. Hiroaki Itoh

    Regarding the performance issues: at the very least, I think you (we?) should also rewrite the example in the documentation.

    I tried modifying my PR#511 like this:

            'comment': [
                (r'[^*/]+', Comment.Multiline),  # add +
                (r'\*/', Comment.Multiline, '#pop'),
                (r'[*/]+', Comment.Multiline)  # add +
            ],
    

    With that change, I could improve the performance a little.
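
    For intuition, here is a toy lexer (made up just for this comment, not the real one from the PR) that shows how the + collapses one-token-per-character output into one token per run:

    from pygments.lexer import RegexLexer
    from pygments.token import Comment, Text

    def toy_lexer(body_rule):
        # build a throwaway lexer whose comment-body rule we can vary
        class Toy(RegexLexer):
            tokens = {
                'root': [
                    (r'/\*', Comment.Multiline, 'comment'),
                    (r'[^/]+', Text),
                    (r'/', Text),
                ],
                'comment': [
                    (body_rule, Comment.Multiline),
                    (r'\*/', Comment.Multiline, '#pop'),
                    (r'[*/]', Comment.Multiline),
                ],
            }
        return Toy()

    text = '/* a reasonably long C-style comment */\n'
    for body_rule in (r'[^*/]', r'[^*/]+'):
        # without the + every character of the comment body is its own token
        print(body_rule, len(list(toy_lexer(body_rule).get_tokens(text))))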

    I tested with the code below:

    # -*- coding: utf-8 -*-
    """
    This is not part of the nose tests.
    """

    from __future__ import print_function

    import os
    from fnmatch import fnmatch

    import pygments
    from pygments.lexers import get_lexer_for_filename, get_lexer_by_name
    from pygments.util import ClassNotFound
    from pygments.formatters.terminal import TerminalFormatter
    from pygments.formatters.html import HtmlFormatter

    try:
        # python 2
        import StringIO
        StringIO = StringIO.StringIO
    except ImportError:
        # python 3
        from io import StringIO
    TESTDIR = os.path.dirname(__file__)
    
    def perftest_target_example_file(fnpattern):
        for fn in sorted(os.listdir(os.path.join(TESTDIR, 'examplefiles'))):
            if fn.startswith('.') or fn.endswith('#'):
                continue
            if not fnmatch(fn, fnpattern):
                continue
    
            absfn = os.path.join(TESTDIR, 'examplefiles', fn)
            if not os.path.isfile(absfn):
                continue
    
            print(absfn)
            with open(absfn, 'rb') as f:
                code = f.read()
            try:
                code = code.decode('utf-8')
            except UnicodeError:
                code = code.decode('latin1')
    
            lx = None
            if '_' in fn:
                try:
                    lx = get_lexer_by_name(fn.split('_')[0])
                except ClassNotFound:
                    pass
            if lx is None:
                try:
                    lx = get_lexer_for_filename(absfn, code=code)
                except ClassNotFound:
                    raise AssertionError('file %r has no registered extension, '
                                         'nor is of the form <lexer>_filename '
                                         'for overriding, thus no lexer found.'
                                         % fn)
            import time
            res = 0.
            N = 3000
            for i in range(N):
                #formatter = TerminalFormatter()
                formatter = HtmlFormatter()
                render_result_stream = StringIO()
                t1 = time.time()
                pygments.highlight(code, lx, formatter, render_result_stream)
                t2 = time.time()
                rendered = render_result_stream.getvalue()
                res += (t2 - t1)
            print(rendered)  # show the last rendered output as a sanity check
            print("%d bytes, %.4f [ms] / %.6f [ms/byte]" % (
                    len(code),
                    1000 * res / N,
                    1000 * res / N / len(code)))
    
    if __name__ == '__main__':
        import sys
        perftest_target_example_file(sys.argv[1])
    

    The results are:

    before

    475 bytes, 2.4387 [ms] / 0.005134 [ms/byte]
    

    after

    475 bytes, 1.9927 [ms] / 0.004195 [ms/byte]
    

    C-style comments occur in very many lexers, so we can improve lexers across the board.

  2. Tim Hatch reporter

    Yes (although your r'[*/]+' should be r'[*/]' -- this one needs to keep single characters to work correctly e.g. /* /*/). There are lots of cases where outputting several characters is worthwhile (another is simply \s vs \s+). Thanks for taking an interest.
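
    To spell out the /* /*/ case: once /* has opened the comment and [^*/]+ has eaten the space, the remaining input is /*/; the rules are tried in order, and [*/]+ swallows all of it, including the closing */, so the \*/ rule never gets a chance to pop the state. A quick check with plain re (essentially what the lexer does at each position):

    import re

    rest = '/*/'                      # what is left of '/* /*/' inside 'comment'
    print(re.match(r'\*/', rest))     # None -- the #pop rule cannot match yet
    print(re.match(r'[*/]+', rest))   # matches '/*/', eating the closing */
    print(re.match(r'[*/]', rest))    # matches only '/', leaving '*/' for #pop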

  3. Hiroaki Itoh

    although your r'[*/]+' should be r'[*/]' -- this one needs to keep single characters to work correctly e.g. /* /*/

    Oh, indeed. Thanks.

    another is simply \s vs \s+

    Another PR of mine had this case.

    Thanks for taking an interest.

    The performance of lexers directly affects Sphinx, etc. Documents built with Sphinx sometimes have many code snippets, so lexer performance should not be underestimated, I think.
