Is there a low-baggage html option?

Issue #1514 resolved
Gene Callahan created an issue

The package seems oriented towards processing files or at least large chunks of code. But I want to feed in perhaps as little as a single line, and get back the line with minimal html inserted: really just the span tags, no divs or anything like that. Is this currently possible? I couldn’t see it in the docs.

Comments (25)

  1. Anteru

    No, not that I’m aware of. What exactly would you like to see? Would you be interested in preparing a PR to implement the functionality you need? (I’m happy to help with it – but I’m bandwidth-constrained to implement this myself)

  2. Gene Callahan reporter

    "What exactly would you like to see?"
    Well, I could feed in cout << "The counter is at " << i << endl; and that would get highlighted properly, but just as a single line to insert elsewhere. No <div> etc. So maybe:
    highlighted_line = highlight_line(orig_line)
    (Or something like that, but more in keeping with your naming conventions etc.)
    “Would you be interested in preparing a PR to implement the functionality you need?”
    For sure! We’re preparing a course web site and are extracting C++ code from source files to generate web pages. So I’ve got teaching assistants who can chip in on this as well.
    “I’m happy to help with it – but I’m bandwidth-constrained to implement this myself”
    Me too… but TAs!

  3. Clément Pit-Claudel

    You can already do that, actually. You need to subclass the HtmlFormatter class:

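    import pygments.formatters

    class InlineHtmlFormatter(pygments.formatters.HtmlFormatter):  # pylint: disable=no-member
        # Skip the default <div>/<pre> wrapping.
        def wrap(self, source, _outfile):
            return self._wrap_code(source)
    
        @staticmethod
        def _wrap_code(source):
            # Pass the highlighted lines through unchanged.
            yield from source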
    

    Then use pygments.highlight(code, LEXER, InlineHtmlFormatter())

  4. Clément Pit-Claudel

    I think this issue could still use a bit of work. In particular, it has two problems: 1, it will add an extraneous newline to the end of the input. 2, it would be good to have this built-in (with problem 1 fixed). In my code I actually use this instead, to work around problem 1, but a fix in Pygments would likely be much more efficient:

    import re
    import pygments
    
    # LEXER and FORMATTER are assumed to be defined elsewhere
    # (a lexer instance and an HTML formatter instance).
    WHITESPACE_RE = re.compile(r"^(\s*)((?:.*\S)?)(\s*)$", re.DOTALL)
    
    def highlight(s):
        # Pygments HTML formatter adds an unconditional newline, so we pass it only
        # the code, and we restore the spaces after highlighting.
        before, code, after = WHITESPACE_RE.match(s).groups()
        highlighted = pygments.highlight(code, LEXER, FORMATTER).strip()
        return before + highlighted + after
    
  5. Anteru

    That newline is appended to the HTML code? That doesn’t seem necessary. Is it just for “prettier” formatting?

  6. Gene Callahan reporter

    Folks, since I have you “here”, may I ask if you know of a package like Pygments for Markdown? I need to feed in individual Markdown strings, not whole docs, and get back HTML, just like my question above. I have found a number of packages that will do Markdown-to-HTML on whole docs, but nothing that does this on just snippets of a doc.

  7. Gene Callahan reporter

    Yes, that’s fine. So if you know of anything, could you email me and we could discuss it by email? ejc369@nyu.edu

  8. Anteru

    Wherever, just not here, please 🙂

    Back on topic: Clément, if there was an option to remove any extraneous whitespace added between HTML tags, would that solve the issue for you? I think adding such an option (“compact” or so) wouldn’t be a big deal, and it would be off by default to not break existing applications which may rely on it.

  9. Clément Pit-Claudel

    I think it’s mostly already the case that extraneous whitespace is avoided, except at the very end of the string. Here’s a concrete example:

    import pygments
    from pygments.lexers import CoqLexer
    from pygments.formatters.html import HtmlFormatter
    
    if __name__ == '__main__':
        lexer = CoqLexer(ensurenl=False)
        formatter = HtmlFormatter(nowrap=True)
        print(repr(pygments.highlight("a\nb", lexer, formatter)))
        print(repr(pygments.highlight("a\nb\n", lexer, formatter)))
    

    And the output is this:

    '<span class="n">a</span>\n<span class="n">b</span>\n'
    '<span class="n">a</span>\n<span class="n">b</span>\n'
    

    The problem here is that despite both nowrap and ensurenl = False, the _format_lines function in html.py adds a newline at the end of its output, which means that a\nb and a\nb\n get highlighted in the exact same way.

    Yes, an option to avoid this would be great :)

  10. Clément Pit-Claudel

    Btw, after looking further into this, you don’t need that InlineHtmlFormatter subclass (just passing nowrap=True to the default HtmlFormatter is enough). But the newlines issue remains :)
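
For the original single-line use case, the nowrap approach might look roughly like the sketch below. The function name highlight_line is hypothetical (it is not part of Pygments), and the choice of CppLexer is an assumption based on the C++ example from the reporter. The trailing newline is stripped by hand because of the behavior discussed in this thread.

```python
from pygments import highlight
from pygments.formatters import HtmlFormatter
from pygments.lexers import CppLexer


def highlight_line(line):
    """Return `line` as inline HTML: span tags only, no <div> or <pre>.

    Hypothetical helper for illustration; not a Pygments API.
    """
    formatter = HtmlFormatter(nowrap=True)  # nowrap=True skips the wrappers
    html = highlight(line, CppLexer(), formatter)
    # The formatter may append a newline of its own; drop it for inline use.
    return html.rstrip("\n")


print(highlight_line('cout << "The counter is at " << i << endl;'))
```

The result can be dropped into an existing element of the page; the CSS classes on the spans still need a Pygments stylesheet to show colors.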

  11. Anteru

    Given it's HTML to start with, I don't see why the \n is needed at all. Certainly not the last one, the ones in-between may be acceptable to make it easier to read.

  12. Clément Pit-Claudel

    I think the newlines are needed because the pygments <pre> block is typically rendered with white-space: pre-wrap, so they are preserved (otherwise all the code would appear on one giant line).

  13. Anteru

    Right, it’s one giant <pre> block. But the last one doesn’t matter … I wonder if there can be anyone relying on it being present.

  14. Clément Pit-Claudel

    Thanks! I think this change might be a bit too aggressive. Instead of this:

    >>> print(repr(pygments.highlight("a\nb", lexer, formatter)))
    '<span class="n">a</span>\n<span class="n">b</span>'
    >>> print(repr(pygments.highlight("a\nb\n", lexer, formatter)))
    '<span class="n">a</span>\n<span class="n">b</span>'
    

    I think the correct behavior would be this:

    >>> print(repr(pygments.highlight("a\nb", lexer, formatter)))
    '<span class="n">a</span>\n<span class="n">b</span>'
    >>> print(repr(pygments.highlight("a\nb\n", lexer, formatter)))
    '<span class="n">a</span>\n<span class="n">b</span>\n'
    

    I think I was mistaken about the last \n not being significant. Consider this:

    <!DOCTYPE html>
    <html lang="en">
      <head>
        <meta charset="utf-8">
        <title>Test</title>
      </head>
      <body>
    <div class="highlight"><pre><span class="n">a</span>
    <span class="n">b</span></pre></div>
    
    <div class="highlight"><pre><span class="n">a</span>
    <span class="n">b</span>
    
    
    
    </pre></div>
      </body>
    </html>
    

    The second pre will be taller than the first one. Hence I think the best solution would be to preserve spaces exactly as in the input.
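
Until the formatter preserves trailing newlines itself, one way to get that behavior is to post-process its output. This is only a sketch: match_trailing_newlines is a hypothetical helper, not part of Pygments, and it assumes that any trailing newlines in the formatter output sit outside the span tags, as in the outputs quoted in this thread.

```python
def match_trailing_newlines(source, html):
    """Make `html` end with exactly as many newlines as `source` does."""
    trailing = len(source) - len(source.rstrip("\n"))
    return html.rstrip("\n") + "\n" * trailing


# Formatter outputs copied from the examples earlier in this thread.
with_newline = '<span class="n">a</span>\n<span class="n">b</span>\n'
without_newline = '<span class="n">a</span>\n<span class="n">b</span>'

for source in ("a\nb", "a\nb\n"):
    for html in (with_newline, without_newline):
        print(repr(source), "->", repr(match_trailing_newlines(source, html)))
```

Whichever of the two outputs the formatter produces, the result ends with no newline for "a\nb" and with exactly one for "a\nb\n".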

  15. Anteru

    The question is if trailing newlines are really desirable. It sounds like this wants an option like “trim trailing whitespace” or something, but that’s too much for a 2.4.1 release IMHO. If you can work around this, then I’ll revert it, but I still don’t know how to solve this bug nicely 🙂

  16. Anteru

    Note that I made it omit the trailing newline for nowrap only, so the example you wrote won’t happen by default. If the parser sends down newlines to the formatter, you should get the intermediate ones, just not the last one.
