new lexer request: dot (graphviz)

Issue #1024 (new), created by Peter Suter

The DOT language is used by the open source graph visualization software Graphviz to represent structural information as diagrams of abstract graphs and networks.

The language grammar is described at http://www.graphviz.org/doc/info/lang.html

An example:

    digraph G {
       Hello->Pygments
    }

It would be nice to have a lexer for this in Pygments.

Comments (10)

  1. Peter Suter reporter

    A basic attempt:

    from pygments.lexer import RegexLexer, bygroups
    from pygments.token import (Comment, Keyword, Operator, Name, String,
        Number, Punctuation, Whitespace)
    
    __all__ = ['GraphvizLexer']
    
    
    class GraphvizLexer(RegexLexer):
        """
        For the Graphviz DOT graph description language.
    
        .. versionadded:: 2.3.0
        """
        name = 'Graphviz'
        aliases = ['graphviz']
        filenames = ['*.gv', '*.dot']
        mimetypes = ['text/x-graphviz']
        tokens = {
            'root': [
                (r'\s+', Whitespace),
                (r'(#|//).*?$', Comment.Single),
                (r'/(\\\n)?[*](.|\n)*?[*](\\\n)?/', Comment.Multiline),
                (r'(?i)(node|edge|graph|digraph|subgraph|strict)\b', Keyword),
                (r'--|->', Operator),
                (r'[{}[\]:;,]', Punctuation),
                (r'(\b\D\w*)(\s*)(=)(\s*)', bygroups(Name.Attribute, Whitespace, Punctuation, Whitespace), 'attr_id'),
                (r'\b(n|ne|e|se|s|sw|w|nw|c|_)\b', Name.Builtin),
                (r'\b\D\w*', Name.Tag), # node
                (r'[-]?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))', Number),
                (r'"(\\"|[^"])*?"', Name.Tag), # quoted node
                (r'<', Punctuation, 'xml'),
            ],
            'attr_id': [
                (r'\b\D\w*', String, '#pop'),
                (r'[-]?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))', Number, '#pop'),
                (r'"(\\"|[^"])*?"', String.Double, '#pop'),
                (r'<', Punctuation, ('#pop', 'xml')),
            ],
            'xml': [
                (r'<', Punctuation, '#push'),
                (r'>', Punctuation, '#pop'),
                (r'\s+', Whitespace),
                (r'[^<>\s]+', Name.Tag),  # consume runs, not single characters
            ]
        }
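
    A note on the 'xml' state: in Pygments, '#push' enters the current state again and '#pop' leaves it, so nested '<' and '>' are tracked by stack depth. The same idea can be sketched in plain Python with a counter. The function name html_string_end is hypothetical and is not part of the lexer or of Pygments; this is only a rough illustration of the nesting logic.

```python
def html_string_end(text, start):
    """Return the index just past the '>' that closes the '<' at `start`.

    Mirrors the push/pop idea of the 'xml' state: every '<' goes one level
    deeper, every '>' comes back up one level. Returns None if unbalanced.
    """
    if text[start] != '<':
        raise ValueError("expected '<' at start")
    depth = 0
    for i in range(start, len(text)):
        if text[i] == '<':
            depth += 1          # like '#push'
        elif text[i] == '>':
            depth -= 1          # like '#pop'
            if depth == 0:
                return i + 1
    return None


sample = 'label=<<b>Hi</b> there>;'
begin = sample.index('<')
print(sample[begin:html_string_end(sample, begin)])
```

    Like the lexer, this does no real XML parsing; it only balances angle brackets.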
    
  2. Peter Suter reporter

    Like what?

    It explicitly lists all the following things mentioned in the language grammar:

    • All six case-independent keywords (node, edge, graph, digraph, subgraph, strict).
    • All ten compass point values (n, ne, e, se, s, sw, w, nw, c, _).
    • All seven "punctuation" characters ({, }, [, ], :, ;, ,).
    • Both edge operators (--, ->).
    • All comment styles (/* */, //, #).
    • Whitespace.
    • All four kinds of ID:

      • Unquoted strings. (But not using the grammar's exact character ranges.)
      • Numerals.
      • Double-quoted strings ("). (Missing: backslash-newline line continuation and the + operator for concatenating strings.)
      • HTML strings (<, >). (But without real XML parsing: no & escape sequences, etc.)

    I don't see any other tokens mentioned in the grammar. The missing things (XML, multi-line escaping etc.) all seem quite exotic and unimportant to me, but feel free to add them.
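
    The individual patterns can be checked in isolation with the standard re module, without Pygments. The sketch below copies four regexes from the lexer in comment 1; the helper name full and the sample inputs are made up for illustration.

```python
import re

# Patterns copied verbatim from the lexer posted in comment 1.
KEYWORD = re.compile(r'(?i)(node|edge|graph|digraph|subgraph|strict)\b')
COMPASS = re.compile(r'\b(n|ne|e|se|s|sw|w|nw|c|_)\b')
NUMERAL = re.compile(r'[-]?((\.[0-9]+)|([0-9]+(\.[0-9]*)?))')
QUOTED = re.compile(r'"(\\"|[^"])*?"')


def full(pattern, text):
    """Return True if `pattern` matches the whole of `text`."""
    return pattern.fullmatch(text) is not None


# Keywords are case-independent.
print(full(KEYWORD, 'DiGraph'), full(KEYWORD, 'graphs'))
# Compass points, including the underscore.
print(full(COMPASS, 'ne'), full(COMPASS, '_'), full(COMPASS, 'north'))
# Numerals with optional sign and optional integer or fraction part.
print(full(NUMERAL, '-.5'), full(NUMERAL, '3.'), full(NUMERAL, '1.2.3'))
# Quoted strings with escaped quotes; '+' concatenation is not one token.
print(full(QUOTED, r'"say \"hi\""'), full(QUOTED, '"a" + "b"'))
```

    Note that fullmatch is stricter than what the lexer does: the lexer only needs a match at the current position, and rule order decides which pattern is tried first.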

    At a higher level, the strings can represent different things like attribute names, node names etc.
    The ~200 attribute names etc. are not explicitly listed.
    The node names are user defined, so can't be explicitly listed.
    But attribute and node names are distinguished implicitly by position.
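
    To illustrate what "by position" means: a name followed by = is taken as an attribute name, and any other name as a node. The sketch below reuses the two regexes from comment 1 with plain re. The function classify is hypothetical, and it ignores the keyword and compass-point rules and the separate 'attr_id' state that handles the value after the =.

```python
import re

# Both patterns are copied from the lexer in comment 1.
ATTR_NAME = re.compile(r'(\b\D\w*)(\s*)(=)(\s*)')   # name followed by '='
NODE_NAME = re.compile(r'\b\D\w*')                  # any other name


def classify(text, pos=0):
    """Classify the unquoted name starting at `pos`, or return None."""
    m = ATTR_NAME.match(text, pos)
    if m:
        return ('attribute', m.group(1))
    m = NODE_NAME.match(text, pos)
    if m:
        return ('node', m.group(0))
    return None


print(classify('color = red'))      # name followed by '='
print(classify('Hello->Pygments'))  # name followed by an edge operator
```

    The order matters: trying the attribute pattern first is what keeps color from being reported as a node.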

    I don't know whether this all works 100% correctly in all cases.
