Improved heuristic for guessing S and R

Thomas Aglassinger created an issue

The current math.SLexer seems overly enthusiastic when guessing whether a piece of source code is S (or R):

    def analyse_text(text):
        return '<-' in text

Consequently, as soon as the text contains the character sequence <-, the lexer is 100% sure that it has encountered S or R source code.

In practice I found that <- is also used inside comments to "point" to a certain line of code, in ASCII diagrams, and in "alt" descriptions in XML/HTML. For example:

>>> xml_code = """
... <?xml version="1.0" encoding="UTF-8" ?>
... <some><img src="arrow_left.png" alt="<-- check this out" /></some>
... """
>>> from pygments.lexers import guess_lexer
>>> guess_lexer(xml_code)
<pygments.lexers.SLexer>
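
The false positive can be reproduced without Pygments. The following is a minimal sketch in which the current check is copied into a hypothetical standalone function, current_heuristic (a name chosen here for illustration, not part of Pygments):

```python
def current_heuristic(text):
    # Same test as the current SLexer.analyse_text(): a plain substring check.
    return '<-' in text


xml_code = """
<?xml version="1.0" encoding="UTF-8" ?>
<some><img src="arrow_left.png" alt="<-- check this out" /></some>
"""

# The arrow inside the alt attribute alone is enough to claim the text as S/R.
print(current_heuristic(xml_code))
```

The call prints True, even though the snippet contains no S or R code at all.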

While I'm not really familiar with S and R, the following changes should be an improvement:

  1. Due to the simplicity of the current heuristic, SLexer.analyse_text() could use a less bold return value than 1.0 (actually True, which is converted to 1.0), such as 0.11.
  2. From my limited understanding of R, the <- operator can only be preceded by white space, a symbol name, or a vector index (ending in ) or ]).
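
To illustrate 1.: the conversion of True to 1.0 can be pictured with a simplified sketch like the one below. It assumes that the result of analyse_text() is turned into a float and limited to the range 0.0 to 1.0. It is not the actual Pygments code, and normalize_score is a hypothetical name used only for this illustration.

```python
def normalize_score(value):
    # Falsy results (False, None, 0) are treated as "no match".
    if not value:
        return 0.0
    # Anything else is converted to float and clamped to [0.0, 1.0].
    return min(1.0, max(0.0, float(value)))


print(normalize_score(True))   # 1.0
print(normalize_score(0.11))   # 0.11
print(normalize_score(False))  # 0.0
```

Under this assumption a boolean heuristic can only ever say "certain" or "no idea", whereas a small float such as 0.11 leaves room for lexers with stronger evidence to win.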

To illustrate 2., here are a few examples from "An Introduction to R" (http://cran.r-project.org/doc/manuals/r-release/R-intro.html):

fruit <- c(5, 10, 1, 20)
names(fruit) <- c("orange", "banana", "apple", "peach")
y <- x[-(1:5)]
y[y < 0] <- -y[y < 0]

Here is an analyse_text() that would take all this into account:

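    import re

    def analyse_text(text):
        # Only count "<-" when it follows a character that can precede an
        # assignment: a symbol character, a closing ) or ], or white space.
        result = 0.0
        if re.search(r'[a-z0-9_\])\s]<-', text) is not None:
            result = 0.11
        return result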

With this modification, the XML example code from above is properly guessed:

>>> xml_code = """
... <?xml version="1.0" encoding="UTF-8" ?>
... <some><img src="arrow_left.png" alt="<-- check this out" /></some>
... """
>>> from pygments.lexers import guess_lexer
>>> guess_lexer(xml_code)
<pygments.lexers.LassoXmlLexer>
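
The proposed pattern can also be checked on its own, without Pygments, against the R examples and the XML line quoted above. This is only a sketch: proposed_heuristic and ASSIGNMENT are hypothetical names, and the regular expression is the one suggested in this issue.

```python
import re

# Pattern from the proposal: "<-" preceded by a lowercase letter, a digit,
# an underscore, a closing ] or ), or white space.
ASSIGNMENT = re.compile(r'[a-z0-9_\])\s]<-')


def proposed_heuristic(text):
    # Return a modest score instead of full confidence.
    return 0.11 if ASSIGNMENT.search(text) is not None else 0.0


r_lines = [
    'fruit <- c(5, 10, 1, 20)',
    'names(fruit) <- c("orange", "banana", "apple", "peach")',
    'y <- x[-(1:5)]',
    'y[y < 0] <- -y[y < 0]',
]
xml_line = '<some><img src="arrow_left.png" alt="<-- check this out" /></some>'

for line in r_lines:
    print(proposed_heuristic(line))   # 0.11 for each R line
print(proposed_heuristic(xml_line))   # 0.0, the arrow follows a quote
```

Note that the character class only contains lowercase letters, so an assignment written without a space after an uppercase name, such as X<-1, would not be matched by the pattern as proposed.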
