Ezhil Language Lexer for Pygments

#443 Merged at 5e4778f
  1. Write Code in Tamil!

This pull request adds Pygments support for Ezhil, a programming language written in Tamil script (Tamil is an Indian language) and developed at http://ezhillang.org.

Code changes to Pygments:

  1. Ezhil Language Lexer for Pygments
  2. Unittests for 1
  3. Example Ezhil file

  • Issues #1: New squid.conf lexer resolved

Comments (11)

  1. Write Code in Tamil! author

    Thanks for your detailed review +David Corbett. I will update the pull request and let you merge it. It may take a few days. Thanks for your patience. -Muthu

  2. Write Code in Tamil! author

    I made most of the updates you mentioned, except moving the keywords; I'm not sure what the issue is there. Also, we don't use Tamil-style numbers in Ezhil, only Indian/Arabic numbers like everyone else.

    Thanks for the detailed comments and the research on your part, David. I now have a trimmed-down Pygments lexical analyzer for Ezhil.

  3. David Corbett

    Apparently puḷḷis and vowel signs don’t count as alphabetic, which is why r'\b' doesn’t work after Tamil words. Instead, you should use a negative lookahead suffix, i.e. words((u'பதிப்பி', u'தேர்ந்தெடு', ...), suffix=r'(?![^][ ,\t\r\n/\-+^=*)(><&|!%{};\'"$@#])'). This applies to both keywords and operators: everything that is written in Tamil.
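    The difference between r'\b' and the lookahead suffix can be checked with the standard re module alone. The following is a minimal sketch, not the lexer's actual code: it uses only the two keywords named above, builds the alternation by hand instead of calling words(), and the names SUFFIX, KEYWORDS and ALTERNATION are made up for illustration.

```python
import re

# Suffix quoted from the review comment: the keyword must be followed by one
# of the listed delimiter characters or by the end of the input.
SUFFIX = r'(?![^][ ,\t\r\n/\-+^=*)(><&|!%{};\'"$@#])'

# Two of the Tamil keywords named in the comment (the full list is elided there).
KEYWORDS = ("பதிப்பி", "தேர்ந்தெடு")
ALTERNATION = "(?:" + "|".join(re.escape(k) for k in KEYWORDS) + ")"

with_word_boundary = re.compile(ALTERNATION + r"\b")
with_lookahead = re.compile(ALTERNATION + SUFFIX)

line = "பதிப்பி x"

# The keyword ends in a vowel sign, which \w does not match, so there is no
# word boundary between it and the following space.
print(with_word_boundary.match(line))

# The lookahead only checks that the next character is a delimiter.
print(with_lookahead.match(line))
```

    In a Pygments lexer the same suffix string would be passed as words(..., suffix=...), as described above.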

    The following script works with the latest version of ez:

    é.€ = 123
    பதிப்பி é.€

    It prints 123, as expected. However, the lexer does not recognize the identifier (é.€). This identifier is valid because any isalpha character is a legal starting character, which is not restricted to ASCII.

    _taletters should be defined inside the lexer. Also, you use it as a character set, but you define it as a mishmash of a character set and a list of disjunctions. A character set should only include single characters, not graphemes like ஹௌ. If you want a character class which matches all Tamil characters, ASCII letters, and underscore, you could use u'A-Z_a-zஃஅ-ஊஎ-ஐஒ-கஙசஜஞடணதநனபம-ஹா-ூெ-ைொ-்', but as I’ve mentioned above, that isn’t actually what Ezhil currently recognizes.
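    To see why a grapheme does not belong inside a character set, here is a small sketch using only the re module. The grapheme is written with escapes so the code points are unambiguous; the names HAU, as_char_set and as_disjunction are illustrative and do not come from the lexer.

```python
import re

# The grapheme ஹௌ is two code points: HA (U+0BB9) + vowel sign AU (U+0BCC).
HAU = "\u0bb9\u0bcc"

# Inside [...] the grapheme falls apart into two independent set members,
# so the set matches either code point alone but never the pair.
as_char_set = re.compile("[" + HAU + "]")

# As one alternative of a disjunction the grapheme is matched as a whole.
as_disjunction = re.compile("(?:" + HAU + "|\u0bb9)")

print(len(HAU))
print(as_char_set.fullmatch(HAU))
print(as_char_set.fullmatch("\u0bcc"))
print(as_disjunction.fullmatch(HAU))
```

    Mixing both styles in one definition is therefore ambiguous: the same text means different things depending on whether it ends up inside brackets or between | separators.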

    The following script works:

    x= "Hello"
    பதிப்பி x

    but the lexer does not recognize the space character " ". (Any isspace character is treated as whitespace.) The whitespace regex should be r'(?u)\s+'.
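    A quick way to check what the suggested whitespace regex accepts is sketched below. The sample characters are illustrative picks of non-ASCII spaces, not taken from the script above. On Python 3 a str pattern is already Unicode-aware, so the inline (?u) is optional there; the ASCII-only pattern is included purely for contrast.

```python
import re

# Pattern suggested in the comment.
WHITESPACE = re.compile(r'(?u)\s+')

# ASCII-only matching, shown for contrast.
ASCII_WHITESPACE = re.compile(r'\s+', re.ASCII)

# Illustrative non-ASCII spaces: NO-BREAK SPACE, EM SPACE, IDEOGRAPHIC SPACE.
spaces = "\u00a0\u2003\u3000"

print(WHITESPACE.fullmatch(spaces) is not None)
print(ASCII_WHITESPACE.fullmatch(spaces) is not None)
```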

    The string regex should be r'".*?"' in case there are ever two strings on the same line.
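    The effect of the non-greedy quantifier can be sketched as follows. The greedy pattern is only an assumed counterexample for comparison (the lexer's earlier string regex is not quoted in this thread), and the sample line is made up.

```python
import re

# Two string literals on one line.
line = 'a = "x"; b = "y"'

lazy = re.findall(r'".*?"', line)    # pattern suggested in the comment
greedy = re.findall(r'".*"', line)   # assumed greedy variant, for contrast

print(lazy)
print(greedy)
```

    The non-greedy form stops at the first closing quote, so each literal becomes its own token; the greedy form swallows everything between the first and the last quote on the line.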

  4. Write Code in Tamil! author

    A few points to address your comments:

    1. We can think of the current Ezhil lexer implementation as a concrete expression of the canonical language, and it is somewhat buggy. I will log your cases as bugs against Ezhil.

    2. Given the separation in point 1, we can write a Pygments lexer for Ezhil that targets the canonical Ezhil language.

    3. While the regex is something I haven't seen or used before, I could try it. My preference, though, was to define _taletters as a disjunction (an OR-ed list of options), since that groups Tamil letters correctly; the Unicode ordering for Tamil makes the codepoint -> name mapping somewhat complex. See the Unicode Tamil FAQ.

    4. I can make other Ezhil Pygments lexer updates and we can review it again.

    Thanks again for your detailed comments, David.


  5. Write Code in Tamil! author

    Hello David,

    I have updated the items as you recommended. I still prefer the disjunction, as explained in my previous comment, and I feel the nuances of the Tamil Unicode spec require it. I hope you find this sufficient to merge into Pygments.



    P.S. commit msg reproduced here.

    1. set flags to re.MULTILINE | re.UNICODE
    2. Move TALETTERS as class constant. Due to 'grapheme' mapping to Tamil letter nature, I prefer to have the disjunction regexp
    3. Ezhil lexical issues addressed within Ezhil implementation; done separately.
    4. This pygments lexer is targeted toward canonical Ezhil language.
    5. Update string regexp as reviewed r'".*?"'

  6. David Corbett

    The lexer’s docstring should follow the same format as the other lexers’ docstrings; see Inform6Lexer for example.

    The (?u)s in the number regexes are redundant with re.UNICODE in the flags, so the (?u)s can be removed.
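    The redundancy is easy to confirm. In this sketch the number pattern r'\d+' is a hypothetical stand-in, since the lexer's real number regexes are not shown in this thread; FLAGS mirrors the flags named in the commit message.

```python
import re

FLAGS = re.MULTILINE | re.UNICODE

# Hypothetical number pattern, compiled with and without the inline flag.
with_inline_flag = re.compile(r'(?u)\d+', FLAGS)
without_inline_flag = re.compile(r'\d+', FLAGS)

# ASCII digits, Arabic-Indic digits, and a non-number.
samples = ["123", "\u0661\u0662\u0663", "abc"]

for s in samples:
    a = with_inline_flag.fullmatch(s) is not None
    b = without_inline_flag.fullmatch(s) is not None
    print(repr(s), a, b)
```

    Both patterns behave identically on every sample. Note that with Unicode matching enabled, \d also accepts decimal digits outside ASCII.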

    There are many more built-in functions in ezhil.py, Interpreter.py, and EZTurtle.py.

    If the final line of a file is a comment, and there is no newline at the end of the file, and pygmentize is run with -O ensurenl=0, the comment will not be lexed right. The solution is to change the comment regex to r'#.*'.
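    A small sketch of the end-of-file case follows. The pattern that requires a trailing newline is an assumption about what the earlier comment regex looked like (it is not quoted in this thread); the sample source text is made up.

```python
import re

# Input whose last line is a comment with no newline after it.
source = 'x = 1\n# trailing comment'

requires_newline = re.compile(r'#.*\n')  # assumed earlier form
suggested = re.compile(r'#.*')           # form suggested in the comment

print(requires_newline.search(source))
print(suggested.search(source).group())

# '.' does not match a newline, so the suggested form still stops at the
# end of the line when more text follows.
print(suggested.search('# first\ny = 2').group())
```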

    The following script is not lexed right:

    வரை_ = True
    இல்_ = False
    @(வரை_||இல்_) ஆனால்
    ", "\t", "t"))

    FYI, the current implementation of Ezhil allows அ், ரிிிி, and as identifiers. It also allows " " as whitespace and ١٢٣ as a number.

  7. Write Code in Tamil! author

    Fair point. I will look into this and see how we can update the regexps.

  8. Write Code in Tamil! author

    Thanks for your interest in this pull request/problem.

    I am trying to restart work on this. One alternative, if I'm not able to provide a watertight lexer here, could be to install ezhil-lexer separately through the entry point in Pygments.

  9. Tim Hatch

    I'm picking this up and about to merge, but have a couple of license questions.

    1. We typically have a statement that says "Copyright the Pygments authors" and include your name in the AUTHORS file. Is this okay?
    2. If you're the author of the example file, can you include a statement below the copyright line (in English) allowing redistribution under the BSD license?

  10. Muthu A

    Hello Tim, Thanks for integrating the code within Pygments. Regarding your queries,

    1. I am happy to have my name included as part of the Authors package.
    2. I am the author of the example file, and "I allow distribution of the Ezhil program example with rest of Pygments package, under the BSD license"