Square boxes around Unicode characters in Julia

Issue #1074 new
Nehal Patel
created an issue

Hi -- As demonstrated here: http://pygments.org/demo/1061745/ Pygments is putting a red box around Unicode characters when rendering Julia.

I'd be willing to help, and I had a look at the lexer for Julia, but I don't have much experience with Unicode, Python, or Pygments.

Would someone be either willing to give a few pointers or help out?

thanks!

Comments (14)

  1. David Corbett

    The rules for Julia variables are described here. The exact sets of characters are defined in jl_id_start_char and jl_id_char. Note that fl_accum_julia_symbol disallows ! in an identifier immediately before =; e.g. x!=y is three tokens: x, !=, and y.

    Unicode whitespace is defined in is_uws. This matches what \s matches in a Python regex with the re.UNICODE flag set.

    Unicode operators are defined in julia-parser.scm; cf. operators and the lists it is built from.

    Pygments has some useful Unicode regexes in unistring.py. They are for Unicode 6.3.0; I don’t know what version Julia uses. It may be possible to simplify the JuliaLexer identifier regex so that it does not need these, if it is safe to overgeneralize the regex.

    The only things in JuliaLexer you need to change are the regexes for tokens that can use Unicode: whitespace, operators, and names. There are many places in the lexer which use '[a-zA-Z_]\w*' to match names; with a more complicated Unicode-aware regex, it would be better to define a variable for names to avoid bug-prone repetition; e.g. _name = r'...' and use _name within tokens.

  2. Nehal Patel

    Hi -- I've started working on this at https://bitbucket.org/lilinjn/pygments-main. (Only a few baby steps for now)

    There are enough moving parts that it might take a bit of time...

    One thing that I don't understand in any great detail is the potential subtleties regarding Python 2 vs 3 and Unicode, as well as Pygments' preferred approach to any issues. A brief inquiry leads me to believe that there are no real issues here, but I'm not positive. For instance, if I do something like _name = u'...', will things be OK on all the platforms Pygments targets? (FYI, I'm planning on building up my regexes using the helpers from unistring.py, as David suggests.)

  3. Nehal Patel

    Also -- I'm sure this is a basic Python question, but should my reusable regexes look something like:

    _name = re.compile(ur'...', re.UNICODE)

    and then later on within tokens

    (_name, Name),

    (Hmm, what am I trying to ask...? It looks like in the current Julia lexer, regexes are specified as uncompiled patterns via "raw strings", and presumably Pygments compiles them later on. I'm not entirely sure what happens if I pass Pygments a precompiled regex -- presumably this is fine both in terms of correctness and style? If I don't precompile the regexes, I'm not sure how to tell Pygments to set the re.UNICODE flag.)

  4. Georg Brandl repo owner

    u'...' strings are fine as long as they aren't raw strings (ur'...'). The latter are not supported in Python 3.

    You should not compile any regexes yourself; Pygments only allows strings. The Unicode flag can be set either for the whole lexer with the "flags" class attribute (look around the code for examples), or in the regex with (?u). However, despite its name, UNICODE mode is not required to match Unicode strings; it only has quite narrow functionality that's not needed for many regexes (see https://docs.python.org/2/library/re.html#re.U).
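
    To illustrate how narrow the flag's effect is, here is a small sketch using plain re under Python 3 (not a Pygments lexer). The names EM_SPACE, ascii_ws, unicode_ws and literal_op are made up for this example. Note that in Python 3, str patterns are Unicode-aware by default, so re.ASCII is used here to stand in for "flag not set".

```python
import re

EM_SPACE = '\u2003'  # a Unicode whitespace character outside ASCII

# Without Unicode matching, \s only covers ASCII whitespace.
ascii_ws = re.compile(r'\s', re.ASCII)

# The inline (?u) flag makes \s (and \w, \d, \b) Unicode-aware.
unicode_ws = re.compile(r'(?u)\s')

# A literal non-ASCII character matches regardless of the flag.
literal_op = re.compile('≠', re.ASCII)

print(ascii_ws.match(EM_SPACE) is None)
print(unicode_ws.match(EM_SPACE) is not None)
print(literal_op.match('≠') is not None)
```

    So the flag matters for the whitespace rule, while a rule that spells out operator characters literally should not need it.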

  5. Nehal Patel

    I've managed to use julia's parser to create a python array of valid operator strings:

    _operators = [u'≻', u'≪', u'⪐', u'⩶', u'⫗', u'⊚', u'...', u'∤', u'⨳', u'⥣', u'⨨', u'.==', u'√', u'⪁', u'⬺', u'×',
                      u'⤃', u'⋥', u'⥭', u'⥑', u'⨹', u'≱', u'⋶', u'⪀', u'∔', u'⇏', u'⫋', u'⪸', u'↓', u'⤈', u'::', u'≚', u'⊓',
                      u'⧺', u'⊉', u'=', u'<|', u'⇵', u'⊶', u'⪆', u'⊬', u'≢', u'⫍', u'⇎', u'⪖', u'!=', u'⥗', u'⊲', u'⤖',
                      u'⪺', u'⤁', u'⧀', u'-->', u'⊎', u'⩳', u'⨭', u'⥧', u'⧴', u'.-=', u'⥢', u'⩠', u'≎', u'⤌', u'≜', u'⪝',
                      u'⊏', u'⋦', u'⭋', u'<:', u'⩦', u'⊁', u'--', u'∩', u'⬱', u'⥡', u'⩄', u'⋧', u'.<', u'≽', u'⪬', u'⥍',
                      u'⪎', u'⋢', u'?', u'⋷', u'↔', u'⥌', u'≰', u'⟶', u'⪻', u'⋝', u'⋾', u'⪭', u'⤘', u'←', u'∙', u'⩊', u'∧',
                      u'⥪', u'⪠', u'⫖', u'⭇', u'⤆', u'⋕', u'⪮', u'∌', u'.≥', u'∨', u'⇺', u'≺', u'∸', u'⨪', u'⪒', u'⤋', u'⤏',
                      u'⩒', u'⦼', u'⬵', u'≂', u'.%', u'⋡', u'⪣', u'⫐', u'±', u'⤍', u'⨣', u'⥉', u'⬻', u".'", u'⧡', u'≶',
                      u'⋭', u'⧷', u'⭄', u'≍', u':=', u'⟿', u'⪊', u'⫛', u'∓', u'⥘', u'=>', u'⩛', u'≹', u'⩰', u'&&', u'⨩',
                      u'>>>', u'⫃', u'⨺', u'∦', u'⊙', u'≬', u'⤉', u'≋', u'⬿', u'∝', u'⊃', u'⩵', u'.-', u"'", u'≦', u'⥝',
                      u'⪛', u'≗', u'⋇', u'⥤', u'⟻', u'∜', u'÷', u'↦', u'⥥', u'⭈', u'===', u'⊕', u'⨫', u'⇿', u'⤄', u'⨵',
                      u'⪙', u'⧁', u'⊗', u'⟒', u'⤝', u'⥅', u'⪓', u'→', u'⪧', u'≑', u'↚', u'⩼', u'⨇', u'⪯', u'⊈', u'⤒', u'≞',
                      u'.+=', u'⦾', u'⋄', u'⤅', u'⩔', u'-=', u'↮', u'⨲', u'.^', u'⥄', u'⊜', u'⥨', u'⩱', u'⋚', u'⊞', u'⧥',
                      u'⫔', u'$=', u'⪗', u'⩑', u'⪶', u'≝', u'⪕', u'⭁', u'⋼', u'⩘', u'⤞', u'⋊', u'∋', u'⋽', u'⊊', u'∈', u'⭊',
                      u'⦸', u'≨', u'⋴', u'>>>=', u'⪌', u'≆', u'⋑', u'⋍', u'⋩', u'∺', u'⋎', u'⪰', u'⪉', u'⪢', u'⋺', u'⇻',
                      u'.=', u'⊐', u'+', u'⊱', u'⩗', u'>=', u'⫈', u'≀', u'⋗', u'⥎', u'≌', u'∍', u'≩', u'⊣', u'⇹', u'⩭',
                      u'⤇', u'⅋', u'⪔', u'⋛', u'⦷', u'>:', u'+=', u'≾', u'⋓', u'⥕', u'⥜', u'>', u'∪', u'⫏', u'^=', u'⫑',
                      u'⤟', u'!', u'⫘', u'⋐', u'.>>', u'==', u'!==', u'⊼', u'⪥', u'⤕', u'\\', u'⫌', u'>>=', u'⩮', u'⩐',
                      u'≤', u'⩲', u'⫅', u'⟽', u'⤂', u'⋉', u'≘', u'⩯', u'⨈', u'⥈', u'⨽', u'⨧', u'⪲', u'⫹', u'.≤', u'⪇', u'⬸',
                      u'⧣', u'^', u'⊻', u'↠', u'⇴', u'⥒', u'⧶', u'⪋', u'⊷', u'%', u'⋵', u'⥙', u'≳', u'⇶', u'⬼', u'⊠', u'⇔',
                      u'⊇', u'<<', u'⪪', u'⨤', u'⪨', u'//=', u'⋒', u'≥', u'⪾', u'⊖', u'⥮', u'⫉', u'∥', u'⟺', u'⤎', u'⤓',
                      u'⟾', u'⫆', u'⟑', u'⤗', u'⫸', u'¬', u'≊', u'⨷', u'<<=', u'≉', u'≕', u'⋏', u'⊅', u'.', u'⪅', u'≅',
                      u'⊡', u'∛', u'⥐', u'⩚', u'⋜', u'||', u':', u'⥋', u'⊀', u'⩋', u'⥆', u'⋞', u'≏', u'⥏', u'⬷', u'⭉', u'~',
                      u'⟱', u'⟵', u'⪷', u'|>', u'⥟', u'⥖', u'⩾', u'⫄', u'-', u'$', u'⟼', u'⤔', u'⪹', u'≧', u'%=', u'.>',
                      u'⪟', u'⬰', u'.>=', u'⩌', u'≔', u'.\\', u'.//=', u'⊒', u'./', u'⊵', u'⩫', u'⊂', u'⭃', u'⪞', u'.!',
                      u'⥇', u'≃', u'≖', u'⩺', u'.^=', u'≯', u'⪘', u'⋪', u'⋋', u'<', u'⥛', u'⊩', u'⋿', u'⪽', u'*=', u'⩎',
                      u'*', u'≈', u'⫺', u'⋬', u'⬲', u'≷', u'⊛', u'<=', u'⪤', u'⟈', u'⥓', u'⪫', u'⫊', u'⩍', u'.*=', u'⪂',
                      u'⋠', u'⪦', u'≟', u'≠', u'≿', u'≁', u'⪳', u'&=', u'⟉', u'⨻', u'≣', u'>>', u'⬽', u'⇸', u'⫒', u'↑',
                      u'↣', u'∻', u'⩴', u'//', u'⋻', u'⋲', u'⇽', u'⊋', u'⫕', u'∘', u'∽', u'/', u'⋤', u'⇾', u'⪑', u'⨸', u'⪡',
                      u'≼', u'⭂', u'↑', u'⇒', u'↓', u'⋙', u'⨰', u'⫙', u'.//', u'->', u'⫁', u'≡', u'≓', u'⋆', u'⩂', u'⊮',
                      u'⨥', u'⋅', u'⪚', u'⋣', u'≮', u'⫎', u'⥦', u'⪿', u'⬳', u'⨼', u'⩅', u'⨴', u'⊔', u'⊰', u'⩧', u'⋖', u'⩏',
                      u'|=', u'⨦', u'≇', u'⨱', u'≲', u'⋟', u'⩬', u'⥚', u'⥰', u'⫷', u'≐', u'⪜', u'⧤', u'⩞', u'⦿', u'.+',
                      u'⫇', u'⨢', u'→', u'⧻', u'..', u'⩹', u'⥫', u'⥔', u'⩸', u'⊑', u'⬹', u'⨬', u'⩜', u'⭀', u'⥬', u'.%=',
                      u'⋹', u'⇷', u'⪼', u'./=', u'⪃', u'⤐', u'⊽', u'≛', u'⬶', u'⤑', u'⨮', u'⩣', u'⩓', u'⋘', u'⟰', u'∉',
                      u'⩟', u'⊟', u'|', u'⩢', u'⊍', u'←', u'⩃', u'⤊', u'⥞', u'≭', u'.>>>', u'⟷', u'.≠', u'⥯', u'⪱', u'⪏',
                      u'⪵', u'⪈', u'⩷', u'⩝', u'⪩', u'.<<', u'⋫', u'≸', u'⩖', u'⬴', u'⪴', u'⫂', u'⩀', u'\\=', u'⤠', u'∾',
                      u'∷', u'⪍', u'≒', u'∊', u'⟹', u'.*', u'∗', u'⊆', u'⊢', u'⊴', u'≙', u'⋳', u'⩡', u'⋌', u'⩪', u'≵', u'↛',
                      u'⫀', u'≴', u'⩁', u'.!=', u'⩿', u'⥩', u'⩻', u'⋸', u'⤀', u'⊄', u'⪄', u'⥊', u'⊘', u'⭌', u'≄', u'⬾',
                      u'⨶', u'⊳', u'.\\=', u'.<=', u'⩽', u'⫓', u'⇼', u'&', u'⩕', u'⥠', u'/=', u'≫', u'⋨']
    

    I'm not entirely sure how best to turn this list into a performant regex for Pygments. Does everything in Pygments have to be a regex? It looks like builtins are specified as an array, which would be convenient for the operators in Julia as well. Otherwise, is there a way to take an array of Python strings and create a regex for their union that properly escapes everything? There are about 630 operators (some single-character, some multi-character), so what are the performance implications of creating one huge unioned regex?
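
    One possible way to build such a union, sketched with the standard re module only: escape each operator with re.escape and sort longest-first so that, for example, >>>= is tried before >>. The helper names build_operator_regex and tokenize_operators and the short sample list are hypothetical; the full _operators list above could be passed in the same way. This says nothing about performance at 630 entries, which would need measuring.

```python
import re


def build_operator_regex(operators):
    """Join operators into one alternation, escaped and longest-first."""
    # Deduplicate, then order by descending length so that longer operators
    # win over their own prefixes (alternation picks the first branch that matches).
    unique = sorted(set(operators), key=lambda op: (-len(op), op))
    return '|'.join(re.escape(op) for op in unique)


# Small illustrative subset of the operator list.
sample_operators = ['>>>=', '>>>', '>>', '>', '>=', '.==', '==', '=',
                    '≠', '⊻', '$', '|', '||', '.']
operator_re = re.compile(build_operator_regex(sample_operators))


def tokenize_operators(text):
    """Return every operator found in text, in order of appearance."""
    return operator_re.findall(text)
```

    Since the result of build_operator_regex is an ordinary pattern string, it fits the "strings only" requirement mentioned earlier in the thread.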

  6. Nehal Patel

    @Georg -- thanks for the Unicode clarification -- I think my only use case for re.UNICODE is to make sure \s matches per the Julia spec.

    I've started actually reading the docs at http://pygments.org/docs/lexerdevelopment/ (which are very nice), and I think I can answer my previous question using the approach they describe for matching large keyword lists.

    The documentation mentions that re flags can be added to a RegexLexer (i.e. to specify re.UNICODE), but I'm not sure how to do this (I'm sure I'll figure it out sooner or later, but Python is a bit new to me).

    thanks for the clarification

  7. Anonymous

    Hi there, any progress on this issue? I've been trying (unsuccessfully) to figure out how to hide those red boxes...

    N.B. it's not just operators but also Greek (and other) letters...

    Does anyone have a temporary workaround?

  8. Nehal Patel

    Hi Zac -- the temporary workaround is to evaluate the following in a cell in your notebook:

    display("text/html",
          """
          <script type="text/javascript">
    
          \$( document ).ready(function() {
          console.log( "Fixing red boxes" );
          \$( ".err" ).css( "border", "0px solid red" );
          \$( ".err" ).css( "border-style", "none" );
          });
    
          </script>""")
    

    Scroll to the bottom of http://nbviewer.ipython.org/github/lilinjn/lilKanren/blob/master/FirstSteps.ipynb to see an example.

    (Unfortunately, once I figured out the workaround, I stopped working on the correct fix... I would say that the new commits in https://bitbucket.org/lilinjn/pygments-main represent about 33% of the total work needed to fix this properly...)

  9. Anonymous

    Hey Nehal, thanks. Unfortunately I'm using Pygments for syntax highlighting in a LaTeX document, so your fix won't apply to me (as far as I can tell). I think I'm just going to edit the style and colour the boxes white.
