JavaScanner Exception for Unicode Escape Codes
I am using java8/extendj.jar built from d4d25af7 (giving ExtendJ 8.1.2-15-gd4d25af Java SE 8).
When tokenizing code that contains Unicode escapes using org.extendj.scanner.JavaScanner, the nextToken() method throws an exception with the message: illegal escape sequence "\u".
I had a brief look at the grammar definition and I don't see Unicode escapes in there.
Find attached an example project containing a test class that triggers the exception. Just execute ./gradlew test.
As in the passing test cases I would expect JavaScanner to produce a token whose value is a String containing the Unicode escape sequence.
Comments (7)
-
-
reporter Ah I see, my bad for not spotting that class. I suppose you should always wrap the Reader given to JavaScanner in that then?
Unfortunate that this does not fix the issue. An IOException is actually a lot worse: currently we work around the problem by catching the JFlex exception, but an IOException is not something we really want to catch. It should indicate some unfixable problem (e.g. disk access failed).
Do you think this is an issue you can resolve?
-
Yes, I will change the Unicode class so that it blocks on read() instead of returning zero!
If you need a quick workaround, the JFlex FAQ mentions one that uses another reader class to wrap a base reader and block on read: https://jflex.de/faq.html
-
Using ZeroReader to wrap Unicode, the following works without exception:

    @Test
    public void escapedEscapedUnicodeChar() throws Exception {
        JavaScanner scanner = new JavaScanner(
            new ZeroReader(new Unicode(new StringReader("\"\\uD83D\\uDE2F\""))));
        tokenize(scanner);
    }

See https://github.com/jflex-de/jflex/tree/master/jflex/examples/zero-reader
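
For readers who cannot pull in the JFlex example class, the idea behind the workaround can be sketched in a few lines. This is only an illustration, not the JFlex ZeroReader itself: the names NonZeroReader and StutteringReader are made up here, and StutteringReader merely imitates a filtering reader that sometimes reports zero characters. The sketch assumes the wrapped reader eventually returns data or end of stream; otherwise the retry loop would spin.

```java
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Hypothetical wrapper (not the JFlex ZeroReader): hides zero-length reads. */
class NonZeroReader extends FilterReader {
    NonZeroReader(Reader in) {
        super(in);
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (len == 0) {
            return 0; // the caller asked for nothing, so zero is the right answer
        }
        int n;
        do {
            n = in.read(cbuf, off, len); // retry while the wrapped reader yields nothing
        } while (n == 0);
        return n; // a positive count, or -1 at end of stream
    }
}

/** Hypothetical stub: every other call returns 0, imitating a filtering reader. */
class StutteringReader extends Reader {
    private final Reader in;
    private boolean stall = true;

    StutteringReader(Reader in) {
        this.in = in;
    }

    @Override
    public int read(char[] cbuf, int off, int len) throws IOException {
        if (stall) {
            stall = false;
            return 0; // the problematic zero-length read
        }
        stall = true;
        return in.read(cbuf, off, len);
    }

    @Override
    public void close() throws IOException {
        in.close();
    }
}

class NonZeroReaderDemo {
    public static void main(String[] args) throws IOException {
        Reader r = new NonZeroReader(
            new StutteringReader(new StringReader("\"\\uD83D\\uDE2F\"")));
        char[] buf = new char[4];
        for (int n = r.read(buf, 0, buf.length); n != -1; n = r.read(buf, 0, buf.length)) {
            System.out.println("read " + n + " chars");
        }
    }
}
```

Only the array-based read is overridden in this sketch; a wrapper meant for general use would probably want to treat the single-character read() the same way.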
-
reporter Alright, nice :) Thank you for looking into this so quickly!
-
I have updated the Unicode class in ExtendJ and renamed it to UnicodeEscapeReader. I deprecated the old Unicode class, but it will just delegate to the new UnicodeEscapeReader for the time being.
I also added tests that mirror your test to ensure that we don't get the above-mentioned JFlex scanner errors due to read() reading zero-length character sequences.
-
- changed status to resolved
Improve Unicode escape handling
Renamed class Unicode in package org.extendj.scanner to UnicodeEscapeReader, to better describe what the class does. There is now a deprecated class in place of the old Unicode class which just delegates to UnicodeEscapeReader.
UnicodeEscapeReader improves upon the old Unicode class in a few ways:
- read(char[], int, int) will read zero characters in fewer cases, working better with our JFlex scanners (which throw an error if zero characters are read).
- Simplified the implementation to only do Unicode escape translation in one place.
fixes #310 → <<cset cff8b8990e80>>
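
As a rough illustration of what "Unicode escape translation" means here, the following sketch applies the rule from the Java Language Specification (section 3.3) to a whole string: a backslash preceded by an even number of backslashes, followed by one or more u characters and four hex digits, becomes a single UTF-16 code unit. It is not the UnicodeEscapeReader code; the class and method names are hypothetical, and a real reader would do this incrementally on a stream rather than on a String.

```java
/** Hypothetical helper illustrating JLS 3.3 translation; not ExtendJ code. */
class UnicodeEscapeSketch {
    static String translate(String src) {
        StringBuilder out = new StringBuilder(src.length());
        int backslashes = 0; // contiguous raw backslashes directly before position i
        int i = 0;
        while (i < src.length()) {
            char c = src.charAt(i);
            int j = i + 1;
            while (c == '\\' && j < src.length() && src.charAt(j) == 'u') {
                j++; // skip the one or more 'u' markers
            }
            boolean isEscape = c == '\\' && backslashes % 2 == 0 && j > i + 1;
            if (!isEscape) {
                out.append(c);
                backslashes = (c == '\\') ? backslashes + 1 : 0;
                i++;
                continue;
            }
            if (j + 4 > src.length()) {
                throw new IllegalArgumentException("truncated Unicode escape at index " + i);
            }
            int value = 0;
            for (int k = j; k < j + 4; k++) {
                char h = src.charAt(k);
                int digit = h < 128 ? Character.digit(h, 16) : -1; // ASCII hex digits only
                if (digit < 0) {
                    throw new IllegalArgumentException("bad hex digit at index " + k);
                }
                value = value * 16 + digit;
            }
            out.append((char) value); // one UTF-16 code unit per escape
            backslashes = 0;          // the produced character does not feed further escapes
            i = j + 4;
        }
        return out.toString();
    }
}
```

With this sketch, the six-character raw input \u0041 translates to A, while \\u0041 (two backslashes) is left untouched because the second backslash is preceded by an odd number of backslashes.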
There is a separate class, org.extendj.scanner.Unicode, which takes care of filtering Unicode escapes. It inherits from FilteredReader, so you can use it like this:
However, it does not seem to work perfectly with your example code because apparently JFlex (JavaScanner) expects all read() calls to read non-zero numbers of characters, but Unicode can read zero characters in some cases: