UTF-8 Support

Issue #1 resolved
Sean Kauffman repo owner created an issue

nfer should support specifications and events with non-ASCII characters.

This may be fairly simple. I believe that supporting non-ASCII characters in specifications only requires the parser to accept a wider range of characters. Tests certainly need to be added.

Comments (3)

  1. Sean Kauffman reporter

    This post has a good suggestion for building Flex support for UTF-8 identifiers. https://stackoverflow.com/questions/9611682/flexlexer-support-for-unicode

    The gist of it is that you need to add character classes like the following:

    ASC     [\x00-\x7f]
    ASCN    [\x00-\t\v-\x7f]
    U       [\x80-\xbf]
    U1      [\x80-\x8f]
    U2      [\xc2-\xdf]
    U3      [\xe0-\xef]
    U4      [\xf0-\xf7]
    
    UANY    {ASC}|{U2}{U}|{U3}{U}{U}|{U4}{U1}{U}{U}
    UANYN   {ASCN}|{U2}{U}|{U3}{U}{U}|{U4}{U1}{U}{U} 
    UONLY   {U2}{U}|{U3}{U}{U}|{U4}{U1}{U}{U}
    

    Here, UANY matches any ASCII or UTF-8 character, UANYN matches the same set except newlines, and UONLY matches only non-ASCII (multi-byte) characters.
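
To see what these classes accept, here is a minimal Python sketch that transcribes them into byte-level regular expressions and tries them against a few UTF-8 encoded strings. It is only an illustration, not the nfer lexer: the helper names `alt` and `matches` and the identifier rule `IDENT` are hypothetical and do not come from the nfer grammar. One thing the sketch makes visible is that, as written above, the four-byte branch requires its second byte to fall in \x80-\x8f, so characters such as U+1F600 (encoded as F0 9F 98 80) are not matched. Whether that restriction is intended is not stated in this thread.

```python
import re

# Byte-level classes transcribed from the Flex definitions in this comment.
ASC = rb"[\x00-\x7f]"
ASCN = rb"[\x00-\t\v-\x7f]"   # ASCII without the newline byte 0x0a
U = rb"[\x80-\xbf]"           # continuation byte
U1 = rb"[\x80-\x8f]"
U2 = rb"[\xc2-\xdf]"          # lead byte of a 2-byte sequence
U3 = rb"[\xe0-\xef]"          # lead byte of a 3-byte sequence
U4 = rb"[\xf0-\xf7]"          # lead byte of a 4-byte sequence


def alt(*branches: bytes) -> bytes:
    """Join patterns into one non-capturing alternation (hypothetical helper)."""
    return b"(?:" + b"|".join(branches) + b")"


MULTI = [U2 + U, U3 + U + U, U4 + U1 + U + U]
UANY = alt(ASC, *MULTI)
UANYN = alt(ASCN, *MULTI)
UONLY = alt(*MULTI)

# Hypothetical identifier rule for illustration only: a letter, underscore,
# or non-ASCII character, followed by any number of those or digits.
IDENT = alt(rb"[A-Za-z_]", UONLY) + alt(rb"[A-Za-z0-9_]", UONLY) + b"*"


def matches(pattern: bytes, text: str) -> bool:
    """Return True if the UTF-8 encoding of text matches pattern in full."""
    return re.fullmatch(pattern, text.encode("utf-8")) is not None


if __name__ == "__main__":
    print(matches(UANY, "a"), matches(UONLY, "a"))       # ASCII character
    print(matches(UANY, "\n"), matches(UANYN, "\n"))     # newline
    print(matches(UONLY, "é"), matches(UONLY, "中"))      # 2- and 3-byte characters
    print(matches(UANY, "\U0001F600"))                   # 4-byte, second byte 0x9f
    print(matches(UANY, "\U00100000"))                   # 4-byte, second byte 0x80
    print(matches(IDENT, "事件_1"), matches(IDENT, "1abc"))
```

In Flex itself, a rule built the same way would combine `{UONLY}` with the existing ASCII identifier characters, since Flex matches bytes rather than code points.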

  2. Sean Kauffman reporter
    • changed status to open

    I have spent some time looking at this and have suggested at least one change. As a result, I am changing its status to open.

  3. Sean Kauffman reporter

    This has been implemented (I'm not sure why the related commit message did not resolve the issue or appear here). A functional test that uses Chinese characters in event names was added to demonstrate that it works.
