Matiec fails to parse Russian strings

Issue #60 new
Павел Бельтюков created an issue

This code

CONCAT8_OUT := CONCAT('Привет', INT_TO_STRING9_OUT);

gives an error:

Parsing failed because of too many consecutive syntax errors. Bailing out!

If the code is changed to

CONCAT8_OUT := CONCAT('Hello', INT_TO_STRING9_OUT);

matiec works fine...

So there seem to be some Unicode-related issues...

UPDATE (Feb 18th 2017):

OK, I've rebuilt matiec with YYDEBUG, added an empty ieclib.txt to the temp dir, and added this plc.st:

CONFIGURATION test
  VAR_GLOBAL
    tst_string : STRING := 'Привет';
  END_VAR
END_CONFIGURATION

The log file contains this fragment:

Next token is token ASSIGN (: )
Shifting token ASSIGN (: )
Entering state 612
Reading a token: Next token is token $undefined (: )
Shifting token error (: )
Entering state 1117
Reducing stack by rule 286 (line 2685):
   $1 = nterm elementary_type_name (: )
   $2 = token ASSIGN (: )
   $3 = token error (: )
/home/anon/YAPLC/plc.st:3-28..3-28: error: invalid initial value in specification with initialization.

In iec_flex.ll I see:

common_character_representation     [\x20\x21\x23\x25\x26\x28-\x7E]|{esc_char}
double_byte_character_representation    $\"|'|{double_byte_char}|{common_character_representation}
single_byte_character_representation    $'|\"|{single_byte_char}|{common_character_representation}

I think that's why matiec can't parse Unicode strings: bytes in the range \x80-\xFF are not recognized by the tokenizer.
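To illustrate the point, here is a minimal Python sketch (not matiec code) that transcribes the character class byte for byte and lists which bytes of a literal it fails to match. It assumes plc.st is saved as UTF-8. `unmatched_bytes` is a hypothetical helper name, and {esc_char} is left out because the test strings contain no '$' escapes.

```python
import re

# Byte-level transcription of [\x20\x21\x23\x25\x26\x28-\x7E].
# {esc_char} is omitted: the sample strings contain no '$' escapes.
COMMON_CHAR = re.compile(rb"[\x20\x21\x23\x25\x26\x28-\x7E]")

def unmatched_bytes(text):
    """Return the UTF-8 bytes of `text` that the class does not match."""
    # Assumption: the source file is UTF-8 encoded.
    return [b for b in text.encode("utf-8")
            if not COMMON_CHAR.fullmatch(bytes([b]))]

print(unmatched_bytes("Hello"))                      # []
print([hex(b) for b in unmatched_bytes("Привет")])   # every byte of the literal
```

'Hello' is pure ASCII and matches completely, while each of the 12 UTF-8 bytes of 'Привет' lies in \x80-\xFF and is rejected.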

How about adding UTF-8 string support to matiec?

What would the impact on matiec be if one simply replaced

common_character_representation     [\x20\x21\x23\x25\x26\x28-\x7E]|{esc_char}

with

utf_8_start_char    [\xC0-\xDF]|[\xE0-\xEF]|[\xF0-\xF7]
utf_8_end_char     [\x80-\xBF]
utf_8_char             {utf_8_start_char}|{utf_8_end_char}
common_character_representation     [\x20\x21\x23\x25\x26\x28-\x7E]|{utf_8_char}|{esc_char}

?
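One consequence of the proposed definitions can be checked outside matiec. The Python sketch below transcribes them byte for byte (again omitting {esc_char}) and tests whether every byte of a literal would be accepted. `all_bytes_accepted` is a hypothetical helper, and the sketch says nothing about how the rest of the scanner or later stages would react.

```python
import re

# Byte-level transcription of the proposed flex definitions.
UTF_8_START = rb"[\xC0-\xDF]|[\xE0-\xEF]|[\xF0-\xF7]"
UTF_8_END = rb"[\x80-\xBF]"
PROPOSED_CHAR = re.compile(
    rb"[\x20\x21\x23\x25\x26\x28-\x7E]|" + UTF_8_START + rb"|" + UTF_8_END
)

def all_bytes_accepted(data):
    """True if every single byte of `data` matches the proposed class."""
    return all(PROPOSED_CHAR.fullmatch(bytes([b])) for b in data)

print(all_bytes_accepted("Привет".encode("utf-8")))  # True
print(all_bytes_accepted(b"\x9f\xd0"))               # True
print(all_bytes_accepted(b"\xff"))                   # False
```

Because the rule matches one byte at a time, it accepts well-formed UTF-8 but also malformed input such as a continuation byte followed by a lone start byte (`b"\x9f\xd0"`). Only bytes \xF8-\xFF are still rejected.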

If matiec can be made to accept Unicode strings in this fashion, we could add a hook at stage 3 to check them. Since UTF-8 is self-synchronizing, we could even drop invalid bytes and give the user a warning...
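A rough sketch of what such a check could do, written in Python for brevity rather than as matiec (C++) code. `sanitize_utf8` is a hypothetical name, and the sketch relies on Python's built-in UTF-8 decoder instead of a hand-written validator.

```python
import warnings

def sanitize_utf8(raw):
    """Drop bytes that are not part of well-formed UTF-8 and warn about it."""
    # errors="ignore" skips each invalid byte and resumes at the next one,
    # which is possible because UTF-8 is self-synchronizing.
    cleaned = raw.decode("utf-8", errors="ignore").encode("utf-8")
    if cleaned != raw:
        warnings.warn(
            "string literal contained %d invalid UTF-8 byte(s); dropped"
            % (len(raw) - len(cleaned))
        )
    return cleaned

good = "Привет".encode("utf-8")
print(sanitize_utf8(good) == good)                  # True
print(sanitize_utf8(b"\xff" + good) == good)        # True, with a warning
```

A well-formed literal passes through unchanged. A literal with a stray byte comes back without it, and the user gets a warning.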

Actually, we don't need to check string correctness at all, as the C compiler or even the PLC runtime can be responsible for this.
