Lexical structure

A program is a sequence of tokens. Whitespace characters, comments and line feed characters can appear almost anywhere between tokens.

Tokens are analyzed by the greedy-match rule: the longest run of characters that can form a token is taken as one token. For example, the fragment catch22 is analyzed as a single token, "catch22". If you want the text to be analyzed differently, separate the tokens with whitespace characters, as in catch 22, which is analyzed as two tokens: "catch" and "22".
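The greedy-match rule can be sketched in Python rather than Kink. This is a minimal illustration, not the actual Kink lexer: the ``TOKEN`` pattern and the ``tokenize`` function are hypothetical and cover only symbols and base10 integers.

```python
import re

# Hypothetical, simplified token patterns: a symbol or a base10 integer.
# Regex quantifiers are greedy, so each match takes as many characters as it can.
TOKEN = re.compile(r"[A-Za-z_][A-Za-z0-9_?]*|[0-9][0-9_]*")


def tokenize(text):
    """Split text into tokens, skipping whitespace and line feed characters."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n":
            pos += 1
            continue
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at {pos}: {text[pos]!r}")
        tokens.append(match.group())
        pos = match.end()
    return tokens


print(tokenize("catch22"))   # ['catch22']
print(tokenize("catch 22"))  # ['catch', '22']
```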

Whitespace, comments and line feed

:dfn:`Whitespace` characters are used to make a program easy to read and to separate tokens from each other. Space (U+0020), horizontal tab (U+0009) and carriage return (U+000d) are whitespace characters.

:dfn:`Comments` are used to describe the program. A number sign # (U+0023) starts a comment. The comment continues to the end of the line, that is, up to the next line feed character or the end of the program. Comments are treated as whitespace characters.

printline(21*2)      # => 42

printline( 21 * 2 )  # => 42

# Comment line
do_something  # trailing comment

:dfn:`Line feed` (U+000a) characters are used to make a program easy to read, just like whitespace characters.

# All expressions in one line
:Num = ARGV.first.int  :Result = Num * 3  printline(Result)

# Separate expressions by line feed characters
:Num = ARGV.first.int
:Result = Num * 3
printline(Result)

Each of the following marks may have a different meaning when it follows whitespace characters or a line feed.

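========  =======================
Mark      Meaning changed by
========  =======================
``(``     Whitespace or line feed
``{``     Line feed
========  =======================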

See :ref:`the description of terminal symbols <language-syntax-terminalsymbols>` for details.

Symbol tokens

A :dfn:`symbol token` consists of a leading ASCII letter (a-zA-Z) or an underscore _ (U+005f), followed by a sequence of zero or more ASCII letters (a-zA-Z), ASCII digits (0-9), underscores _ (U+005f) and question marks ? (U+003f). There are two types of symbols: verbs and nouns.

If the first character of a symbol is a lower case letter (a-z), it is a :dfn:`verb`. Verbs are commonly used for names of variables which contain functions.

These are examples of verbs.

  • any?
  • _loop
  • getClassLoader

If the first character of a symbol is an upper case letter (A-Z) or an underscore _, it is a :dfn:`noun`. Nouns are commonly used for names of variables which contain regular values.

These are examples of nouns.

  • ArrayList
  • MAX_VALUE
  • More_lines?
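The shape of a symbol token can be written as a regular expression. The following Python sketch only checks whether a text is a well-formed symbol token; the names ``SYMBOL`` and ``is_symbol_token`` are hypothetical, and the check does not distinguish verbs from nouns.

```python
import re

# A leading ASCII letter or underscore, followed by zero or more
# ASCII letters, ASCII digits, underscores and question marks.
SYMBOL = re.compile(r"[A-Za-z_][A-Za-z0-9_?]*")


def is_symbol_token(text):
    """Return True if the whole text is a single symbol token."""
    return SYMBOL.fullmatch(text) is not None


for text in ["any?", "_loop", "getClassLoader", "ArrayList", "MAX_VALUE", "More_lines?"]:
    print(text, is_symbol_token(text))  # True for every example above

print(is_symbol_token("22catch"))  # False: a symbol cannot start with a digit
```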

Integer tokens

There are three types of :dfn:`integer tokens`: base10, base16 and base2.

An integer in the :dfn:`base10` notation consists of one or more digits (0-9). An integer in the :dfn:`base16` notation consists of a prefix 0x and one or more hexadecimal digits (0-9a-f). An integer in the :dfn:`base2` notation consists of a prefix 0b and one or more binary digits (0-1).

You can place spacing underscores _ (U+005f) after a prefix 0x or 0b, between digits, and after digits. Underscores are simply ignored.

Each of these integer tokens represents 42.

  • 42
  • 42__
  • 0042
  • 0x2a
  • 0b_10_1010

Note that octal integer notation is not supported. A sequence of digits which starts with 0 is read as a base10 integer.
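The three notations can be modeled with a short Python sketch. It assumes only the rules stated above; ``INTEGER`` and ``integer_value`` are hypothetical names, not part of Kink.

```python
import re

# One alternative per notation. Underscores may follow the prefix,
# sit between digits, or trail the digits.
INTEGER = re.compile(
    r"0x_*(?P<hex>[0-9a-f][0-9a-f_]*)"
    r"|0b_*(?P<bin>[01][01_]*)"
    r"|(?P<dec>[0-9][0-9_]*)"
)


def integer_value(token):
    """Return the number represented by an integer token."""
    match = INTEGER.fullmatch(token)
    if match is None:
        raise ValueError(f"not an integer token: {token!r}")
    for group, base in (("hex", 16), ("bin", 2), ("dec", 10)):
        digits = match.group(group)
        if digits is not None:
            # Spacing underscores are simply ignored.
            return int(digits.replace("_", ""), base)


for token in ["42", "42__", "0042", "0x2a", "0b_10_1010"]:
    print(token, integer_value(token))  # each line ends with 42
```

Note that ``0042`` goes through the base10 branch, so the leading zeros do not switch the sketch to octal.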

Decimal tokens

A :dfn:`decimal` token consists of one or more digits (0-9) which represent the integer portion, a period . (U+002e), and one or more digits (0-9) which represent the fractional portion.

In the integer portion and the fractional portion, you can place spacing underscores _ (U+005f) between digits and after digits. Underscores are simply ignored.

These are examples of decimals.

  • 0.0
  • 0.001
  • 3.141_592_653
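A decimal token can be read in the same way. The sketch below is in Python and uses ``decimal.Decimal`` to keep the value exact; ``DECIMAL`` and ``decimal_value`` are hypothetical names, and how Kink itself stores the value is not assumed here.

```python
import re
from decimal import Decimal

# Integer portion, a period, and fractional portion.
# Both portions need at least one digit and may contain spacing underscores.
DECIMAL = re.compile(r"([0-9][0-9_]*)\.([0-9][0-9_]*)")


def decimal_value(token):
    """Return the number represented by a decimal token."""
    match = DECIMAL.fullmatch(token)
    if match is None:
        raise ValueError(f"not a decimal token: {token!r}")
    integer_part, fraction_part = (g.replace("_", "") for g in match.groups())
    return Decimal(f"{integer_part}.{fraction_part}")


print(decimal_value("3.141_592_653"))  # 3.141592653
```

Under these rules a text such as ``5.`` or ``.5`` is not a decimal token, because one of the portions has no digit.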

String tokens

There are two types of :dfn:`string tokens`: simple string tokens and rich string tokens.

In a :dfn:`simple string token`, any characters between the two single quotation marks ' (U+0027) are the content of the string. If you want to include a quotation mark itself in the string, put two consecutive quotation marks.

These are examples of simple strings.

  • 'Hello world'
  • 'Let''s go!' (it represents "Let's go!")
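The doubling rule for quotation marks can be sketched in Python as follows. ``simple_string_value`` is a hypothetical helper which takes the token text, including the surrounding quotation marks, and returns the content of the string.

```python
import re

# A quotation mark, then any number of doubled quotation marks
# or other characters, then a closing quotation mark.
SIMPLE_STRING = re.compile(r"'(?:''|[^'])*'")


def simple_string_value(token):
    """Return the content of a simple string token."""
    if SIMPLE_STRING.fullmatch(token) is None:
        raise ValueError(f"not a simple string token: {token!r}")
    # Drop the surrounding marks, then turn each doubled mark into one.
    return token[1:-1].replace("''", "'")


print(simple_string_value("'Hello world'"))  # Hello world
print(simple_string_value("'Let''s go!'"))   # Let's go!
```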

In a :dfn:`rich string token`, characters between the two double quotation marks " (U+0022) are the content of the string. In the token, a sequence of characters prefixed by a backslash \ (U+005c) represents a special character, such as a line feed (\n) or a double quotation mark (\").

These are examples of rich string tokens.

  • "Hey! ho! let's go!"
  • "GET /index.html HTTP/1.1\r\nHost: host.example.org\r\n"

Here is a list of :dfn:`backslash notations`.

============  =========  ===========
Notation      Unicode    Description
============  =========  ===========
``\0``        U+0000     Null character
``\a``        U+0007     Bell
``\b``        U+0008     Backspace
``\t``        U+0009     Horizontal tab
``\n``        U+000a     Line feed
``\v``        U+000b     Vertical tab
``\f``        U+000c     Form feed
``\r``        U+000d     Carriage return
``\e``        U+001b     Escape
``\"``        U+0022     Double quotation mark ``"``
``\\``        U+005c     Backslash ``\``
``\uxxxx``    U+xxxx     Character specified by Unicode. xxxx are four hexadecimal digits (0-9a-f).
``\Uxxxxxx``  U+xxxxxx   Character specified by Unicode. xxxxxx are six hexadecimal digits (0-9a-f).
============  =========  ===========
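The backslash notations listed above can be expanded with a small Python sketch. ``SIMPLE_ESCAPES``, ``ESCAPE`` and ``rich_string_value`` are hypothetical names. The sketch rejects unknown notations, but it does not check for unescaped double quotation marks or a dangling backslash inside the token.

```python
import re

# Single-character notations and the characters they stand for.
SIMPLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "t": "\t", "n": "\n", "v": "\v",
    "f": "\f", "r": "\r", "e": "\x1b", '"': '"', "\\": "\\",
}

# \Uxxxxxx (six hex digits), \uxxxx (four hex digits), or a backslash
# followed by any single character.
ESCAPE = re.compile(r"\\(?:U([0-9a-f]{6})|u([0-9a-f]{4})|(.))", re.DOTALL)


def rich_string_value(token):
    """Return the content of a rich string token, expanding backslash notations."""
    if len(token) < 2 or token[0] != '"' or token[-1] != '"':
        raise ValueError(f"not a rich string token: {token!r}")

    def expand(match):
        long_hex, short_hex, simple = match.groups()
        if long_hex or short_hex:
            return chr(int(long_hex or short_hex, 16))
        if simple not in SIMPLE_ESCAPES:
            raise ValueError(f"unknown backslash notation: \\{simple}")
        return SIMPLE_ESCAPES[simple]

    return ESCAPE.sub(expand, token[1:-1])


print(repr(rich_string_value(r'"a\tb"')))  # 'a\tb'
```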

Regex tokens

A :dfn:`regex token` consists of a leading percent character % (U+0025), and a following string token, simple or rich.

These are examples of regex tokens.

  • %'[A-Z][_A-Za-z0-9?]*' (pattern of noun symbols)
  • %"'(''|[^'])*'" (pattern of simple strings)
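The two patterns above can be tried out in Python. This assumes that the constructs used here (character classes, alternation, grouping and ``*``) behave the same way in Python's ``re`` module as in Kink regexes; the variable names are hypothetical.

```python
import re

# Bodies of the two regex tokens above, copied as Python patterns.
NOUN = re.compile(r"[A-Z][_A-Za-z0-9?]*")
SIMPLE_STRING = re.compile(r"'(''|[^'])*'")

print(bool(NOUN.fullmatch("More_lines?")))           # True
print(bool(NOUN.fullmatch("getClassLoader")))        # False
print(bool(SIMPLE_STRING.fullmatch("'Let''s go!'")))  # True
print(bool(SIMPLE_STRING.fullmatch("'Let's go!'")))   # False
```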

Mark tokens

Here is a list of :dfn:`mark tokens`.

========  =====
Mark      Usage
========  =====
``!``     op_lognot operator
``~``     op_not operator
``=``     op_set operator
``||=``   op_logor_set operator
``&&=``   op_logand_set operator
``|=``    op_or_set operator
``^=``    op_xor_set operator
``&=``    op_and_set operator
``<<=``   op_shl_set operator
``>>=``   op_shr_set operator
``+=``    op_add_set operator
``-=``    op_sub_set operator
``*=``    op_mul_set operator
``/=``    op_div_set operator
``//=``   op_intdiv_set operator
``%=``    op_rem_set operator
``**=``   op_pow_set operator
``||``    op_logor operator
``&&``    op_logand operator
``==``    op_eq operator
``!=``    op_ne operator
``<``     op_lt operator
``>``     op_gt operator, or formal receiver
``<=``    op_le operator
``>=``    op_ge operator
``<=>``   compare operator
``=~``    op_match operator
``|``     op_or operator
``^``     op_xor operator
``&``     op_and operator
``<<``    op_shl operator
``>>``    op_shr operator
``+``     op_add operator
``-``     op_sub operator, or op_minus operator
``*``     op_mul operator, or formal rest arguments
``/``     op_div operator
``//``    op_intdiv operator
``%``     op_rem operator
``**``    op_pow operator
``..``    op_range_ii operator
``<..``   op_range_ei operator
``..<``   op_range_ie operator
``<..<``  op_range_ee operator
``:``     Variable reference
``\``     Pseudo variable
``$``     Dereference of a verb variable
``.``     Access to variables or functions
``[``     Opening bracket of a list expression
``]``     Closing bracket of a list expression
``{``     Opening brace of a function expression, or function as an actual argument (just after a call, without line feed characters)
``}``     Closing brace
``(``     Opening parenthesis for higher operation precedence, or opening parenthesis for formal arguments (just after ``{``, without line feed characters), or opening parenthesis for actual arguments (just after a verb, without whitespace or line feed characters)
``)``     Closing parenthesis
``->``    Separator between formal arguments and a body chunk
``***``   Expanding elements
========  =====
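Because many marks are prefixes of longer marks, the greedy-match rule decides how a run of mark characters is split. The Python sketch below picks the longest mark at each position. ``MARKS`` repeats the list above, and ``split_marks`` is a hypothetical helper which handles only marks and whitespace.

```python
# All mark tokens from the list above.
MARKS = [
    "!", "~", "=", "||=", "&&=", "|=", "^=", "&=", "<<=", ">>=",
    "+=", "-=", "*=", "/=", "//=", "%=", "**=", "||", "&&", "==", "!=",
    "<", ">", "<=", ">=", "<=>", "=~", "|", "^", "&", "<<", ">>",
    "+", "-", "*", "/", "//", "%", "**", "..", "<..", "..<", "<..<",
    ":", "\\", "$", ".", "[", "]", "{", "}", "(", ")", "->", "***",
]


def split_marks(text):
    """Split text into mark tokens, taking the longest mark at each position."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos] in " \t\r\n":
            pos += 1
            continue
        candidates = [mark for mark in MARKS if text.startswith(mark, pos)]
        if not candidates:
            raise ValueError(f"no mark at {pos}: {text[pos]!r}")
        longest = max(candidates, key=len)
        tokens.append(longest)
        pos += len(longest)
    return tokens


print(split_marks("**="))   # ['**=']
print(split_marks("** ="))  # ['**', '=']
print(split_marks("<..<"))  # ['<..<']
```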