Strings, segments, and segmentations
====================================

The main purpose of Orange Textable is to build tables based on text strings. As we will see, there are several methods for importing text strings, the simplest of which is keyboard input using the :ref:`Text Field` widget (see also :doc:`Keyboard input and segmentation display <keyboard_input_segmentation_display>`). Whenever a new string is imported, it is assigned a unique identification number (called the string index) and remains in memory for as long as the widget that imported it does.

Consider the following string of 16 characters (note that whitespace counts as a character too), and let us suppose that its string index is 1:

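+-----------+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+
| Character | a |   | s | i | m | p | l | e |   | e  | x  | a  | m  | p  | l  | e  |
+-----------+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+
| Position  | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 |
+-----------+---+---+---+---+---+---+---+---+---+----+----+----+----+----+----+----+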

In this context, a segment is basically a substring of characters. Every segment has an address consisting of three elements:

  1. string index
  2. initial position within the string
  3. final position

In the case of the string *a simple example*, address (1, 3, 8) refers to the substring *simple*, (1, 12, 12) to the character *a*, and (1, 1, 16) to the entire string. The substring corresponding to a given address is called the segment's content.
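
The mapping from an address to its content can be sketched in a few lines of Python. This is only an illustration, not Orange Textable's own implementation: the ``strings`` dictionary and the ``content`` function are hypothetical names. The sketch assumes, as in the text above, that positions are 1-based and that the final position is inclusive.

```python
# Hypothetical registry mapping string indices to strings (illustration only).
strings = {1: "a simple example"}


def content(address):
    """Return the substring designated by (string index, initial, final).

    Positions are 1-based and the final position is inclusive,
    following the convention used in this section.
    """
    string_index, initial, final = address
    # Shift the initial position to Python's 0-based indexing; the inclusive
    # final position coincides with Python's exclusive slice end.
    return strings[string_index][initial - 1:final]


print(content((1, 3, 8)))    # simple
print(content((1, 12, 12)))  # a
print(content((1, 1, 16)))   # a simple example
```

Note that an address stores no text of its own: the content is always looked up in the string that the address points to.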

A segmentation is an ordered list of segments. For instance, segmentation ((1, 1, 1), (1, 3, 8), (1, 10, 16)) contains 3 word segments, ((1, 1, 1), (1, 2, 2), ..., (1, 16, 16)) contains 16 character segments, and ((1, 1, 16)) contains a single segment covering the whole string.
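
As a rough illustration, the three segmentations just mentioned can be produced with plain Python lists of address tuples. The variable names are hypothetical, and splitting words on runs of non-whitespace characters is an assumption made for this sketch, not a description of how Orange Textable segments text.

```python
import re

text = "a simple example"
string_index = 1  # assumed index, as in the example above

# One segment per run of non-whitespace characters (1-based, inclusive end).
words = [(string_index, m.start() + 1, m.end())
         for m in re.finditer(r"\S+", text)]

# One segment per character, whitespace included.
characters = [(string_index, pos, pos) for pos in range(1, len(text) + 1)]

# A single segment covering the whole string.
whole = [(string_index, 1, len(text))]

print(words)            # [(1, 1, 1), (1, 3, 8), (1, 10, 16)]
print(len(characters))  # 16
print(whole)            # [(1, 1, 16)]
```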

As shown by the word segmentation example, not every character in the string needs to be included in a segment. Moreover, a single character may belong to several segments simultaneously, as in ((1, 1, 1), (1, 1, 8), (1, 3, 8), (1, 3, 16), (1, 10, 16), (1, 3, 8)). This also shows that the order of segments in a segmentation can diverge from the order of the corresponding substrings in the string.

Exercise 1: What is the content of each of the 6 segments in the previous example? (:ref:`solution <solution_string_segments_segmentations_ex1>`)

In the previous examples, all the segments of a given segmentation refer to the same string. However, a segmentation may contain segments belonging to several distinct strings. Thus, if the string *another example* has string index 2, segmentation ((2, 1, 7), (1, 3, 16)) is perfectly valid.

Exercise 2: What is the content of the segments in the previous example? (:ref:`solution <solution_string_segments_segmentations_ex2>`)

In order to store segmentations and transmit them between widgets, Orange Textable uses the Segmentation data type. Aside from the segment addresses, this data type associates a label with each segmentation, i.e. an arbitrary string used to identify the segmentation among others. [1]
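
To make the idea of a labeled segmentation spanning several strings more concrete, here is a minimal sketch. The ``LabeledSegmentation`` class and the ``strings`` registry are hypothetical and do not reproduce Orange Textable's actual Segmentation data type. The addresses used are deliberately different from those in the exercises.

```python
from dataclasses import dataclass, field


@dataclass
class LabeledSegmentation:
    """Hypothetical stand-in for a labeled list of segment addresses."""
    label: str
    segments: list = field(default_factory=list)


# Assumed string registry; indices 1 and 2 follow the examples in this section.
strings = {1: "a simple example", 2: "another example"}

seg = LabeledSegmentation(
    label="mixed",
    segments=[(2, 9, 15), (1, 1, 1)],
)

# Resolve each address against the string it points to.
contents = [strings[i][start - 1:end] for i, start, end in seg.segments]
print(seg.label, contents)  # mixed ['example', 'a']
```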

Solution to exercise 1: a, a simple, simple, simple example, example, simple (in this order). (:ref:`back to the exercise <string_segments_segmentations_ex1>`)

Solution to exercise 2: another, simple example. (:ref:`back to the exercise <string_segments_segmentations_ex2>`)

[1] As we will see :doc:`later <annotations>`, the Segmentation data type can also store annotations associated with segments.