orange-textable / docs / rst / counting_segment_types.rst

Full commit

Counting segment types

Widget :ref:`Count` takes in input one or more segmentations and produces frequency tables such as tables 1 and 2 :doc:`here <segmentations_tables>`. To try it out, create a scheme such as illustrated on :ref:`figure 1 <counting_segment_types_fig1>` below. As usual, we will suppose that the :ref:`Text Field` instance contains a simple example. The :ref:`Segment` instance is configured for letter segmentation (Regex: \w and Output segmentation label: letters). The default configuration of the instances of :ref:`Convert` and Data Table (from the Data tab of Orange Canvas) needs not be modified for this example.

Scheme for testing the Count widget

Figure 1: Scheme for testing the :ref:`Count` widget.

Basically, the purpose of widget :ref:`Count` is to determine the frequency of segment types in an input segmentation. The label of that segmentation must be indicated in the Segmentation menu of section Units in the widget's interface, while other controls may be left in their default state for now (see :ref:`figure 2 <counting_segment_types_fig2>` below). Clicking Compute then double-clicking the Data Table instance should display essentially the same data as table 1 :ref:`here <segmentations_tables_table1>` (with possible variations in the order of columns).

Counting the frequency of letter types with widget :ref:`Count`

Figure 2: Counting the frequency of letter types with widget :ref:`Count`.

Note that checkbox Compute automatically is unchecked by default so that the user must click on Compute to trigger computations. The motivation for this default setting is that :doc:`table construction widgets <table_construction_widgets>` can be quite slow when operating on large segmentations, and it can be annoying to see computations starting again whenever an interface element is modified.

To obtain the frequency of letter bigrams (i.e. pairs of successive letters), simply set parameter Sequence length to 2 (see :ref:`table 1 <counting_segment_types_table1>` below). If the value of this parameter is greated than 1, the string specified in field Intra-sequence delimiter is inserted between successive segments for the sake of readability--which is more useful when segments are longer than individual letters. Note that in this example, word boundaries are not taken into account--nor even known, in fact--which is why bigrams as and ee have a nonzero frequency.

Table 1: Letter bigram frequency.
as si im mp pl le ex xa am
1 1 1 2 2 2 1 1 1