Tokenizer does not return tokens for some characters.

Issue #752 resolved
Sergio Medina created an issue

Hello,

While working with the Tokenizer, I noticed that it does not return tokens for some characters. For instance, calling Tokenizer.tokenize("®™℠£¥€", LocaleId.ENGLISH) yields no tokens.

The documentation page for the Tokenization step lists a CURRENCY category, but its description is "Sums in US dollars.", which makes me wonder whether not tokenizing the currency symbols £, ¥, and € is intended.

Shouldn't all of these symbols produce tokens, even if only with a token type of UNCATEGORIZED? (A quick Unicode category check is included after the output below.)

Below you can find a code snippet that illustrates this. I ran this test with the latest 0.37-SNAPSHOT (19-Sep-2018).

Thanks.

import net.sf.okapi.steps.tokenization.Tokenizer;
import net.sf.okapi.steps.tokenization.tokens.Tokens;
import net.sf.okapi.common.LocaleId;

class Example
{
    public static void main(String[] args)
    {
        String sampleText = "This ® is a symbol.";
        String symbolsOnly = "®™℠£¥€";
        String[] testStrings = {sampleText, symbolsOnly};

        for (String text : testStrings) {
            System.out.printf("Text to tokenize: \"%s\"%n", text);
            Tokens tokens = Tokenizer.tokenize(text, LocaleId.ENGLISH);
            System.out.println("Tokens:");
            System.out.println(tokens.toString());
            System.out.println();
        }
    }
}

// OUTPUT:
// Text to tokenize: "This ® is a symbol."
// Tokens:
// WORD             18  100%    This                 200       0,    4     1
// WHITESPACE       17  100%                           0       4,    5    10
// WHITESPACE       17  100%                           0       6,    7    10
// WORD             18   50%    is                   200       7,    9     1
// STOPWORD         14   50%    is                     0       7,    9     9
// WHITESPACE       17  100%                           0       9,   10    10
// WORD             18   50%    a                    200      10,   11     1
// STOPWORD         14   50%    a                      0      10,   11     9
// WHITESPACE       17  100%                           0      11,   12    10
// WORD             18  100%    symbol               200      12,   18     1
// PUNCTUATION      12  100%    .                      1      18,   19    10
//
// Text to tokenize: "®™℠£¥€"
// Tokens:
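
Note that in the first output above, the ® at offsets 5-6 is also skipped; only the surrounding whitespace is tokenized. For what it's worth, the Unicode general categories of these characters suggest they should all be tokenizable: £, ¥, and € are currency symbols (Sc), while ®, ™, and ℠ are other symbols (So). This can be checked with plain java.lang.Character, with no Okapi or ICU4J dependency (a minimal sketch; the class name CategoryCheck is just for illustration):

class CategoryCheck
{
    public static void main(String[] args)
    {
        // Same characters as in the failing test string
        String symbols = "®™℠£¥€";
        symbols.codePoints().forEach(cp -> {
            // Character.getType returns the Unicode general category of the code point
            int type = Character.getType(cp);
            String label;
            if (type == Character.CURRENCY_SYMBOL) {
                label = "CURRENCY_SYMBOL (Sc)";
            } else if (type == Character.OTHER_SYMBOL) {
                label = "OTHER_SYMBOL (So)";
            } else {
                label = "other category (" + type + ")";
            }
            System.out.printf("U+%04X %s -> %s%n", cp, new String(Character.toChars(cp)), label);
        });
    }
}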

Comments (1)

  1. jhargrave-straker

    This now works. I'm guessing the Tokenizer refactor or an ICU4J upgrade resolved the issue:

    Text to tokenize: "®™℠£¥€"
    Tokens:
    OTHER_SYMBOL    512 ®                          0,    1
    OTHER_SYMBOL    512 ™                          1,    2
    OTHER_SYMBOL    512 ℠                          2,    3
    CURRENCY        514 £                          3,    4
    CURRENCY        514 ¥                          4,    5
    CURRENCY        514 €                          5,    6
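
    These token names match the ICU4J Unicode general categories for those code points, which fits the ICU4J-upgrade theory. A minimal check, assuming ICU4J is on the classpath (the class name IcuCategoryCheck is just for illustration):

    import com.ibm.icu.lang.UCharacter;
    import com.ibm.icu.lang.UCharacterCategory;

    class IcuCategoryCheck
    {
        public static void main(String[] args)
        {
            "®™℠£¥€".codePoints().forEach(cp -> {
                // UCharacter.getType returns the Unicode general category as ICU4J sees it
                int type = UCharacter.getType(cp);
                String label = (type == UCharacterCategory.CURRENCY_SYMBOL) ? "CURRENCY_SYMBOL"
                        : (type == UCharacterCategory.OTHER_SYMBOL) ? "OTHER_SYMBOL"
                        : "category " + type;
                System.out.printf("U+%04X -> %s%n", cp, label);
            });
        }
    }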
    