Tokenizer does not return tokens for some characters.
Issue #752 (resolved)
Hello,
While working with the Tokenizer, I realized it does not return tokens for some characters.
For instance, calling Tokenizer.tokenize("®™℠£¥€", LocaleId.ENGLISH)
yields no tokens at all.
The documentation page on the Tokenization step does list a CURRENCY category, but its description reads "Sums in US dollars.", which makes me think that leaving the currency symbols £, ¥, and € untokenized may be intended.
Still, shouldn't all of these symbols produce tokens, even if only with a token type of UNCATEGORIZED?
Below you can find a code snippet that illustrates this. I ran this test with the latest 0.37-SNAPSHOT (19-Sep-2018).
Thanks.
import net.sf.okapi.common.LocaleId;
import net.sf.okapi.steps.tokenization.Tokenizer;
import net.sf.okapi.steps.tokenization.tokens.Tokens;

class Example
{
    public static void main(String[] args)
    {
        // One ordinary sentence containing a symbol, and one string of symbols only.
        String sampleText = "This ® is a symbol.";
        String symbolChars = "®™℠£¥€";
        String[] testStrings = { sampleText, symbolChars };
        for (String text : testStrings)
        {
            System.out.printf("Text to tokenize: \"%s\"%n", text);
            Tokens tokens = Tokenizer.tokenize(text, LocaleId.ENGLISH);
            System.out.println("Tokens:");
            System.out.println(tokens.toString());
            System.out.println();
        }
    }
}
// OUTPUT:
// Text to tokenize: "This ® is a symbol."
// Tokens:
// WORD 18 100% This 200 0, 4 1
// WHITESPACE 17 100% 0 4, 5 10
// WHITESPACE 17 100% 0 6, 7 10
// WORD 18 50% is 200 7, 9 1
// STOPWORD 14 50% is 0 7, 9 9
// WHITESPACE 17 100% 0 9, 10 10
// WORD 18 50% a 200 10, 11 1
// STOPWORD 14 50% a 0 10, 11 9
// WHITESPACE 17 100% 0 11, 12 10
// WORD 18 100% symbol 200 12, 18 1
// PUNCTUATION 12 100% . 1 18, 19 10
//
// Text to tokenize: "®™℠£¥€"
// Tokens:
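
As the output shows, the second string produces no tokens at all. A quick sanity check, independent of Okapi, confirms these are ordinary printable symbols rather than invisible or control characters: £, ¥, and € have the Unicode general category Sc (currency symbol), and ®, ™, and ℠ have So (other symbol). A minimal plain-Java sketch (the class name is just for illustration):

class SymbolCategoryCheck
{
    public static void main(String[] args)
    {
        String symbols = "®™℠£¥€";
        symbols.codePoints().forEach(cp -> {
            // Character.getType returns the Unicode general category of the code point.
            int type = Character.getType(cp);
            String name = type == Character.CURRENCY_SYMBOL ? "CURRENCY_SYMBOL (Sc)"
                        : type == Character.OTHER_SYMBOL ? "OTHER_SYMBOL (So)"
                        : "type " + type;
            System.out.printf("U+%04X %s -> %s%n",
                    cp, new String(Character.toChars(cp)), name);
        });
    }
}

So none of these characters is in a category that would obviously justify dropping it during tokenization.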
Comments (1)
This now works. I'm guessing the Tokenizer refactor or an ICU4J upgrade resolved the issue.
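
For anyone who wants to double-check the segmentation independently, here is a minimal sketch that runs ICU4J's word BreakIterator directly over the same string (assuming ICU4J is on the classpath; the class name and output format are mine). With a recent ICU4J, each symbol should come back as its own word-break segment:

import com.ibm.icu.text.BreakIterator;
import com.ibm.icu.util.ULocale;

class IcuWordBreakCheck
{
    public static void main(String[] args)
    {
        String text = "®™℠£¥€";
        // ICU4J's rule-based word segmenter; symbols that don't join into
        // words get a boundary on each side, so each prints as one segment.
        BreakIterator bi = BreakIterator.getWordInstance(ULocale.ENGLISH);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next())
        {
            System.out.printf("[%d,%d) \"%s\"%n", start, end, text.substring(start, end));
        }
    }
}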