Term Extraction excludes single characters

Issue #569 new
ysavourel created an issue

From Davide: (https://groups.yahoo.com/neo/groups/okapitools/conversations/messages/5215)

I am trying the Term Extraction feature of Rainbow. I noticed that Rainbow removes single-character words from extracted terms. I'll try to explain this with an example.

I want to extract terminology from a 40-page ODT file. I inserted the string "Test a pattern" 4 or 5 times into this text, within different sentences. Then I extracted the terms from the file using Rainbow (setting the "maximum number of words per term" to 4).

In the resulting term list I find "pattern" and "test pattern" but never "test a pattern". Note that the string "test pattern" does not occur anywhere in the original 40-page ODT file. I tried this several times (with and without stop words), in both English and Italian (using different files and test strings, obviously), and the results are always the same. There must be some built-in rule that tells Rainbow to ignore single-letter words.
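
The behaviour described above can be reproduced with a small sketch. This is not Rainbow's code; it is a hypothetical illustration that assumes the extractor discards tokens shorter than a minimum length before it builds multi-word terms. The function name `extract_terms` and the parameter `min_token_len` are invented for this example.

```python
import re
from collections import Counter


def extract_terms(text, max_words=4, min_token_len=1):
    """Count candidate terms (word n-grams) in text.

    Hypothetical sketch, not Rainbow's implementation: tokens shorter
    than min_token_len are discarded BEFORE the n-grams are built.
    """
    counts = Counter()
    # Work sentence by sentence so that terms never span a sentence break.
    for sentence in re.split(r"[.!?]+", text.lower()):
        tokens = [t for t in re.findall(r"\w+", sentence)
                  if len(t) >= min_token_len]
        # Build every n-gram of 1..max_words consecutive surviving tokens.
        for n in range(1, max_words + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return counts


sample = "First, test a pattern. Then test a pattern again."

# Single-character tokens dropped (the assumed cause of the report).
dropped = extract_terms(sample, min_token_len=2)
# Single-character tokens kept (the behaviour the reporter expects).
kept = extract_terms(sample, min_token_len=1)

print("test pattern" in dropped, "test a pattern" in dropped)
print("test pattern" in kept, "test a pattern" in kept)
```

Under that assumption, dropping "a" makes "test" and "pattern" adjacent, so the term "test pattern" is produced even though it never occurs in the source text, while "test a pattern" can never be produced.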

I am trying to extract terms (trying to isolate User Interface commands) from computer manuals, so, as you can imagine, there is quite a difference between "Test pattern" and "Test a pattern" from a terminology point of view.

Am I doing something wrong, or is there a bug in the software?

I am using Rainbow 6.0.31 on Linux, but I also tested it on a Windows 7 machine and the behaviour is the same.
