Some short entries are not indexed in Pensieve
Original [issue 159](https://code.google.com/p/okapi/issues/detail?id=159) created by @ysavourel on 2011-01-17T12:03:51.000Z:
Entries like "From" are not retrieved when doing search on Pensieve.
See email here: http://tech.groups.yahoo.com/group/okapitools/message/1746
Guess: such entries get weeded out during indexing because they are made up entirely of stop words?
Comments (11)
-
Comment [2.](https://code.google.com/p/okapi/issues/detail?id=159#c2) originally posted by @ysavourel on 2011-01-17T13:21:58.000Z:
Maybe a possible workaround could be to detect in searchFuzzy() that the fQuery variable is empty after tokenization, and call searchExact() or some other method that would at least give a hit for an exact or case-insensitive match?
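A minimal sketch of that workaround, using toy stand-ins (the stop list, TM contents, and method bodies here are all hypothetical, not the real Pensieve API): when tokenization stop-lists the entire query, fall back to a case-insensitive exact lookup so short entries like "From" still get a hit.

```java
import java.util.*;

public class EmptyQueryFallback {
    // Hypothetical stop list standing in for Pensieve's n-gram stop list.
    static final Set<String> STOP = Set.of("from", "to", "of", "a");

    // Stand-in for the stop-listed tokenization that produces fQuery.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!STOP.contains(w)) tokens.add(w);
        }
        return tokens;
    }

    // Toy TM: just a list of source segments.
    static final List<String> TM = List.of("From", "Print the page");

    static List<String> searchExact(String query) {
        List<String> hits = new ArrayList<>();
        for (String seg : TM) {
            if (seg.equalsIgnoreCase(query)) hits.add(seg);
        }
        return hits;
    }

    static List<String> searchFuzzy(String query) {
        List<String> fQuery = tokenize(query);
        if (fQuery.isEmpty()) {
            // Whole query was stop-listed (e.g. "From"):
            // fall back so short entries still get at least an exact hit.
            return searchExact(query);
        }
        // ...the real fuzzy n-gram search would go here...
        return Collections.emptyList();
    }

    public static void main(String[] args) {
        System.out.println(searchFuzzy("From")); // [From]
    }
}
```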
-
Status changed to open.
Comment [3.](https://code.google.com/p/okapi/issues/detail?id=159#c3) originally posted by @ysavourel on 2011-01-18T19:16:02.000Z:
Probably the easiest fix is to detect an empty string after tokenization (where n-grams are stop-listed). If we find an empty string, we can retokenize without stop listing. This should be an option or, if that's not easy enough, we can wrap a new tokenizer.
Will try to take care of this Friday after RCP training. Maybe sooner.
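The retokenize-without-stop-listing idea can be sketched like this (illustrative only; the flag, stop list, and method names are invented for the example, not taken from the Pensieve source):

```java
import java.util.*;

public class RetokenizeSketch {
    // Hypothetical stop list for the example.
    static final Set<String> STOP = Set.of("from", "to", "of", "a");

    // Tokenizer with a switch: when stopList is false, nothing is filtered.
    static List<String> tokenize(String text, boolean stopList) {
        List<String> tokens = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\s+")) {
            if (!stopList || !STOP.contains(w)) tokens.add(w);
        }
        return tokens;
    }

    static List<String> tokensFor(String query) {
        List<String> tokens = tokenize(query, true);
        if (tokens.isEmpty()) {
            // Everything was stop-listed: retokenize without the stop list
            // so the query still produces searchable terms.
            tokens = tokenize(query, false);
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokensFor("from"));            // [from]
        System.out.println(tokensFor("print from file")); // [print, file]
    }
}
```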
-
Comment [4.](https://code.google.com/p/okapi/issues/detail?id=159#c4) originally posted by @ysavourel on 2011-01-25T20:08:37.000Z:
Something very strange is going on with the Lucene n-gram tokenizer - there may be problems beyond just short/noisy strings.
-
Status changed to resolved.
Comment [5.](https://code.google.com/p/okapi/issues/detail?id=159#c5) originally posted by @ysavourel on 2011-01-25T23:16:17.000Z:
Fixed some actual bugs in the tokenizer where strings shorter than the n-gram size got stripped. Also fixed cases like "from" by removing all full-word n-grams from the stop list. The previously commented-out test cases are now enabled and pass.
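The short-string bug can be illustrated with a minimal n-gram generator (a sketch, not the actual tokenizer code): without the guard shown below, a token shorter than the n-gram size produces zero n-grams and is effectively stripped from the index.

```java
import java.util.*;

public class ShortTokenNGrams {
    // 4-grams of a token; the bug was that tokens shorter than `size`
    // yielded no grams at all and so never reached the index.
    static List<String> ngrams(String token, int size) {
        List<String> grams = new ArrayList<>();
        if (token.length() < size) {
            // Fix: keep the short string as its own term instead of stripping it.
            grams.add(token);
            return grams;
        }
        for (int i = 0; i + size <= token.length(); i++) {
            grams.add(token.substring(i, i + size));
        }
        return grams;
    }

    public static void main(String[] args) {
        System.out.println(ngrams("am", 4));     // [am] - no longer stripped
        System.out.println(ngrams("option", 4)); // [opti, ptio, tion]
    }
}
```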
-
Status changed to new.
Comment [6.](https://code.google.com/p/okapi/issues/detail?id=159#c6) originally posted by @ysavourel on 2011-03-25T11:52:17.000Z:
It seems we still have problems with words shorter than 3 characters, like "a", "of", etc. See the report here: http://tech.groups.yahoo.com/group/okapitools/message/1900
I've also added the unit test searchOnNoiseAndVeryShortWords() in PensieveSeekerTest.
-
Comment [7.](https://code.google.com/p/okapi/issues/detail?id=159#c7) originally posted by @ysavourel on 2011-03-25T16:12:20.000Z:
"a", "of", and many other words (actually, it's their 4-grams that are filtered) are filtered out for performance reasons. If we enable these, it will cost us a huge performance hit.
It's time to refactor Pensieve - but I'm not sure when we will have time. I would really like to start fresh with Lucene 4.x, new algorithms, etc.
Here is the list of stop 4-grams - note that "option" would be stop-listed based on the combined n-grams in the list. Perhaps we can weed this list a bit to allow a few more words.
If we can prove that a word is not being indexed/retrieved for reasons other than what's in the stop list, then there is another bug.
Jim
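The "combined n-grams" effect described above can be sketched as follows (the stop-list fragment here is a hypothetical stand-in; the real list is much larger): a word like "option" contributes no index terms when every one of its 4-grams happens to be stop-listed, even though the word itself was never listed.

```java
import java.util.*;

public class CombinedStopNGrams {
    // Hypothetical fragment of the 4-gram stop list, chosen for the example.
    static final Set<String> STOP_4GRAMS = Set.of("opti", "ptio", "tion", "from");

    static List<String> ngrams(String token, int size) {
        List<String> grams = new ArrayList<>();
        for (int i = 0; i + size <= token.length(); i++) {
            grams.add(token.substring(i, i + size));
        }
        return grams;
    }

    // A word contributes no index terms when all of its 4-grams are stop-listed.
    static boolean fullyStopListed(String word) {
        List<String> grams = ngrams(word.toLowerCase(), 4);
        return !grams.isEmpty() && STOP_4GRAMS.containsAll(grams);
    }

    public static void main(String[] args) {
        System.out.println(fullyStopListed("option"));  // true - every 4-gram is stop-listed
        System.out.println(fullyStopListed("options")); // false - "ions" is not in the list
    }
}
```

This is why weeding a few n-grams out of the list, as suggested above, lets words like "option" surface again.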
-
Comment [8.](https://code.google.com/p/okapi/issues/detail?id=159#c8) originally posted by @ysavourel on 2011-03-25T17:33:17.000Z:
I think it's more an issue with queries shorter than 4 characters. For example, "am" is not in the stop list and won't be found. Same for "zq", which is definitely not in the list.
-
Comment [9.](https://code.google.com/p/okapi/issues/detail?id=159#c9) originally posted by @ysavourel on 2011-03-25T18:03:39.000Z:
Hmm, I fixed that problem in the indexer at least - maybe they are getting filtered on the query side. I will check this weekend if I can find a few minutes.
BTW - this won't fix the "option" word not showing up - I will scan over the n-gram table and remove a few n-grams that might prevent words like this from showing up.
As I said, time for a refactor.
Jim
-
Comment [10.](https://code.google.com/p/okapi/issues/detail?id=159#c10) originally posted by @ysavourel on 2011-03-25T19:01:47.000Z:
Thanks. BTW: I think I saw " not " twice in there.
-
Status changed to resolved.
Comment [11.](https://code.google.com/p/okapi/issues/detail?id=159#c11) originally posted by @ysavourel on 2011-04-01T18:52:22.000Z:
Some words will still be filtered, such as " to ", "from", etc. But these are very common words, and stop-listing them makes the TM search much faster.
When we refactor Pensieve, we will come up with a way to count these stop-list words in the TM score.
-
Comment [1.](https://code.google.com/p/okapi/issues/detail?id=159#c1) originally posted by @ysavourel on 2011-01-17T12:22:21.000Z:
I've added a unit test in PensieveSeekerTest. See searchOnNoiseAndShortWords().