Some short entries are not indexed in Pensieve

Issue #159 resolved
Former user created an issue

Original [issue 159](https://code.google.com/p/okapi/issues/detail?id=159) created by @ysavourel on 2011-01-17T12:03:51.000Z:

Entries like "From" are not retrieved when doing search on Pensieve.

See email here: http://tech.groups.yahoo.com/group/okapitools/message/1746

Guess: such entries get weeded out when indexing because they are made of stop words?

Comments (11)

  1. Former user Account Deleted

    Comment [2.](https://code.google.com/p/okapi/issues/detail?id=159#c2) originally posted by @ysavourel on 2011-01-17T13:21:58.000Z:

    Maybe a possible workaround could be to detect in searchFuzzy() that the fQuery variable is empty after tokenization, and call searchExact() or some other better method that would at least give a hit for exact match or case-insensitive match?
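
    The workaround suggested above could look roughly like the sketch below. It is a toy model, not the Pensieve code: only the names `searchFuzzy()`, `searchExact()`, and `fQuery` come from this comment, while the class, the word-level stop list, and the in-memory entries are assumptions made for illustration.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical stand-in for the Pensieve seeker; not the real Okapi API.
    class FuzzyFallbackSketch {
        // Assumed stop list, for illustration only.
        static final Set<String> STOP_WORDS = Set.of("from", "to", "a", "of");

        // Toy translation-memory entries.
        static final List<String> ENTRIES = List.of("From", "Open the file", "Copy from disk");

        // Split on whitespace and drop stop words (stands in for the real stop listing).
        static List<String> tokenize(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!t.isEmpty() && !STOP_WORDS.contains(t)) {
                    tokens.add(t);
                }
            }
            return tokens;
        }

        // Case-insensitive match on the whole entry.
        static List<String> searchExact(String query) {
            List<String> hits = new ArrayList<>();
            for (String entry : ENTRIES) {
                if (entry.equalsIgnoreCase(query.trim())) {
                    hits.add(entry);
                }
            }
            return hits;
        }

        static List<String> searchFuzzy(String query) {
            List<String> fQuery = tokenize(query);
            if (fQuery.isEmpty()) {
                // Every token was stop listed: fall back instead of returning nothing.
                return searchExact(query);
            }
            List<String> hits = new ArrayList<>();
            for (String entry : ENTRIES) {
                List<String> entryTokens = tokenize(entry);
                for (String t : fQuery) {
                    if (entryTokens.contains(t)) {
                        hits.add(entry);
                        break;
                    }
                }
            }
            return hits;
        }
    }
    ```

    With this fallback, a query made only of stop words, such as "From", still returns the entry that matches it exactly, while ordinary queries go through the fuzzy path unchanged.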

  2. Former user Account Deleted
    • changed status to open

    Comment [3.](https://code.google.com/p/okapi/issues/detail?id=159#c3) originally posted by @ysavourel on 2011-01-18T19:16:02.000Z:

    Probably the easiest fix is to detect an empty string after tokenization (where n-grams are stop listed): if we find an empty string, we can retokenize without stop listing. This should be an option; if it is not, it is easy enough to wrap a new tokenizer.

    Will try to take care of this Friday after RCP training. Maybe sooner.
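
    The "wrap a new tokenizer" idea from this comment might be sketched as follows. Everything here is hypothetical: the tokenizers are modelled as plain `Function<String, List<String>>` values working on words rather than on the n-grams Pensieve actually uses, and the stop list is an assumed sample.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;
    import java.util.function.Function;

    // Hypothetical wrapper; the real Pensieve tokenizer classes are not shown in this thread.
    class FallbackTokenizerSketch {
        // Assumed stop list, for illustration only.
        static final Set<String> STOP_LIST = Set.of("from", "to", "a", "of");

        // Tokenizer without stop listing: lower-cased, whitespace-separated words.
        static List<String> plainTokens(String text) {
            List<String> tokens = new ArrayList<>();
            for (String t : text.toLowerCase(Locale.ROOT).trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    tokens.add(t);
                }
            }
            return tokens;
        }

        // Tokenizer with stop listing applied.
        static List<String> stopListedTokens(String text) {
            List<String> tokens = plainTokens(text);
            tokens.removeIf(STOP_LIST::contains);
            return tokens;
        }

        // Wrap two tokenizers: use the fallback only when the primary yields nothing.
        static Function<String, List<String>> withFallback(
                Function<String, List<String>> primary,
                Function<String, List<String>> fallback) {
            return text -> {
                List<String> tokens = primary.apply(text);
                return tokens.isEmpty() ? fallback.apply(text) : tokens;
            };
        }

        static final Function<String, List<String>> TOKENIZER =
                withFallback(FallbackTokenizerSketch::stopListedTokens,
                             FallbackTokenizerSketch::plainTokens);
    }
    ```

    The wrapper keeps the fast, stop-listed path for normal segments and only pays for the unfiltered tokens when a segment would otherwise come out empty.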

  3. Former user Account Deleted

    Comment [7.](https://code.google.com/p/okapi/issues/detail?id=159#c7) originally posted by @ysavourel on 2011-03-25T16:12:20.000Z:

    "a", "of", and many other words are filtered out for performance reasons (strictly speaking, it is 4-grams that are filtered). Enabling them would cost us a huge performance hit.

    It's time to refactor Pensieve, but I'm not sure when we will have time. I would really like to start fresh with Lucene 4.x, new algorithms, etc.

    Here is the list of stop 4-grams - note that "option" would be stop listed based on the combined ngrams in the list. Perhaps we can weed this list a bit to allow a few more words.

    If we can prove that a word is not being indexed/retrieved for a reason other than what's in the stop list, then there is another bug.

    Jim
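
    To illustrate how a word such as "option" can disappear because of the combined n-grams, here is a minimal sketch. It assumes a word is dropped when every one of its 4-grams is stop listed; the stop set in the usage note is made up for the example and is not the actual Pensieve list, and the class and method names are hypothetical.

    ```java
    import java.util.ArrayList;
    import java.util.List;
    import java.util.Locale;
    import java.util.Set;

    // Hypothetical model of 4-gram stop listing; not the Pensieve implementation.
    class StopGramSketch {
        // All character 4-grams of a lower-cased word (none if the word is shorter than 4).
        static List<String> fourGrams(String word) {
            String w = word.toLowerCase(Locale.ROOT);
            List<String> grams = new ArrayList<>();
            for (int i = 0; i + 4 <= w.length(); i++) {
                grams.add(w.substring(i, i + 4));
            }
            return grams;
        }

        // A word survives only if at least one of its 4-grams is not stop listed.
        static boolean isIndexable(String word, Set<String> stopGrams) {
            for (String gram : fourGrams(word)) {
                if (!stopGrams.contains(gram)) {
                    return true;
                }
            }
            return false;
        }
    }
    ```

    Under this model, with an assumed stop set containing "opti", "ptio", and "tion", `isIndexable("option", stopGrams)` is false even though "option" itself is not in the list. Removing any one of those 4-grams from the set makes the word indexable again, which is the kind of weeding described above.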

  4. Former user Account Deleted

    Comment [9.](https://code.google.com/p/okapi/issues/detail?id=159#c9) originally posted by @ysavourel on 2011-03-25T18:03:39.000Z:

    Hum, I fixed that problem in the indexer at least - maybe they are getting filtered on the query. I will check this weekend if I can find a few minutes.

    By the way, this won't fix the "option" word not showing up. I will scan over the n-gram table and remove a few n-grams that might prevent words like this from showing up.

    As I said, it is time for a refactor.

    Jim

  5. Former user Account Deleted

    Comment [11.](https://code.google.com/p/okapi/issues/detail?id=159#c11) originally posted by @ysavourel on 2011-04-01T18:52:22.000Z:

    Some words will still be filtered, such as " to ", "from", etc. But these are very common words, and stop listing them makes the TM search much faster.

    When we refactor Pensieve, we will come up with a way to count these stop-listed words in the TM score.
