Issue #215 resolved

Issue with most_distinctive_terms

Anonymous created an issue

I found this in the code of the distribution (file reading.py line 478):

{{{
#!python
def most_distinctive_terms(self, fieldname, number=5, prefix=None):
    gen = ((terminfo.weight() * (1.0 / terminfo.doc_frequency()), text)
           for text, terminfo in self.iter_prefix(fieldname, prefix))
    return nlargest(number, gen)
}}}

There is a minor mistake here: prefix should default to '' rather than None.

More seriously, this is not the accepted definition of TF-IDF. IDF is the logarithm of the ratio of the total number of documents (self.doc_count()) to the number of documents in which the term appears (terminfo.doc_frequency()), whereas the code above multiplies the term weight by the plain reciprocal of the document frequency. The proper definition is given below.

{{{
#!python
D = float(self.doc_count())
terminfo.weight() * log(D / terminfo.doc_frequency())
}}}
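To illustrate how the two scorings differ, here is a minimal self-contained sketch that does not use whoosh itself. The dictionary TERM_STATS, the constant DOC_COUNT, and the standalone functions are hypothetical stand-ins for the reader's term statistics and self.doc_count(); only the two scoring expressions are taken from the snippets above.

```python
from heapq import nlargest
from math import log

# Hypothetical toy statistics (not from whoosh):
# term -> (total weight in the field, number of documents containing the term)
TERM_STATS = {
    "the": (100.0, 10),
    "whoosh": (6.0, 2),
    "index": (8.0, 4),
}
DOC_COUNT = 10  # stand-in for self.doc_count()


def ratio_scores(term_stats, number=5, prefix=""):
    """Scoring as in the reported code: weight * (1 / doc_frequency)."""
    gen = ((weight * (1.0 / doc_freq), text)
           for text, (weight, doc_freq) in term_stats.items()
           if text.startswith(prefix))
    return nlargest(number, gen)


def tfidf_scores(term_stats, doc_count, number=5, prefix=""):
    """Scoring as proposed in this issue: weight * log(D / doc_frequency)."""
    d = float(doc_count)
    gen = ((weight * log(d / doc_freq), text)
           for text, (weight, doc_freq) in term_stats.items()
           if text.startswith(prefix))
    return nlargest(number, gen)


print(ratio_scores(TERM_STATS))
print(tfidf_scores(TERM_STATS, DOC_COUNT))
```

With these toy numbers the reciprocal scoring ranks "the" first, while the log-based scoring ranks it last with a score of 0, because a term that occurs in every document has log(D / D) = 0.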
