Issue with most_distinctive_terms

Issue #215 resolved
Anonymous created an issue

I found this in the code of the distribution (file line 478):



    def most_distinctive_terms(self, fieldname, number=5, prefix=None):
        gen = ((terminfo.weight() * (1.0 / terminfo.doc_frequency()), text)
               for text, terminfo in self.iter_prefix(fieldname, prefix))
        return nlargest(number, gen)


There is a minor mistake: prefix should be initialised to '' and not None.

More seriously, this is not the accepted definition of TF-IDF. IDF is the log of the ratio of the total number of documents (self.doc_count()) to the number of documents in which the term appears (terminfo.doc_frequency()), whereas the code above multiplies the weight by a plain 1 / doc_frequency. The proper definition is given below.



    D = float(self.doc_count())
    terminfo.weight() * log(D / terminfo.doc_frequency())
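To make the difference concrete, here is a minimal standalone sketch that ranks the same terms under both scoring rules. It does not use the library's reader at all: `term_stats`, `score_inverse_df`, `score_tf_idf` and the toy numbers are all made up for illustration, with each term mapped to a (weight, doc_frequency) pair standing in for what terminfo.weight() and terminfo.doc_frequency() would return.

```python
from heapq import nlargest
from math import log


def score_inverse_df(term_stats, number=5):
    """Rank terms by weight * (1 / doc_frequency), as in the quoted code."""
    gen = ((weight * (1.0 / doc_freq), text)
           for text, (weight, doc_freq) in term_stats.items())
    return nlargest(number, gen)


def score_tf_idf(term_stats, doc_count, number=5):
    """Rank terms by weight * log(D / doc_frequency), the proposed fix."""
    d = float(doc_count)
    gen = ((weight * log(d / doc_freq), text)
           for text, (weight, doc_freq) in term_stats.items())
    return nlargest(number, gen)


# Hypothetical statistics: term -> (total weight, number of documents containing it)
term_stats = {
    "the": (200.0, 100),
    "whoosh": (12.0, 3),
    "index": (40.0, 20),
}

by_inverse_df = score_inverse_df(term_stats)
by_tf_idf = score_tf_idf(term_stats, doc_count=100)

print([text for _, text in by_inverse_df])
print([text for _, text in by_tf_idf])
```

With these toy numbers the two rules order the terms differently, and under the log form a term that appears in every document ("the" here) gets a score of exactly zero, whereas the 1 / doc_frequency form still gives it a positive score.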
