Issue with most_distinctive_terms
I found this in the code of the distribution (file reading.py line 478):
    def most_distinctive_terms(self, fieldname, number=5, prefix=None):
        gen = ((terminfo.weight() * (1.0 / terminfo.doc_frequency()), text)
               for text, terminfo in self.iter_prefix(fieldname, prefix))
        return nlargest(number, gen)
There is a minor mistake here: prefix should default to '' rather than None.
More seriously, this is not the accepted definition of TF-IDF. IDF is the log of the ratio of the total number of documents (self.doc_count()) to the number of documents in which the term appears (terminfo.doc_frequency()), whereas this code multiplies the weight by the plain reciprocal 1.0 / terminfo.doc_frequency(). The score should instead be computed as:
    D = float(self.doc_count())
    terminfo.weight() * log(D / terminfo.doc_frequency())
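To show why this matters, here is a minimal standalone sketch that compares the two scoring rules. It does not use Whoosh at all: the term statistics, the document count, and the helper names (reciprocal_score, tfidf_score, top_terms) are made up for illustration, and only the two formulas come from the snippets above.

```python
from heapq import nlargest
from math import log

# Hypothetical statistics: term -> (total weight in the field, document frequency)
STATS = {
    "whoosh": (12.0, 3),
    "index": (40.0, 50),
    "the": (500.0, 100),
    "zyzzyva": (1.0, 1),
}
DOC_COUNT = 100  # hypothetical total number of documents


def reciprocal_score(weight, doc_freq):
    # Scoring quoted from reading.py: weight times 1/df
    return weight * (1.0 / doc_freq)


def tfidf_score(weight, doc_freq, doc_count=DOC_COUNT):
    # Proposed scoring: weight times log(N / df)
    return weight * log(float(doc_count) / doc_freq)


def top_terms(score, number=5):
    # Same shape as most_distinctive_terms: (score, text) pairs fed to nlargest
    gen = ((score(w, df), text) for text, (w, df) in STATS.items())
    return nlargest(number, gen)


current = top_terms(reciprocal_score, number=2)
proposed = top_terms(tfidf_score, number=2)

print([text for _, text in current])
print([text for _, text in proposed])
```

With these made-up numbers, a term that occurs in every document ("the") ranks first under the reciprocal rule but scores exactly zero under the log-based rule, since log(D / D) = 0.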