The main file is LingusticProcessor.py. It handles all the questions of the
first part of the exercise. Each of the other files corresponds to one
question. main.py executes, in order, all the questions that have been
implemented so far.
* The check_tokenization file is for tests.
* In order to run main.py, the texts must be placed under a folder named
wikipedia in the same directory as the source code.
* The output of each step is written to a corresponding file, recognizable by
its extension, such as .tokenized, .analyzed, etc.
* CountLemma is used in two cases: it can count the lemmata in a single file
and build its inverted index, or create the inverted index for a collection
of files. The final formats of the two are slightly (or not so slightly)
different.
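The collection-level inverted index described above can be sketched as follows (this is an illustrative outline, not the actual CountLemma implementation): each lemma is mapped to the list of documents in which it occurs, alongside a simple per-file lemma count.

```python
from collections import Counter, defaultdict

def count_lemmata(lemmas):
    # Count how often each lemma occurs in one file
    return Counter(lemmas)

def build_inverted_index(lemma_lists):
    # Map each lemma to the sorted list of document ids containing it
    index = defaultdict(set)
    for doc_id, lemmas in enumerate(lemma_lists):
        for lemma in lemmas:
            index[lemma].add(doc_id)
    return {lemma: sorted(ids) for lemma, ids in index.items()}

docs = [["cat", "sat"], ["cat", "ran"]]
index = build_inverted_index(docs)
# index == {'cat': [0, 1], 'sat': [0], 'ran': [1]}
```

The single-file and collection variants naturally produce different shapes: a plain count per lemma in the first case, a lemma-to-documents mapping in the second.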