
Training Classifiers

Example usage with the movie_reviews corpus can be found in Training Binary Text Classifiers with NLTK Trainer.

Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances:
python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews
Include bigrams as features:
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews
Minimum score threshold:
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews
Maximum number of features:
python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews
Use the default Maxent algorithm:
python train_classifier.py --instances paras --classifier Maxent movie_reviews
Use the MEGAM Maxent algorithm:
python train_classifier.py --instances paras --classifier MEGAM movie_reviews
Train on files instead of paragraphs:
python train_classifier.py --instances files --classifier MEGAM movie_reviews
Train on sentences:
python train_classifier.py --instances sents --classifier MEGAM movie_reviews
Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling:
python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews
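The idea behind --fraction can be sketched in a few lines of plain Python. This is only an illustration of a 0.75 train/test split, not nltk-trainer's actual implementation, which may split differently (for example per label). The helper name split_instances is hypothetical.

```python
def split_instances(instances, fraction):
    """Split a list into (train, test), keeping `fraction` of the items for training."""
    if not 0.0 < fraction < 1.0:
        raise ValueError("fraction must be between 0 and 1 (exclusive)")
    cutoff = int(len(instances) * fraction)
    return instances[:cutoff], instances[cutoff:]


# Eight stand-in paragraphs; a real run would use paragraphs from the corpus.
paragraphs = ["paragraph %d" % i for i in range(8)]
train, test = split_instances(paragraphs, 0.75)
print(len(train), len(test))
```

The test portion is never seen during training, so accuracy measured on it is a fairer estimate than accuracy measured on the training instances.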

The following classifiers are available:

If scikit-learn is installed, the following classifiers are also available, with sklearn-specific training options. If there is a sklearn classifier or training option you want that is not present, please submit an issue.

For a complete list of usage options:
python train_classifier.py --help

Using a Trained Classifier

You can use a trained classifier by loading the pickle file using nltk.data.load:
>>> import nltk.data
>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
Or, if your classifier pickle file is not in an nltk_data subdirectory, you can load it with pickle.load (open the file in binary mode):
>>> import pickle
>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))

Either method will return an object that supports the ClassifierI interface.
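When loading with pickle directly, a context manager closes the file for you, and Python 3 requires the file to be opened in binary mode. The following is a minimal round-trip sketch that uses a plain dictionary as a stand-in for a trained classifier and a temporary directory in place of a real path; both are assumptions made for illustration.

```python
import os
import pickle
import tempfile

# Stand-in for a trained classifier; any picklable object behaves the same way.
model = {"labels": ["neg", "pos"]}

# Temporary location used only for this example.
path = os.path.join(tempfile.mkdtemp(), "NAME_OF_CLASSIFIER.pickle")

# Write the object in binary mode ...
with open(path, "wb") as f:
    pickle.dump(model, f)

# ... and read it back in binary mode.
with open(path, "rb") as f:
    restored = pickle.load(f)

print(restored == model)
```

Only unpickle files you trust, since loading a pickle can execute arbitrary code.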

Once you have a classifier object, you can use it to classify word features with the classifier.classify(feats) method, which returns a label:
>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> feats = dict([(word, True) for word in words])
>>> classifier.classify(feats)
If you used the --ngrams option with values greater than 1, you should include these ngrams in the dictionary using nltk.util.ngrams(words, n):
>>> from nltk.util import ngrams
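>>> words = ['some', 'words', 'in', 'a', 'sentence']
>>> n = 2  # the value passed to --ngrams
>>> # ngrams() returns an iterator in NLTK 3, so convert it to a list before concatenating
>>> feats = dict([(word, True) for word in words + list(ngrams(words, n))])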
>>> classifier.classify(feats)

The list of words you use for creating the feature dictionary should be created by tokenizing the appropriate text instances: sentences, paragraphs, or files depending on the --instances option.
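As a rough sketch of that pipeline, the helper below turns raw text into a feature dictionary of words plus optional n-gram tuples. It uses only the standard library: the regex tokenizer and the names word_ngrams and bag_of_words are assumptions for illustration, not part of nltk-trainer. In practice you would tokenize the same way the training corpus was tokenized.

```python
import re


def word_ngrams(words, n):
    """Return all contiguous n-word tuples from a list of words."""
    return list(zip(*(words[i:] for i in range(n))))


def bag_of_words(text, ngram_sizes=(1,)):
    """Build a {feature: True} dict from text for the given n-gram sizes."""
    # Crude stand-in tokenizer; match the tokenization used at training time.
    words = re.findall(r"\w+", text)
    feats = {}
    for n in ngram_sizes:
        if n == 1:
            # Single words are stored as plain strings.
            feats.update((word, True) for word in words)
        else:
            # Longer n-grams are stored as tuples of words.
            feats.update((gram, True) for gram in word_ngrams(words, n))
    return feats


feats = bag_of_words("some words in a sentence", ngram_sizes=(1, 2))
print(sorted(feats, key=str))
```

The resulting dictionary can be passed to classifier.classify(feats), provided ngram_sizes matches the --ngrams values used for training.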
