
Training Part of Speech Taggers

The train_tagger.py script can use any corpus included with NLTK that implements a tagged_sents() method. It can also train on the timit corpus, which includes tagged sentences that are not available through the TimitCorpusReader.

Example usage can be found in Training Part of Speech Taggers with NLTK Trainer.

Train the default sequential backoff tagger on the treebank corpus::

    python train_tagger.py treebank

To use a Brill tagger with the default initial tagger::

    python train_tagger.py treebank --brill

To train a NaiveBayes classifier-based tagger, without a sequential backoff tagger::

    python train_tagger.py treebank --sequential '' --classifier NaiveBayes

To train a unigram tagger::

    python train_tagger.py treebank --sequential u

To train on the switchboard corpus::

    python train_tagger.py switchboard

To train on a custom corpus, whose fileids end in ".pos", using a TaggedCorpusReader::

    python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'
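
The ``--fileids`` value in the custom corpus command is a regular expression. As a rough sketch of which names it would select, the snippet below applies the same pattern with Python's ``re`` module. It assumes the pattern has to match the whole file id, and the file names are hypothetical, chosen only for illustration.

```python
import re

# Pattern from the custom corpus command; assumed here to be matched
# against the whole file id rather than a substring.
pattern = re.compile(r".+\.pos")

# Hypothetical file names, for illustration only.
candidates = ["news.pos", "blog/post1.pos", "README", "notes.pos.bak", ".pos"]

# Keep only the names that the pattern matches in full.
selected = [name for name in candidates if pattern.fullmatch(name)]
print(selected)
```

Under that assumption, only names with at least one character before a final ``.pos`` are kept, so ``README``, ``notes.pos.bak`` and the bare ``.pos`` are skipped.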

The corpus path can be absolute, or relative to an nltk_data directory. For example, both corpora/treebank/tagged and /usr/share/nltk_data/corpora/treebank/tagged will work.

You can also restrict the files used with the --fileids option::

    python train_tagger.py conll2000 --fileids train.txt

For a complete list of usage options::

    python train_tagger.py --help

Using a Trained Tagger

You can use a trained tagger by loading the pickle file using nltk.data.load::

    >>> import nltk.data
    >>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")

Or, if your tagger pickle file is not in an nltk_data subdirectory, you can load it with pickle.load, opening the file in binary mode::

    >>> import pickle
    >>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle", "rb"))

Either method will return an object that supports the TaggerI interface.
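
When loading with pickle directly, a ``with`` block makes sure the file handle is closed after reading. The following is a minimal sketch; ``load_tagger`` is a hypothetical helper name used for illustration and is not part of NLTK or nltk-trainer.

```python
import pickle


def load_tagger(path):
    """Load a pickled object from ``path`` (hypothetical helper, not part of NLTK)."""
    # Pickle files are binary, so open in "rb" mode; the with-block
    # closes the file handle even if unpickling fails.
    with open(path, "rb") as f:
        return pickle.load(f)
```

It would be called as ``tagger = load_tagger("/path/to/NAME_OF_TAGGER.pickle")``.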

Once you have a tagger object, you can use it to tag sentences (or lists of words) with the tagger.tag(words) method::

    >>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])

tagger.tag(words) will return a list of 2-tuples of the form [(word, tag)].
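
Because the result is a plain list of (word, tag) pairs, it can be processed with ordinary Python. The sketch below uses a hand-written list in that shape; the tags are made-up sample values in the style of the Penn Treebank tagset, not the output of any particular trained tagger.

```python
# Illustrative data only: a list shaped like the return value of
# tagger.tag(words), with made-up tags (not real tagger output).
tagged = [
    ("some", "DT"),
    ("words", "NNS"),
    ("in", "IN"),
    ("a", "DT"),
    ("sentence", "NN"),
]

# Split the pairs back into parallel tuples of words and tags.
words, tags = zip(*tagged)

# Keep only the words whose tag starts with "NN"
# (noun tags, assuming a Penn Treebank style tagset).
nouns = [word for word, tag in tagged if tag.startswith("NN")]

print(words)
print(tags)
print(nouns)
```

The tag names a real tagger produces depend on the corpus it was trained on, so the ``"NN"`` prefix check may need adjusting for other tagsets.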
