Commits

Jacob Perkins committed 3399078

cleanup readme

Files changed (1)

 Requirements
 ------------
 
-You must have Python 2.6 with `argparse <http://pypi.python.org/pypi/argparse/>`_ and `NLTK <http://www.nltk.org/>`_ 2.0 installed. `NumPy <http://numpy.scipy.org/>`_, `SciPy <http://www.scipy.org/>`_, and `megam <http://www.cs.utah.edu/~hal/megam/>`_ are recommended for training Maxent classifiers.
+You must have Python 2.6 with `argparse <http://pypi.python.org/pypi/argparse/>`_ and `NLTK <http://www.nltk.org/>`_ 2.0 installed. `NumPy <http://numpy.scipy.org/>`_, `SciPy <http://www.scipy.org/>`_, and `megam <http://www.cs.utah.edu/~hal/megam/>`_ are recommended for training Maxent classifiers. To use the sklearn classifiers, you must also install `scikit-learn <http://scikit-learn.org/stable/>`_.
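+
+The Python dependencies can usually be installed with ``pip`` (megam must be installed separately), for example::
+	``pip install argparse nltk numpy scipy scikit-learn``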
 
 
-Training Classifiers
---------------------
+Documentation
+-------------
 
-Example usage with the movie_reviews corpus can be found in `Training Binary Text Classifiers with NLTK Trainer <http://streamhacker.com/2010/10/25/training-binary-text-classifiers-nltk-trainer/>`_.
-
-Train a binary NaiveBayes classifier on the movie_reviews corpus, using paragraphs as the training instances::
-	``python train_classifier.py --instances paras --classifier NaiveBayes movie_reviews``
-
-Include bigrams as features::
-	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 movie_reviews``
-
-Minimum score threshold::
-	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --min_score 3 movie_reviews``
-
-Maximum number of features::
-	``python train_classifier.py --instances paras --classifier NaiveBayes --ngrams 1 --ngrams 2 --max_feats 1000 movie_reviews``
-
-Use the default Maxent algorithm::
-	``python train_classifier.py --instances paras --classifier Maxent movie_reviews``
-
-Use the MEGAM Maxent algorithm::
-	``python train_classifier.py --instances paras --classifier MEGAM movie_reviews``
-
-Train on files instead of paragraphs::
-	``python train_classifier.py --instances files --classifier MEGAM movie_reviews``
-
-Train on sentences::
-	``python train_classifier.py --instances sents --classifier MEGAM movie_reviews``
-
-Evaluate the classifier by training on 3/4 of the paragraphs and testing against the remaining 1/4, without pickling::
-	``python train_classifier.py --instances paras --classifier NaiveBayes --fraction 0.75 --no-pickle movie_reviews``
-
-For a complete list of usage options::
-	``python train_classifier.py --help``
-
-
-Using a Trained Classifier
---------------------------
-
-You can use a trained classifier by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::
-	>>> import nltk.data
-	>>> classifier = nltk.data.load("classifiers/NAME_OF_CLASSIFIER.pickle")
-
-Or if your classifier pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::
-	>>> import pickle
-	>>> classifier = pickle.load(open("/path/to/NAME_OF_CLASSIFIER.pickle", "rb"))
-
-Either method will return an object that supports the `ClassifierI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.classify.api.ClassifierI-class.html>`_. 
-
-Once you have a ``classifier`` object, you can use it to classify word features with the ``classifier.classify(feats)`` method, which returns a label::
-	>>> words = ['some', 'words', 'in', 'a', 'sentence']
-	>>> feats = dict([(word, True) for word in words])
-	>>> classifier.classify(feats)
-
-If you used the ``--ngrams`` option with values greater than 1, you should include these ngrams in the dictionary using `nltk.util.ngrams(words, n) <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.util-module.html#ngrams>`_::
-	>>> from nltk.util import ngrams
-	>>> words = ['some', 'words', 'in', 'a', 'sentence']
-	>>> feats = dict([(word, True) for word in words + ngrams(words, n)])
-	>>> classifier.classify(feats)
-
-The list of words you use for creating the feature dictionary should be created by `tokenizing <http://text-processing.com/demo/tokenize/>`_ the appropriate text instances: sentences, paragraphs, or files depending on the ``--instances`` option.
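-
-For example, a minimal sketch (assuming the classifier was trained with ``--instances sents`` and unigram features only) would tokenize a sentence with ``nltk.word_tokenize`` before building the feature dictionary::
-	>>> from nltk import word_tokenize
-	>>> words = word_tokenize("the movie was great")
-	>>> feats = dict([(word, True) for word in words])
-	>>> classifier.classify(feats)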
-
-
-Training Part of Speech Taggers
--------------------------------
-
-The ``train_tagger.py`` script can use any corpus included with NLTK that implements a ``tagged_sents()`` method. It can also train on the ``timit`` corpus, which includes tagged sentences that are not available through the ``TimitCorpusReader``.
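-
-As a quick check, any corpus that exposes ``tagged_sents()`` will work; for example, the treebank corpus::
-	>>> from nltk.corpus import treebank
-	>>> treebank.tagged_sents()[0]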
-
-Example usage can be found in `Training Part of Speech Taggers with NLTK Trainer <http://streamhacker.com/2011/03/21/training-part-speech-taggers-nltk-trainer/>`_.
-
-Train the default sequential backoff tagger on the treebank corpus::
-	``python train_tagger.py treebank``
-
-To use a brill tagger with the default initial tagger::
-	``python train_tagger.py treebank --brill``
-
-To train a NaiveBayes classifier based tagger, without a sequential backoff tagger::
-	``python train_tagger.py treebank --sequential '' --classifier NaiveBayes``
-
-To train a unigram tagger::
-	``python train_tagger.py treebank --sequential u``
-
-To train on the switchboard corpus::
-	``python train_tagger.py switchboard``
-
-To train on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
-	``python train_tagger.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``
-
-The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
-
-You can also restrict the files used with the ``--fileids`` option::
-	``python train_tagger.py conll2000 --fileids train.txt``
-
-For a complete list of usage options::
-	``python train_tagger.py --help``
-
-
-Using a Trained Tagger
-----------------------
-
-You can use a trained tagger by loading the pickle file using `nltk.data.load <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.data-module.html#load>`_::
-	>>> import nltk.data
-	>>> tagger = nltk.data.load("taggers/NAME_OF_TAGGER.pickle")
-
-Or if your tagger pickle file is not in a ``nltk_data`` subdirectory, you can load it with `pickle.load <http://docs.python.org/library/pickle.html#pickle.load>`_::
-	>>> import pickle
-	>>> tagger = pickle.load(open("/path/to/NAME_OF_TAGGER.pickle", "rb"))
-
-Either method will return an object that supports the `TaggerI interface <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.tag.api.TaggerI-class.html>`_.
-
-Once you have a ``tagger`` object, you can use it to tag sentences (or lists of words) with the ``tagger.tag(words)`` method::
-	>>> tagger.tag(['some', 'words', 'in', 'a', 'sentence'])
-
-``tagger.tag(words)`` will return a list of 2-tuples of the form ``[(word, tag)]``.
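-
-For example, you can iterate over the result to print each word with its tag (the actual tags depend on the tagger and its training corpus)::
-	>>> for word, tag in tagger.tag(['some', 'words', 'in', 'a', 'sentence']):
-	...     print word, tag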
-
-
-Analyzing Tagger Coverage
--------------------------
-
-The ``analyze_tagger_coverage.py`` script will run a part-of-speech tagger on a corpus to determine how many times each tag is found. Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
-
-Here's an example using the NLTK default tagger on the treebank corpus::
-	``python analyze_tagger_coverage.py treebank``
-
-To get detailed metrics on each tag, you can use the ``--metrics`` option. This requires using a tagged corpus in order to compare actual tags against tags found by the tagger. See `NLTK Default Tagger Treebank Tag Coverage <http://streamhacker.com/2011/01/24/nltk-default-tagger-treebank-tag-coverage/>`_ and `NLTK Default Tagger CoNLL2000 Tag Coverage <http://streamhacker.com/2011/01/25/nltk-default-tagger-conll2000-tag-coverage/>`_ for examples and statistics.
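-
-For example, on the tagged treebank corpus::
-	``python analyze_tagger_coverage.py treebank --metrics``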
-
-To analyze the coverage of a different tagger, use the ``--tagger`` option with a path to the pickled tagger::
-	``python analyze_tagger_coverage.py treebank --tagger /path/to/tagger.pickle``
-
-To analyze coverage on a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
-	``python analyze_tagger_coverage.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``
-
-The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
-
-For a complete list of usage options::
-	``python analyze_tagger_coverage.py --help``
-
-
-Analyzing a Tagged Corpus
--------------------------
-
-The ``analyze_tagged_corpus.py`` script will show the following statistics about a tagged corpus:
-
- * total number of words
- * number of unique words
- * number of tags
- * the number of times each tag occurs
-
-Example output can be found in `Analyzing Tagged Corpora and NLTK Part of Speech Taggers <http://streamhacker.com/2011/03/23/analyzing-tagged-corpora-nltk-part-speech-taggers/>`_.
-
-To analyze the treebank corpus::
-	``python analyze_tagged_corpus.py treebank``
-
-To sort the output by tag count from highest to lowest::
-	``python analyze_tagged_corpus.py treebank --sort count --reverse``
-
-To see simplified tags, instead of standard tags::
-	``python analyze_tagged_corpus.py treebank --simplify_tags``
-
-To analyze a custom corpus, whose fileids end in ".pos", using a `TaggedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.tagged.TaggedCorpusReader-class.html>`_::
-	``python analyze_tagged_corpus.py /path/to/corpus --reader nltk.corpus.reader.tagged.TaggedCorpusReader --fileids '.+\.pos'``
-
-The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
-
-For a complete list of usage options::
-	``python analyze_tagged_corpus.py --help``
-
-
-Training IOB Chunkers
----------------------
-
-The ``train_chunker.py`` script can use any corpus included with NLTK that implements a ``chunked_sents()`` method.
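-
-As a quick check, any corpus that exposes ``chunked_sents()`` will work; for example, the treebank_chunk corpus::
-	>>> from nltk.corpus import treebank_chunk
-	>>> treebank_chunk.chunked_sents()[0]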
-
-Train the default sequential backoff tagger based chunker on the treebank_chunk corpus::
-	``python train_chunker.py treebank_chunk``
-
-To train a NaiveBayes classifier based chunker::
-	``python train_chunker.py treebank_chunk --classifier NaiveBayes``
-
-To train on the conll2000 corpus::
-	``python train_chunker.py conll2000``
-
-To train on a custom corpus, whose fileids end in ".pos", using a `ChunkedCorpusReader <http://nltk.googlecode.com/svn/trunk/doc/api/nltk.corpus.reader.chunked.ChunkedCorpusReader-class.html>`_::
-	``python train_chunker.py /path/to/corpus --reader nltk.corpus.reader.chunked.ChunkedCorpusReader --fileids '.+\.pos'``
-
-The corpus path can be absolute, or relative to a nltk_data directory. For example, both ``corpora/treebank/tagged`` and ``/usr/share/nltk_data/corpora/treebank/tagged`` will work.
-
-You can also restrict the files used with the ``--fileids`` option::
-	``python train_chunker.py conll2000 --fileids train.txt``
-
-For a complete list of usage options::
-	``python train_chunker.py --help``
+Documentation can be found at `nltk-trainer.readthedocs.org <http://nltk-trainer.readthedocs.org/en/latest/>`_ (you can also find these documents in the `docs directory <https://github.com/japerk/nltk-trainer/tree/master/docs>`_). Every script also provides a ``--help`` option that describes all available parameters.
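+
+For example, to see all options for the classifier training script::
+	``python train_classifier.py --help``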