Commits

David McClosky committed 3b9cf0e

New Python module release: RerankingParser updated, ModelFetcher added
python/bllipparser/RerankingParser.py: Various improvements.
Sentence class now ensures tokens are strings to reduce crashing.
RerankingParser class:
- load_unified_model_dir() renamed to from_unified_model_dir();
the old method name is deprecated but still works (with a warning).
- Parser options can now be changed with the set_parser_options()
method.
- parse() and parse_tagged() both default to a new rerank
mode, rerank='auto', which will only rerank if a reranker model
is available.
- parse_tagged() now throws a ValueError if you provide an invalid
POS tag (instead of segfaulting).
- check_loaded_models() renamed to _check_loaded_models() since it's
not intended for users.
- added get_unified_model_parameters() helper function which
provides paths to parser and reranker model files.
python/bllipparser/ModelFetcher.py: new Python module which downloads
and installs BLLIP unified parsing models. Can be used via command
line or Python library.
python/bllipparser/ParsingShell.py: can now be launched without parsing
models
README-python.txt: docs, examples updated. Now covers ModelFetcher
and ParsingShell (the latter was previously distributed but not
mentioned).
setup.py: updated with latest release information
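
The new rerank='auto' mode described above resolves to a concrete boolean
depending on whether a reranker model is available. A minimal standalone
sketch of that resolution logic (mirroring _check_loaded_models(); the
reranker_loaded flag here is a stand-in for the real loaded-model check):

```python
def resolve_rerank(rerank, reranker_loaded):
    """Resolve a rerank mode (True, False, or 'auto') to a boolean.

    Mirrors the behavior described for _check_loaded_models():
    rerank=True requires a loaded reranker, while rerank='auto'
    uses the reranker only if one happens to be available.
    """
    if rerank is True and not reranker_loaded:
        raise ValueError("Reranker model has not been loaded.")
    if rerank == 'auto':
        # only rerank when a reranker model is available
        return reranker_loaded
    return rerank
```

With this resolution, parse() and parse_tagged() can share one code path:
they rerank exactly when the resolved value is True.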


Files changed (5)

README-python.txt

 The BLLIP parser (also known as the Charniak-Johnson parser or
 Brown Reranking Parser) is described in the paper `Charniak
 and Johnson (Association for Computational Linguistics, 2005)
-<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_.  This code
-provides a Python interface to the parser. Note that it does
-not contain any parsing models which must be downloaded
-separately (for example, `WSJ self-trained parsing model
-<http://cs.brown.edu/~dmcc/selftraining/selftrained.tar.gz>`_).
+<http://aclweb.org/anthology/P/P05/P05-1022.pdf>`_.  This package provides
+the BLLIP parser runtime along with a Python interface. Note that it
+does not come with any parsing models but includes a downloader.
 The primary maintenance for the parser takes place at `GitHub
 <http://github.com/BLLIP/bllip-parser>`_.
 
+Fetching parsing models
+-----------------------
+
+Before you can parse, you'll need some parsing models.  ``ModelFetcher``
+will help you download and install parsing models.  It can be invoked
+from the command line. For example, this will download and install the
+standard WSJ model::
+
+    shell% python -mbllipparser.ModelFetcher -i WSJ
+
+Run ``python -mbllipparser.ModelFetcher`` with no arguments for a full
+listing of options and available parsing models. It can also be invoked
+as a Python library::
+
+    >>> from bllipparser.ModelFetcher import download_and_install_model
+    >>> download_and_install_model('WSJ', '/tmp/models')
+    /tmp/models/WSJ
+
+In this case, it would download WSJ and install it to
+``/tmp/models/WSJ``. Note that it returns the path to the downloaded
+model.
+
 Basic usage
 -----------
 
 The easiest way to construct a parser is with the
-``load_unified_model_dir`` class method. A unified model is a directory
+``from_unified_model_dir`` class method. A unified model is a directory
 that contains two subdirectories: ``parser/`` and ``reranker/``, each
 with the respective model files::
 
     >>> from bllipparser import RerankingParser, tokenize
-    >>> rrp = RerankingParser.load_unified_model_dir('/path/to/model/')
+    >>> rrp = RerankingParser.from_unified_model_dir('/path/to/model/')
+
+This can be integrated with ModelFetcher (if the model is already
+installed, ``download_and_install_model`` is a no-op)::
+
+    >>> model_dir = download_and_install_model('WSJ', '/tmp/models')
+    >>> rrp = RerankingParser.from_unified_model_dir(model_dir)
+
+You can also load parser and reranker models manually::
+
+    >>> rrp = RerankingParser()
+    >>> rrp.load_parser_model('/tmp/models/WSJ/parser')
+    >>> rrp.load_reranker_model('/tmp/models/WSJ/reranker')
 
 Parsing a single sentence and reading information about the top parse
 with ``parse()``. The parser produces an *n-best list* of the *n* most
 
     >>> nbest_list = rrp.parse('Parser only!', rerank=False)
 
-Parsing text with existing POS tag (soft) constraints. In this example,
-token 0 ('Time') should have tag VB and token 1 ('flies') should have
-tag NNS::
+You can also parse text with existing POS tags (these act as soft
+constraints). In this example, token 0 ('Time') should have tag VB and
+token 1 ('flies') should have tag NNS::
 
     >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB', 1 : 'NNS'})[0]
     ScoredParse('(S1 (NP (VB Time) (NNS flies)))', parser_score=-53.94938875760073, reranker_score=-15.841407102717749)
 
-You don't need to specify a tag for all words: token 0 ('Time') should
+You don't need to specify a tag for all words. Here, token 0 ('Time')
+should have tag VB and token 1 ('flies') is unconstrained::
 
     >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : 'VB'})[0]
     ScoredParse('(S1 (S (VP (VB Time) (NP (VBZ flies)))))', parser_score=-54.390430751112156, reranker_score=-17.290145080887005)
 
-You can specify multiple tags for each token: token 0 ('Time') should
-have tag VB, JJ, or NN and token 1 ('flies') is unconstrained::
+You can specify multiple tags for each token. When you do this, the
+tags for a token will be used in decreasing priority. Token 0 ('Time')
+should have tag VB, JJ, or NN and token 1 ('flies') is unconstrained::
 
     >>> rrp.parse_tagged(['Time', 'flies'], possible_tags={0 : ['VB', 'JJ', 'NN']})[0]
     ScoredParse('(S1 (NP (NN Time) (VBZ flies)))', parser_score=-42.82904107213723, reranker_score=-12.865900776775314)
 
+There are many parser options which can be adjusted with
+``set_parser_options`` (though the defaults should work well for
+most cases). The following call changes the size of the n-best list
+and picks the defaults for all other options. It returns a dictionary
+of the current options::
+
+    >>> rrp.set_parser_options(nbest=10)
+    {'language': 'En', 'case_insensitive': False, 'debug': 0, 'small_corpus': True, 'overparsing': 21, 'smooth_pos': 0, 'nbest': 10}
+    >>> nbest_list = rrp.parse('The list is smaller now.', rerank=False)
+    >>> len(nbest_list)
+    10
+
 Use this if all you want is a tokenizer::
 
     >>> tokenize("Tokenize this sentence, please.")
     ['Tokenize', 'this', 'sentence', ',', 'please', '.']
+
+Parsing shell
+-------------
+
+There is an interactive shell which can help visualize a parse::
+
+    shell% python -mbllipparser.ParsingShell /path/to/model
+
+Once in the shell, type a sentence to have the parser parse it::
+
+    rrp> I saw the astronomer with the telescope.
+    Tokens: I saw the astronomer with the telescope .
+
+    Parser's parse:
+    (S1 (S (NP (PRP I))
+         (VP (VBD saw)
+          (NP (NP (DT the) (NN astronomer))
+           (PP (IN with) (NP (DT the) (NN telescope)))))
+         (. .)))
+
+    Reranker's parse: (parser index 2)
+    (S1 (S (NP (PRP I))
+         (VP (VBD saw)
+          (NP (DT the) (NN astronomer))
+          (PP (IN with) (NP (DT the) (NN telescope))))
+         (. .)))
+
+If you have ``nltk`` installed, you can use its tree visualization to
+see the output::
+
+    rrp> visual Show me this parse.
+    Tokens: Show me this parse .
+
+    [graphical display of the parse appears]
+
+There is more detailed help inside the shell under the ``help`` command.

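As noted above, ``download_and_install_model`` is a no-op when the model is
already installed; that caching behavior reduces to an existence check on the
target directory. A simplified sketch (the ``install_path_if_cached`` helper
is hypothetical and for illustration only; the real function also downloads
and extracts the model when the directory is missing):

```python
from os.path import exists, join

def install_path_if_cached(model_name, target_directory):
    """Return the model's install path if it already exists, else None.

    Sketch of the early-return check in download_and_install_model():
    an existing output directory means the model is treated as already
    installed and no download happens.
    """
    output_path = join(target_directory, model_name)
    if exists(output_path):
        # already installed -- nothing to do
        return output_path
    return None
```

This is why the README example can call ``download_and_install_model`` and
``from_unified_model_dir`` back to back without re-downloading on every run.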
python/bllipparser/ModelFetcher.py

+#!/usr/bin/env python
+# Licensed under the Apache License, Version 2.0 (the "License"); you may
+# not use this file except in compliance with the License.  You may obtain
+# a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
+# WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.  See the
+# License for the specific language governing permissions and limitations
+# under the License.
+
+"""Simple BLLIP Parser unified parsing model repository and installer."""
+from __future__ import division
+import sys, urlparse, urllib
+from os import makedirs, system, chdir, getcwd
+from os.path import basename, exists, join
+
+class ModelInfo:
+    def __init__(self, model_desc, url, uncompressed_size='unknown '):
+        """uncompressed_size is approximate size in megabytes."""
+        self.model_desc = model_desc
+        self.url = url
+        self.uncompressed_size = uncompressed_size
+    def __str__(self):
+        return "%s [%sMB]" % (self.model_desc, self.uncompressed_size)
+
+# should this grow large enough, we'll find a better place to store it
+models = {
+    'OntoNotes-WSJ' : ModelInfo('OntoNotes portion of WSJ', 'http://nlp.stanford.edu/~mcclosky/models/BLLIP-OntoNotes-WSJ.tar.bz2', 61),
+    'SANCL2012-Uniform' : ModelInfo('Self-trained model on OntoNotes-WSJ and the Google Web Treebank',
+                                    'http://nlp.stanford.edu/~mcclosky/models/BLLIP-SANCL2012-Uniform.tar.bz2', 890),
+    'WSJ+Gigaword' : ModelInfo('Self-trained model on PTB2-WSJ and approx. two million sentences from Gigaword',
+                               'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-Gigaword2000.tar.bz2', 473),
+    'WSJ+PubMed' : ModelInfo('Self-trained model on PTB2-WSJ and approx. 200k sentences from PubMed',
+                             'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-PubMed.tar.bz2', 152),
+    'WSJ' : ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2',
+                      'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-no-AUX.tar.bz2', 52),
+    'WSJ-with-AUX' : ModelInfo('Wall Street Journal corpus from Penn Treebank, version 2 (AUXified version, deprecated)',
+                               'http://nlp.stanford.edu/~mcclosky/models/BLLIP-WSJ-with-AUX.tar.bz2', 55),
+}
+
+class UnknownParserModel(ValueError):
+    def __str__(self):
+        return "Unknown parser model name: " + self.args[0]
+
+def download_and_install_model(model_name, target_directory, verbose=False):
+    """Downloads and installs models to a specific directory. Models
+    can be specified by simple names (use list_models() for a list
+    of known models) or a URL. If the model is already installed in
+    target_directory, it won't download it again.  Returns the path to
+    the new model."""
+
+    if model_name.lower().startswith('http'):
+        parsed_url = urlparse.urlparse(model_name)
+        model_url = model_name
+        model_name = basename(parsed_url.path).split('.')[0]
+    elif model_name in models:
+        model_url = models[model_name].url
+    else:
+        raise UnknownParserModel(model_name)
+
+    output_path = join(target_directory, model_name)
+    if verbose:
+        print "Fetching model:", model_name, "from", model_url
+        print "Model directory:", output_path
+
+    if exists(output_path):
+        if verbose:
+            print "Model directory already exists, not reinstalling"
+        return output_path
+
+    if verbose:
+        def status_func(blocks, block_size, total_size):
+            amount_downloaded = blocks * block_size
+            if total_size == -1:
+                sys.stdout.write('Downloaded %s\r' % amount_downloaded)
+            else:
+                percent_downloaded = 100 * amount_downloaded / total_size
+                size = amount_downloaded / (1024 ** 2)
+                sys.stdout.write('Downloaded %.1f%% (%.1f MB)\r' % (percent_downloaded, size))
+    else:
+        status_func = None
+    downloaded_filename, headers = urllib.urlretrieve(model_url, reporthook=status_func)
+    if verbose:
+        sys.stdout.write('\rDownload complete' + (' ' * 20) + '\n')
+        print 'Downloaded to temporary file', downloaded_filename
+
+    try:
+        makedirs(output_path)
+    except OSError, ose:
+        if ose.errno != 17:  # 17 == errno.EEXIST: directory already exists
+            raise
+
+    orig_path = getcwd()
+    chdir(output_path)
+    # by convention, all models are currently in tar.bz2 format
+    # we may want to generalize this code later
+    assert downloaded_filename.lower().endswith('.bz2')
+    command = 'tar xvjf %s' % downloaded_filename
+    if verbose:
+        print "Extracting with %r to %s" % (command, output_path)
+    system(command)
+    chdir(orig_path)
+
+    return output_path
+
+def list_models():
+    print len(models), "known unified parsing models: [uncompressed size]"
+    for key, model_info in sorted(models.items()):
+        print '\t%-20s\t%s' % (key, model_info)
+
+def main():
+    from optparse import OptionParser
+    parser = OptionParser(usage="""%prog [options]
+
+Tool to help you download and install BLLIP Parser models.""")
+    parser.add_option("-l", "--list", action='store_true', help="List known parsing models.")
+    parser.add_option("-i", "--install", metavar="NAME", action='append',
+        help="Install a unified parser model.")
+    parser.add_option("-d", "--directory", default='./models', metavar="PATH",
+        help="Directory to install parsing models in (will be created if it doesn't exist). Default: %default")
+
+    (options, args) = parser.parse_args()
+
+    if not (options.list or options.install):
+        parser.print_help()
+        # with no action specified, default to listing the models
+        options.list = True
+        print
+    if options.list:
+        list_models()
+    if options.install:
+        for i, model in enumerate(options.install):
+            if i:
+                print
+            try:
+                download_and_install_model(model, options.directory, verbose=True)
+            except UnknownParserModel, u:
+                print u
+                list_models()
+                sys.exit(1)
+
+if __name__ == "__main__":
+    main()

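One detail worth noting in ``download_and_install_model`` above: when given a
URL instead of a known model name, it derives the model name from the basename
of the URL path. That derivation can be sketched as follows (written with
Python 3's ``urllib.parse`` for illustration; the module itself uses Python
2's ``urlparse``):

```python
from urllib.parse import urlparse
from os.path import basename

def model_name_from_url(model_url):
    """Derive a model name from a model URL, as in
    download_and_install_model(): take the basename of the URL path
    and strip everything after the first dot."""
    parsed_url = urlparse(model_url)
    return basename(parsed_url.path).split('.')[0]
```

For example, the WSJ model URL from the ``models`` table resolves to the name
``BLLIP-WSJ-no-AUX``, which then becomes the install subdirectory.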
python/bllipparser/ParsingShell.py

 
 from bllipparser.RerankingParser import RerankingParser
 
+# TODO should integrate with bllipparser.ModelFetcher
+
 class ParsingShell(Cmd):
     def __init__(self, model):
         Cmd.__init__(self)
         self.prompt = 'rrp> '
         print "Loading models..."
-        self.rrp = RerankingParser.load_unified_model_dir(model)
+        if model is None:
+            self.rrp = None
+        else:
+            self.rrp = RerankingParser.from_unified_model_dir(model)
         self.last_nbest_list = []
 
     def do_visual(self, text):

python/bllipparser/RerankingParser.py

 lower-level (SWIG-generated) CharniakParser and JohnsonReranker modules
 so you don't need to interact with them directly."""
 
-import os.path
+from os.path import exists, join
 import CharniakParser as parser
 import JohnsonReranker as reranker
 
 class ScoredParse:
-    """Represents a single parse and its associated parser probability
-    and reranker score."""
+    """Represents a single parse and its associated parser
+    probability and reranker score. Note that ptb_parse is actually
+    a CharniakParser.InputTree rather than a string (str()ing it will
+    return the actual PTB parse)."""
     def __init__(self, ptb_parse, parser_score=None, reranker_score=None,
                  parser_rank=None, reranker_rank=None):
         self.ptb_parse = ptb_parse
             self.sentrep = parser.tokenize('<s> ' + text_or_tokens + ' </s>',
                                            max_sentence_length)
         else:
+            # text_or_tokens is a sequence -- need to make sure that each
+            # element is a string to avoid crashing
+            text_or_tokens = map(str, text_or_tokens)
             self.sentrep = parser.SentRep(text_or_tokens)
     def get_tokens(self):
         tokens = []
         self._reranked = False
 
     def __getattr__(self, key):
-        """Defer anything unimplemented to our list of ScoredParse objects."""
+        """Delegate everything else to our list of ScoredParse objects."""
         return getattr(self.parses, key)
 
     def sort_by_reranker_scores(self):
             return parser.asNBestList(self._parses)
     def as_reranker_input(self, lowercase=True):
         """Convert the n-best list to an internal structure used as input
-        to the reranker.  You shouldn't typically need to call this."""
+        to the reranker. You shouldn't typically need to call this."""
         return reranker.readNBestList(str(self), lowercase)
 
 class RerankingParser:
     """Wraps the Charniak parser and Johnson reranker into a single
-    object. In general, the RerankingParser is not thread safe."""
+    object. Note that RerankingParser is not thread safe."""
     def __init__(self):
         """Create an empty reranking parser. You'll need to call
-        load_parsing_model() at minimum and load_reranker_model() if
-        you're using the reranker. See also the load_unified_model_dir()
+        load_parser_model() at minimum and load_reranker_model() if
+        you're using the reranker. See also the from_unified_model_dir()
         classmethod which will take care of calling both of these
         for you."""
         self._parser_model_loaded = False
         self.parser_model_dir = None
+        self.parser_options = {}
         self.reranker_model = None
         self._parser_thread_slot = parser.ThreadSlot()
         self.unified_model_dir = None
                 (self.__class__.__name__, self.parser_model_dir,
                  self.reranker_model)
 
-    def load_parsing_model(self, model_dir, language='En',
-                           case_insensitive=False, nbest=50, small_corpus=True,
-                           overparsing=21, debug=0, smoothPos=0):
+    def load_parser_model(self, model_dir, **parser_options):
         """Load the parsing model from model_dir and set parsing
-        options. In general, the default options should suffice. Note
-        that the parser does not allow loading multiple models within
-        the same process."""
+        options. In general, the default options should suffice but see
+        the set_parser_options() method for details. Note that the parser
+        does not allow loading multiple models within the same process
+        (calling this function twice will raise a RuntimeError)."""
         if self._parser_model_loaded:
-            raise ValueError('Parser is already loaded and can only be loaded once.')
-        if not os.path.exists(model_dir):
+            raise RuntimeError('Parser is already loaded and can only be loaded once.')
+        if not exists(model_dir):
             raise ValueError('Parser model directory %r does not exist.' % model_dir)
         self._parser_model_loaded = True
+        self.parser_model_dir = model_dir
         parser.loadModel(model_dir)
-        self.parser_model_dir = model_dir
-        parser.setOptions(language, case_insensitive, nbest, small_corpus,
-                          overparsing, debug, smoothPos)
+        self.set_parser_options(**parser_options)
 
     def load_reranker_model(self, features_filename, weights_filename,
                             feature_class=None):
         """Load the reranker model from its feature and weights files. A feature
         class may optionally be specified."""
-        if not os.path.exists(features_filename):
+        if not exists(features_filename):
             raise ValueError('Reranker features filename %r does not exist.' % \
                 features_filename)
-        if not os.path.exists(weights_filename):
+        if not exists(weights_filename):
             raise ValueError('Reranker weights filename %r does not exist.' % \
                 weights_filename)
         self.reranker_model = reranker.RerankerModel(feature_class,
                                                      features_filename,
                                                      weights_filename)
 
-    def parse(self, sentence, rerank=True, max_sentence_length=399):
+    def parse(self, sentence, rerank='auto', max_sentence_length=399):
         """Parse some text or tokens and return an NBestList with the
-        results.  sentence can be a string or a sequence.  If it is a
-        string, it will be tokenized.  If rerank is True, we will rerank
-        the n-best list."""
-        self.check_loaded_models(rerank)
+        results. sentence can be a string or a sequence. If it is a
+        string, it will be tokenized. If rerank is True, we will rerank
+        the n-best list; if False, the reranker will not be used. rerank
+        can also be set to 'auto', which will only rerank if a reranker
+        model is loaded."""
+        rerank = self._check_loaded_models(rerank)
 
         sentence = Sentence(sentence, max_sentence_length)
         try:
             nbest_list.rerank(self)
         return nbest_list
 
-    def parse_tagged(self, tokens, possible_tags, rerank=True):
-        """Parse some pre-tagged, pre-tokenized text.  tokens is a
-        sequence of strings.  possible_tags is map from token indices
-        to possible POS tags.  Tokens without an entry in possible_tags
-        will be unconstrained by POS.  If rerank is True, we will
-        rerank the n-best list."""
-        self.check_loaded_models(rerank)
+    def parse_tagged(self, tokens, possible_tags, rerank='auto'):
+        """Parse some pre-tagged, pre-tokenized text. tokens must be a
+        sequence of strings. possible_tags is a map from token indices
+        to possible POS tags (strings). Tokens without an entry in
+        possible_tags will be unconstrained by POS. POS tags must be
+        in the terms.txt file in the parsing model or else you will get
+        a ValueError. If rerank is True, we will rerank the n-best list;
+        if False, the reranker will not be used. rerank can also be set
+        to 'auto', which will only rerank if a reranker model is loaded."""
+        rerank = self._check_loaded_models(rerank)
+        if isinstance(tokens, basestring):
+            raise ValueError("tokens must be a sequence, not a string.")
 
         ext_pos = parser.ExtPos()
         for index in range(len(tokens)):
             tags = possible_tags.get(index, [])
             if isinstance(tags, basestring):
                 tags = [tags]
-            ext_pos.addTagConstraints(parser.VectorString(tags))
+            tags = map(str, tags)
+            valid_tags = ext_pos.addTagConstraints(parser.VectorString(tags))
+            if not valid_tags:
+                # at least one of the tags is bad -- find out which ones
+                # and throw a ValueError
+                self._find_bad_tag_and_raise_error(tags)
 
         sentence = Sentence(tokens)
         parses = parser.parse(sentence.sentrep, ext_pos,
             nbest_list.rerank(self)
         return nbest_list
 
-    def check_loaded_models(self, rerank):
+    def _find_bad_tag_and_raise_error(self, tags):
+        ext_pos = parser.ExtPos()
+        bad_tags = set()
+        for tag in set(tags):
+            good_tag = ext_pos.addTagConstraints(parser.VectorString([tag]))
+            if not good_tag:
+                bad_tags.add(tag)
+
+        raise ValueError("Invalid POS tags (not present in the parser's terms.txt file): %s" % ', '.join(sorted(bad_tags)))
+
+    def _check_loaded_models(self, rerank):
+        """Given a reranking mode (True, False, or 'auto'), check that
+        the appropriate models are loaded. Returns whether the reranker
+        should be used (i.e., resolves rerank='auto' to a boolean)."""
         if not self._parser_model_loaded:
             raise ValueError("Parser model has not been loaded.")
-        if rerank and not self.reranker_model:
+        if rerank is True and not self.reranker_model:
             raise ValueError("Reranker model has not been loaded.")
+        if rerank == 'auto':
+            return bool(self.reranker_model)
+        else:
+            return rerank
+
+    def set_parser_options(self, language='En', case_insensitive=False,
+        nbest=50, small_corpus=True, overparsing=21, debug=0, smooth_pos=0):
+        """Set options for the parser. Note that this is called
+        automatically by load_parser_model() so you should only need to
+        call this to update the parsing options. The method returns a
+        dictionary of the new options.
+
+        The options are as follows: language is a string describing
+        the language. Currently, it can be one of En (English), Ch
+        (Chinese), or Ar (Arabic). case_insensitive will make the parser
+        ignore capitalization. nbest is the maximum size of the n-best
+        list. small_corpus=True enables additional smoothing (originally
+        intended for training from small corpora, but helpful in many
+        situations). overparsing determines how much more time the parser
+        will spend on a sentence relative to the time it took to find the
+        first possible complete parse. This affects the speed/accuracy
+        tradeoff. debug takes a non-negative integer. Setting it higher
+        than 0 will cause the parser to print debug messages (surprising,
+        no?). Setting smooth_pos to a number higher than 0 will cause the
+        parser to assign that value as the probability of seeing a known
+        word in a new part-of-speech (one never seen in training)."""
+        if not self._parser_model_loaded:
+            raise RuntimeError('Parser must already be loaded (call load_parser_model() first)')
+
+        parser.setOptions(language, case_insensitive, nbest, small_corpus,
+            overparsing, debug, smooth_pos)
+        self.parser_options = {
+            'language': language,
+            'case_insensitive': case_insensitive,
+            'nbest': nbest,
+            'small_corpus': small_corpus,
+            'overparsing': overparsing,
+            'debug': debug,
+            'smooth_pos': smooth_pos
+        }
+        return self.parser_options
 
     @classmethod
-    def load_unified_model_dir(this_class, model_dir, parsing_options=None,
+    def load_unified_model_dir(this_class, *args, **kwargs):
+        """Deprecated. Use from_unified_model_dir() instead as this
+        method will eventually disappear."""
+        import warnings
+        warnings.warn('RerankingParser.load_unified_model_dir() is deprecated, use RerankingParser.from_unified_model_dir() instead.')
+        return this_class.from_unified_model_dir(*args, **kwargs)
+
+    @classmethod
+    def from_unified_model_dir(this_class, model_dir, parsing_options=None,
         reranker_options=None):
         """Create a RerankingParser from a unified parsing model on disk.
-        A unified parsing model should have the following filesystem structure:
+        A unified parsing model should have the following filesystem
+        structure:
         
         parser/
-            Charniak parser model: should contain pSgT.txt, *.g files,
-            and various others
+            Charniak parser model: should contain pSgT.txt, *.g files
+            among others
         reranker/
-            features.gz -- features for reranker
-            weights.gz -- corresponding weights of those features
+            features.gz or features.bz2 -- features for reranker
+            weights.gz or weights.bz2 -- corresponding weights of those
+            features
         """
         parsing_options = parsing_options or {}
         reranker_options = reranker_options or {}
+        (parser_model_dir, reranker_features_filename,
+         reranker_weights_filename) = get_unified_model_parameters(model_dir)
+
         rrp = this_class()
-        rrp.load_parsing_model(model_dir + '/parser/', **parsing_options)
+        if parser_model_dir:
+            rrp.load_parser_model(parser_model_dir, **parsing_options)
+        if reranker_features_filename and reranker_weights_filename:
+            rrp.load_reranker_model(reranker_features_filename,
+                reranker_weights_filename, **reranker_options)
 
-        reranker_model_dir = model_dir + '/reranker/'
-        features_filename = reranker_model_dir + 'features.gz'
-        weights_filename = reranker_model_dir + 'weights.gz'
-
-        rrp.load_reranker_model(features_filename, weights_filename,
-            **reranker_options)
         rrp.unified_model_dir = model_dir
         return rrp
 
     longer than max_sentence_length tokens, it will be truncated."""
     sentence = Sentence(text)
     return sentence.get_tokens()
+
+def get_unified_model_parameters(model_dir):
+    """Determine the actual parser and reranker model filesystem entries
+    for a unified parsing model. Returns a triple:
+
+    (parser_model_dir, reranker_features_filename,
+     reranker_weights_filename)
+
+    Any of these can be None if that part of the model is not present
+    on disk (though, if you have only one of the reranker model files,
+    the reranker will not be loaded).
+
+    A unified parsing model should have the following filesystem structure:
+
+    parser/
+        Charniak parser model: should contain pSgT.txt, *.g files
+        among others
+    reranker/
+        features.gz or features.bz2 -- features for reranker
+        weights.gz or weights.bz2 -- corresponding weights of those
+        features
+    """
+    if not exists(model_dir):
+        raise IOError("Model directory %r does not exist" % model_dir)
+
+    parser_model_dir = join(model_dir, 'parser')
+    if not exists(parser_model_dir):
+        parser_model_dir = None
+    reranker_model_dir = join(model_dir, 'reranker')
+
+    def get_reranker_model_filename(name):
+        filename = join(reranker_model_dir, '%s.gz' % name)
+        if not exists(filename):
+            # try bz2 version
+            filename = join(reranker_model_dir, '%s.bz2' % name)
+        if not exists(filename):
+            filename = None
+        return filename
+
+    features_filename = get_reranker_model_filename('features')
+    weights_filename = get_reranker_model_filename('weights')
+    return (parser_model_dir, features_filename, weights_filename)
      reranker_wrapper]]
 
 # what's with the -O0? well, using even the lowest levels of optimization
-# (gcc -O1) cause symbols to be inlined and disappear in _JohnsonReranker.so.
-# it's not clear how to fix this at this point.
+# (gcc -O1) causes one symbol which we wrap with SWIG to be inlined and
+# disappear in _JohnsonReranker.so which causes an ImportError.  this will
+# hopefully be addressed in the near future
 reranker_module = Extension('bllipparser._JohnsonReranker',
     sources=reranker_sources,
     extra_compile_args=['-iquote', reranker_base, '-O0'])
 
 setup(name='bllipparser',
-    version='2013.10.16-1',
+    version='2014.02.09',
     description='Python bindings for the BLLIP natural language parser',
     long_description='See http://pypi.python.org/pypi/bllipparser/',
-    author='David McClosky',
-    author_email='notsoweird+pybllipparser@gmail.com',
+    author='Eugene Charniak, Mark Johnson, David McClosky, many others',
+    maintainer='David McClosky',
+    maintainer_email='notsoweird+pybllipparser@gmail.com',
     classifiers=[
         'Development Status :: 4 - Beta',
         'Intended Audience :: Science/Research',