PyNLP -- parsing and distributional similarity package for Python
PyNLP is a collection of Python/C/C++ modules that provide a common framework for discriminative parsing and distributional similarity-based clustering.
PyNLP is currently licensed for noncommercial purposes only. Special thanks to Helmut Schmid for allowing the inclusion of parts of BitPar and SFST in this distribution.
This distribution includes (parts of) the following software:
- BitPar (by Helmut Schmid; noncommercial only)
  http://www.ims.uni-stuttgart.de/tcl/SOFTWARE/BitPar.html
- SFST (by Helmut Schmid; GPL)
  http://www.ims.uni-stuttgart.de/projekte/gramotron/SOFTWARE/SFST.html
- FlexModule/BisonModule (Tommy McGuire; Pythonesque license)
  http://www.crsr.net/Software/FBModule.html
- Simple C++ wrapper for PCRE (Peter Petersen; BSD-like license)
  [different versions of this one floating around on the Internet.]
NOTE: the library modules that use BitPar and SFST are based on older versions of these programs and recent versions of these may have different functionality (in the case of BitPar) or use incompatible file formats (SFST). In particular, most of the functionality in the 'bitpar' C++ module that is beyond parsing bit vector charts (i.e., head finding, extraction of forests, discriminative parsing) is not present in BitPar, whereas reranking is present in newer versions of BitPar, but not in the module.
The following language resources can be used with PyNLP:
- old SMOR/IMSLex (German morphology; proprietary)
  cannot be redistributed
- SMOR with the IDS lexicon (German morphology) -- not done yet
  Note that this may yield slightly different results from using SMOR with the (non-redistributable) IMSLex lexicon.
- Morph-It (Italian morphology)
The easiest way to install this is to install the necessary Python packages via Debian/Ubuntu's apt-get and then set up the package in a virtualenv (so you can install new versions without having to modify the central Python installation).
To actually do useful things, you need to untar the pytree_data.tgz tarball somewhere and point the PYNLP environment variable to that directory:

    export PYNLP=~/tmp/pytree_data
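Because a missing or mistyped PYNLP setting tends to surface only later as a confusing file-not-found error, a small check up front can help. The following is a minimal sketch, not part of PyNLP: the function name pynlp_data_dir is hypothetical, and the only thing it relies on is the PYNLP variable described above.

```python
import os


def pynlp_data_dir(environ=os.environ):
    """Return the directory named by PYNLP, or raise if it is unusable.

    Hypothetical helper for illustration; PyNLP itself may resolve the
    variable differently.
    """
    path = environ.get("PYNLP")
    if not path:
        raise RuntimeError(
            "PYNLP is not set; point it at the unpacked pytree_data directory")
    # Expand a leading ~ in case the value was set without shell expansion.
    path = os.path.expanduser(path)
    if not os.path.isdir(path):
        raise RuntimeError("PYNLP=%r is not a directory" % path)
    return path
```

Calling pynlp_data_dir() before running any of the scripts gives an early, readable error instead of a failure deep inside the parser.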
For building the software, you will need the following packages:

    gcc
    g++
    python-dev
    libpcre++-dev
    python-numpy
    bison
    flex
    python-virtualenv (optional)
    cython (optional)
You will also need the dti decision tree package from Christian Borgelt: http://www.borgelt.net/doc/dtree/dtree.html
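Before starting a build it can be useful to confirm that the command-line tools from the list above are actually on the PATH. The sketch below is an illustration only: the list of executable names is an assumption derived from the package names, it needs Python 3.3 or later for shutil.which, and it cannot detect missing development headers such as python-dev or libpcre++-dev.

```python
import shutil

# Executables implied by the package list; mapping Debian package names to
# command names is an assumption of this sketch.
BUILD_TOOLS = ["gcc", "g++", "bison", "flex"]


def missing_tools(tools=BUILD_TOOLS, which=shutil.which):
    """Return the subset of `tools` that cannot be found on the PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("missing build tools: " + ", ".join(missing))
    else:
        print("all build tools found")
```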
To build and set up everything:

    cd ~/tmp
    virtualenv parser_env
    cd ~/sources/pytree-package
    ~/tmp/parser_env/bin/python setup.py install
    cd tests
    ~/tmp/parser_env/bin/python test-all.py
To build and set up a system installation, it's just:

    python setup.py build
    sudo python setup.py install
    cd tests
    python test-all.py
After PyTree is installed, you can run "make html" in the "docs/" subdirectory to build API documentation in nice HTML format if you have installed the Sphinx document formatting suite.
Parsing with an existing grammar
To do lexicalized parsing with an existing grammar, use the script parse_lex.py, for example:

    python parse_lex.py -l de.tiger -g grammar/full -m grammar/all.model

where:
- -l gives the settings bundle for PyNLP (which sets the right morphological prediction, head table, etc.)
- -g gives the path to the files for the first stage PCFG grammar
- -m gives the path to the discriminative model parameters
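If the parser is called from another Python program, the command line above can be assembled as an argument list. This is a sketch under stated assumptions: the helper names are hypothetical, only the -l, -g and -m options documented above are used, and how parse_lex.py receives its input sentences is not covered here.

```python
import subprocess
import sys


def parse_lex_command(bundle, grammar, model, script="parse_lex.py"):
    """Build the argument list for the parse_lex.py call shown above."""
    return [sys.executable, script, "-l", bundle, "-g", grammar, "-m", model]


def run_parser(bundle, grammar, model):
    """Run parse_lex.py and raise CalledProcessError if it exits non-zero."""
    subprocess.check_call(parse_lex_command(bundle, grammar, model))
```

For the example above this would be run_parser("de.tiger", "grammar/full", "grammar/all.model"). Using sys.executable keeps the call inside the same interpreter (for instance the one from the virtualenv) instead of whatever "python" happens to be first on the PATH.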
Retraining the parser with an existing grammar
To retrain the parser, several steps are necessary:
- create the annotated trees from the treebank
  for Tiger, this is done by the script tiger2mrg.py which will read a Negra export file, transform it including projectivizing the tree, and then write a bracketed file with edge label information.
- create base PCFG grammars for the n-fold crosstraining
  see scripts/parse_all.sh for how to do this.
- extract the discriminative training data for each fold:

      python train_lex.py -l de.tiger -g fold3 -o fold3/fold.event fold3/heldout.mrg

  (lather, rinse, repeat for fold1..fold5 and full)
- join the discriminative training data and apply frequency threshold:

      amis_tools/treelexer -t 5 -l all.in -o all.event fold*/fold.event

- run AMIS to estimate model parameters

      amis all.conf

- reward yourself with a cookie: you should now have
  - the PCFG grammar in .../full/
  - the discriminative model in .../all.model
Adapting to other grammars or languages
Basically, you want to have the following:
- a script for treebank conversion and annotation
  Look at tiger2mrg.py for an example of how this could look. Basically, you would want to
  - have edge labels that (minimally) distinguish between modifiers on one hand, different types of arguments on the other hand, and conjuncts as well.
  - markovize coordinations (look at the function with that name in tiger2mrg; basically, find coordinations and markovize them so that they are distinct from other phrases with linear markovization).
  - use generic_markovize to do markovization with GF marking, which is head-based (if you have a head table).
  - write everything out in VPF format (again, look at tiger2mrg.py for how to do this).
- a head table (already used in step 1)
  look at pynlp/*/*_heads.py for ideas on how to do this.
- a POS prediction component
  look at pynlp/de/smor_pos.py and pynlp/it/morphit_pos.py for examples.
- a regex list
  look at data/german-regex.in and data/italian-regex.in for examples.
- one settings bundle to rule them all (or two)
  modify pynlp/__init__.py so that PyNLP can actually find all those components.
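To give an idea of what a head table does, here is a minimal sketch of priority-based head finding in the usual style of head rules: each phrase category lists a search direction and the child labels to prefer. It is not the format used in pynlp/*/*_heads.py; the table contents, the function name find_head, and the fallback behaviour are all assumptions chosen for illustration (the labels are STTS-style tags as found in Tiger).

```python
# Hypothetical head table: category -> (search direction, preferred labels).
# The entries are illustrative and not taken from PyNLP.
HEAD_TABLE = {
    "S": ("right", ["VVFIN", "VAFIN", "VMFIN", "VP"]),
    "NP": ("right", ["NN", "NE", "PPER"]),
    "PP": ("left", ["APPR", "APPRART"]),
}


def find_head(category, child_labels, table=HEAD_TABLE):
    """Return the index of the head child, or None if there are no children.

    Labels are tried in priority order; for each label the children are
    scanned in the direction given by the table. If nothing matches, the
    first child in scan order is used as a fallback.
    """
    direction, priorities = table.get(category, ("left", []))
    indices = list(range(len(child_labels)))
    if direction == "right":
        indices.reverse()
    for wanted in priorities:
        for i in indices:
            if child_labels[i] == wanted:
                return i
    return indices[0] if indices else None
```

For example, find_head("NP", ["ART", "ADJA", "NN"]) picks the noun at index 2, while an unknown category falls back to its leftmost child.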