# bllip-parser /

Filename Size Date modified Message
first-stage
python/bllipparser
sample-text
second-stage
1.8 KB
164 B
11.1 KB
288 B
22.8 KB
19.9 KB
19.9 KB
791 B
9.1 KB
3.2 KB
1.3 KB
3.3 KB
3.4 KB
864 B
869 B
1.1 KB
864 B
16.8 KB
3.7 KB
3.1 KB

1. Mark Johnson,Eugene Charniak, 24th November 2005 --- August 2006

We request acknowledgement in any publications that make use of this software and any code derived from this software. Please report the release date of the software that you are using (this is part of the name of the tar file you have downloaded), as this will enable others to compare their results to yours.

NEW!!! The first stage parser, which uses about 95% of the time, is now multi-treaded. The default is two threads. Currently the maximum is four. To change the number (or maximum) see the README file for the first-stage parser. For the time being a non-threaded version is available in case there are problems with threads. Send email to ec if you have problems. See below for details.

## COMPILING THE PARSER

To compile the two-stage parser, first define GCCFLAGS appropriately for your machine, e.g., with csh or tcsh

> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx"

or

> setenv GCCFLAGS "-march=opteron -m64"

(if unsure, it is safe to leave GCCFLAGS unset -- it just won't be as optimized as possible)

Then execute

make

After it has built, the parser can be run with

> parse.sh <sourcefile.txt>

E.g.,

> parse.sh sample-text/sample-data.txt

The script parse-eval.sh takes a list of treebank files as arguments and extracts the terminal strings from them, runs the two-stage parser on those terminal strings and then evaluates the parsing accuracy with the Sparseval program[*]. For example, on my machine the Penn Treebank 3 CD-ROM is installed at /usr/local/data/Penn3/, so the following code evaluates the two-stage parser on section 24.

> parse-eval.sh /usr/local/data/Penn3/parsed/mrg/wsj/24/wsj*.mrg

[*] Sparseval is available from
http://old-site.clsp.jhu.edu/ws2005/groups/eventdetect/files/SParseval.tgz

title={SParseval: Evaluation metrics for parsing speech}, author={Roark, Brian and Harper, Mary and Charniak, Eugene and Dorr, Bonnie and Johnson, Mark and Kahn, Jeremy G and Liu, Yang and Ostendorf, Mari and Hale, John and Krasnyanskaya, Anna and others}, booktitle={Proc. LREC}, year={2006}

}

We no longer distribute evalb with the parser, but it is still available: http://nlp.cs.nyu.edu/evalb/

## TRAINING THE RERANKER

Retraining the reranker takes a considerable amount of time, disk space and RAM. At Brown we use a dual Opteron machine with 16Gb RAM, and it takes around two days. You should be able to do it with only 8Gb RAM, and maybe even with 4Gb RAM with an appropriately tweaked kernel (e.g., sysctl overcommit_memory, and a so-called 4Gb/4Gb split if you're using a 32-bit OS).

The time and memory you need depend on the features that the reranker extracts and the size of the n-best tree training and development data. You can change the features that are extracted by changing second-stage/programs/features/features.h, and you can reduce the size of the n-best tree data by reducing NPARSES in the Makefile from 50 to, say, 25.

You will need to edit the Makefile in order to retrain the reranker.

First, you need to set the variable PENNWSJTREEBANK in Makefile to the directory that holds your version of the Penn WSJ Treebank. On my machine this is:

PENNWSJTREEBANK=/usr/local/data/Penn3/parsed/mrg/wsj/

On estimators: There are multiple estimators one can use when retraining the reranker. cvlm and cvlm-owlqn are the main ones of interest. cvlm-owlqn is significantly faster than cvlm but unfortunately has been removed from this distribution due to licensing conflicts (but see the license section at the bottom of this file).

If you're using cvlm as your estimator (the default), you'll also need the Boost C++ and the Petsc/Tao C++ libraries in order to retrain the reranker. If you're using cvlm-owlqn as your estimator, you can ignore these steps. Install instructions for Petsc/Tao are given later in this document. The environment variables PETSC_DIR and TAO_DIR should all point to the installation directories of this software. I define these variables in my .login file as follows on my machine.

setenv PETSC_DIR /usr/local/share/petsc setenv TAO_DIR /usr/local/share/tao setenv PETSC_ARCH linux setenv BOPT O_c++

While many modern Linux distributions come with the Boost C++ libraries pre-installed, if the Boost C++ libraries are not included in your standard libraries and headers, you will need to install them and add an include file specification for them in your GCCFLAGS. For example, if you have installed the Boost C++ libraries in /home/mj/C++/boost, then your GCCFLAGS environment variable should be something like:

> setenv GCCFLAGS "-march=pentium4 -mfpmath=sse -msse2 -mmmx -I /home/mj/C++/boost"

or

> setenv GCCFLAGS "-march=opteron -m64 -I /home/mj/C++/boost"

Once this is set up, you retrain the reranker as follows:

> make reranker > make nbesttrain > make eval-reranker

The script train-eval-reranker.sh does all of this.

The reranker goal builds all of the programs, nbesttrain constructs the 20 folds of n-best parses required for training, and eval-reranker extracts features, estimates their weights and evaluates the reranker's performance on the development data (dev) and the two test data sets (test1 and test2).

If you have a parallel processor, you can run 2 (or more) jobs in parallel by running

> make -j 2 nbesttrain

Currently this only helps for nbesttrain (but this is the slowest step, so maybe this is not so bad).

The Makefile contains a number of variables that control how the training process works. The most important of these is the VERSION variable. You should do all of your experiments with VERSION=nonfinal, and only run with VERSION=final once to produce results for publication.

If VERSION is nonfinal then the reranker trains on WSJ PTB sections 2-19, sections 20-21 are used for development, section 22 is used as test1 and section 24 is used as test2 (this approximately replicates the Collins 2000 setup).

If VERSION is final then the reranker trains on WSJ PTB sections 2-21, section 24 is used for development, section 22 is used as test1 and section 23 is used as test2.

The Makefile also contains variables you may want to change, such as NBEST, which specfies how many parses per sentence are extracted from each sentence, and NFOLDS, which specifies how many folds are created.

If you decide to experiment with new features or new feature weight estimators, take a close look at the Makefile. If you change the features please also change FEATURESNICKNAME; this way your new features won't over-write our existing ones. Similarly, if you change the feature weight estimator please pick a new ESTIMATORNICKNAME and if you change the n-best parser please pick a new NBESTPARSERNICKNAME; this way you new n-best parses or feature weights won't over-write the existing ones.

To get rid of (many of) the object files produced in compilation, run:

> make clean

Training, especially constructing the 20 folds of n-best parses, produces a lot of temporary files which you can remove if you want to. To remove the temporary files used to construct the 20 fold n-best parses, run:

> make nbesttrain-clean

All of the information needed by the reranker is in second-stage/models. To remove everything except the information needed for running the reranking parser, run:

> make train-clean

To clean up everything, including the data needed for running the reranking parser, run:

> make real-clean

To use the non-threaded parser instead change the following line in the Makefile

NBESTPARSER=first-stage/PARSE/parseIt

That is, it is identical except for the "o" in oparseIt

Then run oparse.sh, rather than parse.sh.

## INSTALLING PETSC AND TAO

If you're using cvlm as your estimator, you'll need to have PETSc and Tao installed in order to retrain the reranker. Otherwise, you can safely ignore this section.

These installation instructions work for gcc version 4.2.1 (you also need g++ and gfortran).

1. Unpack PETSc and TAO somewhere, and make shell variables point to those directories (put the shell variable definitions in your .bash_profile or equivalent)

export PETSC_DIR=/usr/local/share/petsc export TAO_DIR=/usr/local/share/tao export PETSC_ARCH="linux" export BOPT=O_c++

cd /usr/local/share ln -s petsc-2.3.3-p6 petsc ln -s tao-1.9 tao

1. Configure and build PETSc

cd petsc FLAGS="-march=native -mfpmath=sse -msse2 -mmmx -O3 -ffast-math" ./config/configure.py --with-cc=gcc --with-fc=gfortran --with-cxx=g++ --download-f-blas-lapack=1 --with-mpi=0 --with-clanguage=C++ --with-shared=1 --with-dynamic=1 --with-debugging=0 --with-x=0 --with-x11=0 COPTFLAGS=$FLAGS FOPTFLAGS=$FLAGS CXXOPTFLAGS=\$FLAGS make all

1. Configure and build TAO

cd ../tao make all