* Copyright 1999, 2000, 2001, 2005, 2006 Brown University, Providence, RI.
* Licensed under the Apache License, Version 2.0 (the "License"); you may
* not use this file except in compliance with the License. You may obtain
* a copy of the License at
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
* WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
* License for the specific language governing permissions and limitations
* under the License.
1. Basic Usage
The parser (which is to be found in the sub-directory PARSE) expects
sentences delimited by <s> ... </s>, and outputs the parsed versions in
Penn treebank style. The <s> and </s> must be separated by spaces
from all other text. So if the input is
<s> (``He'll work at the factory.'') </s>
the output will be (to cout):
(S1 (PRN (-LRB- -LRB-) (S (`` ``) (NP (PRP He)) (VP (MD 'll) (VP (VB work) (PP (IN at) (NP (DT the) (NN factory))))) (. .) ('' '')) (-RRB- -RRB-)))
If you want to make it slightly easier for humans to read, use the command
line argument -P (for pretty print), in which case you will get:
(S1 (PRN (-LRB- -LRB-)
(S (`` ``)
(NP (PRP He))
(VP (MD 'll)
(VP (VB work) (PP (IN at) (NP (DT the) (NN factory)))))
The parser will take input from either cin, or, if given the name
of a file, from that file. So in the latter case the call to
the parser would be:
shell> parseIt <path to directory with parsing statistics> <file of sentences>
shell> parseIt ../DATA/EN/ testSentence
(Note that as the parser now has three separate DATA files, one each
for English, Chinese, and English Language Modeling, the DATA directories
have been made separate subdirectories under the directory DATA).
As indicated above, the parser will first tokenize the input.
If you do NOT want to to tokenize (for some reason you are
handing it pretokenized input, as you would do if you were
testing it's performance on the tree-bank), give it a -K option.
2. Compilation instructions
shell> cd TRAIN
shell> make all
shell> cd ..
shell> cd PARSE
shell> make all
3. N-best Parsing
This version of the parser can produce multiple-best parses. So if
you what 50 alternative parses rather than just one, just add -N50
to the command line.
In multiparse mode the output format is slightly different;
The sentence indicator string will typically just a sentence number.
However, if the input to the parser is of the form <s sentence-id >
... </s>, then the sentence-id proveded will be used instead. This is
useful if, e.g., you want to know where article boundaries are.
4. Other options
-S tells the parser to remain silent when it cannot parse a sentence
(it just goes on to the next one.
The parser can now parse Chinese. It requires that the Chinese
characters already be grouped into words. Assuming you have
trained on the Chinese Tree-bank from LDC (see the README for
the TRAIN programs), you tell the parser to be expecting Chinese
by giving it the command-line option -LCh. (The default is
English, which is also be specified by -LEn .) The files you
need to train Chinese are in DATA/CH/.
The parser will ignore any sentence consisting of > 100 words +
punctuation. To change this to, say 200 you give it the on-line
The parser is set to be case sensitive. To make it case insensitive
add the command-line flat -C .
Currently there are various array sizes that make 400 the absolute
maximum sentence length. To allow for longer sentences change (in
#define MAXSENTLEN 400
Similarly to allow for a larger dictionary of words from training increase
#define MAXNUMWORDS 50000
To see debugging information give it the on-line argument -d#
where # is a number > 10. As the numbers get larger, the verbosity of
the information increases.
There is a subdirectory TRAIN which contains the programs used to
collect the statistics the parser requires from tree-bank data. As
the parser comes with the statistics it needs you will only need this
if you want to try experiments with the parser on more (or less, or
different) tree-bank data. For more information see the README file
6. Language Modeling
To use the parser as the language model described in Charniak 2001
(Proceedings of ACL) you must first retrain the data using the
settings found in DATA/LM/.
Then give parseIt a -M command-line argument. If the data is from
speech, and thus all one case, also use the -C flag.
The output in -M mode is of the form:
log-grammar-probability log-trigram-probability log-mixed-probability
Again, if the data is from speech and has a limited vocabulary, it
will often be the case that the parser will have a very difficult time
finding a parse because of incorrect words (or, in simulated speech
output, the presence of "unk" the unknown word replacement), and there
will be many parses with equally bad probabilities. In such cases the
pruning that keeps memory in bounds for 50-best parsing fails. So
just use 1-best, or maybe 10 best.
7. Faster Parsing
The defaulty speed/accuracy setting should give you the results in the
published papers. It is, however, easy to get faster parsing at the
expense of some accuracy. So a command-line argument of -T50 costs
you about a percent of parsing accuracy, but rather than 1.4
sentences/second you will get better than 6 sentences/second. (The
default is -T210.)
8. Multi-threaded version
[Update 2013] Using multiple threads is not currently recommended as
there appear to be thread safety issues.
parseIt is multi threaded. It currently assumes two threads (for dual
processors). To change this, use the command line argument, -t4 to
have it use, e,g, 4 threads. Currently the maximum number of threads
allowed is 4. To change this change the following line in Features.h
and recompile parseIt.
#define MAXNUMTHREADS 4
evalTree takes penn-treebank parse trees from cin, and outputs to cout
for each tree, one per line.
shell> evalTree <path to directory with parsing statistics>
If the tree is assigned zero probability it returns 0 for the log2
For reasons that would take us too far afield, about 13% of the time
it returns a probability that is too high. If you want to be warned
when it is doing this, give evalTree a -W command-line argument and
the output will have an "!" at the end of the line.
10. Parsing from tagged input
This can now be done using a command such as the following:
shell> parseIt -K -Einput.tags <model dir/> input.sgml
and input.sgml looks like this:
<s> This is a test sentence . </s>
and input.tags looks like this:
Each token is given a list of zero or more tags and sentences are separated by "---". If a token is given zero tags, the standard tagging mechanism will be employed. If a token is given multiple tags, they will each be considered.
Note that the tokenization must match exactly between these files (tokens are space-separated in input.sgml). To ensure that tokenization matches, you should pretokenize your input and supply the -K flag.
11. Frequently confusing errors
a. If parser provides no output at all:
This is most likely caused by not having spaces around the <s> and </s> brackets, i.e.
<s>This is a test sentence.</s>
<s> This is a test sentence. </s>
b. When retraining: "Couldn't find term: _____
pSgT: InputTree.C:206: InputTree* InputTree::newParse(std::istream&, int&, InputTree*): Assertion `Term::get(trm)' failed."
This means the training data contains an unknown term (phrasal or
part of speech type). You'll need to add the appropriate entry to
terms.txt in the model you're training.