Evaluation Script for DEFT Belief and Sentiment (BeSt) Annotations
This repository contains data structures and an evaluation tool for DEFT belief and sentiment annotations.
Downloading the Software
- Using Git (preferred):
$ git clone firstname.lastname@example.org:dbauer/best_evaluation.git
This requires that you have a bitbucket account and have uploaded your public ssh key.
- Downloading the repository as a zip file: Go to https://bitbucket.org/dbauer/best_evaluation, in the "Navigation" panel on the left, click on "Downloads", then "Download repository".
Compatibility and Dependencies
The code is compatible with Python 2.7 and 3.x and has no external dependencies (except for the Python standard library).
The scoring script best_evaluator.py requires that the
best_evaluator/ directory be in the working directory or have its location
specified in the PYTHONPATH environment variable.
If you use the scorer frequently, you may want to install it using:
pip install . from the main folder. Installation makes the included
scripts, such as best_diagnostics, globally available from any folder
and allows importing parts of the code as a library into your own code:
>>> from best_evaluator import read_best_xml
Running the Scoring Script
usage: best_evaluator.py [-h] [-p] [-s] [--no-null-sources] [-b] [-f] [-v]
                         ere_file gold_file source_file predict_file

Scorer for DEFT belief and sentiment annotations.

positional arguments:
  ere_file              rich ERE XML file or directory
  gold_file             gold belief and sentiment XML file or directory
  source_file           source file or directory
  predict_file          predicted belief and sentiment XML file or directory

optional arguments:
  -h, --help            show this help message and exit
  -p, --partial-provenance
                        give partial credit for provenance lists. If this
                        flag is not set, full credit is given if a single
                        mention in the provenance list matches
  -s, --sentiment-only  score only sentiment annotations
  --no-null-sources     ignore belief and sentiment annotations with null
                        sources
  -b, --belief-only     score only belief annotations
  -f, --per-file        print per-file scores (batch mode only)
  -v, --verbose         show debugging output
The script can be run in single-file or in batch processing mode.
- In single-file mode, the positional arguments must specify an ERE XML file, a gold BeSt XML file, a source file, and a predicted BeSt XML file.
$ python best_evaluator.py ere_path/4f7eedf44076ea050d7db3715f9333fa.rich_ere.xml gold_path/4f7eedf44076ea050d7db3715f9333fa.best.xml source_path/4f7eedf44076ea050d7db3715f9333fa.xml predict_path/4f7eedf44076ea050d7db3715f9333fa.best.xml
- In batch mode, the positional arguments specify four directories containing a set of ERE files, gold BeSt files, source files, and predicted BeSt files.
$ python best_evaluator.py ere_path gold_path source_path predict_path

For each file in the gold directory that ends in .best.xml, the script finds the corresponding ERE file in the ERE directory (same prefix, but ending in .rich_ere.xml) and the corresponding predicted BeSt file in the directory of predicted files (identical filename to the gold annotation). In batch mode, the scoring script reports both micro- and macro-averaged results. The -f parameter can be specified to print per-file scores.

Important: The batch evaluation assumes that the references (entities/events/relations/mentions) in each gold and predicted file exist in the (single) corresponding ERE file. In the LDC belief and sentiment data, some source files are split into multiple sections (for example, ENG_DF_... files). The evaluation script is not aware of these splits; it treats each split as its own ERE/gold/prediction file triple.
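The filename matching used in batch mode can be sketched as follows. This is a hypothetical helper illustrating only the naming convention described above (.best.xml gold files mapping to .rich_ere.xml ERE files and identically named prediction files), not the scorer's actual code:

```python
def match_batch_files(gold_dir_files):
    """For each gold filename ending in .best.xml, derive the expected
    ERE and prediction filenames (same prefix, different suffix)."""
    triples = []
    for name in gold_dir_files:
        if not name.endswith(".best.xml"):
            continue  # non-gold files in the directory are skipped
        prefix = name[: -len(".best.xml")]
        ere_name = prefix + ".rich_ere.xml"  # looked up in the ERE directory
        predict_name = name                  # identical filename in the prediction directory
        triples.append((ere_name, name, predict_name))
    return triples
```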
Evaluation Conditions (Provenance List)
There are two evaluation conditions:
- By default, the provenance list of each private state tuple is scored as one-is-enough (a single matching mention in the provenance list earns the full score).
- When the -p flag is set, the scorer weights the full score by the F-score of the provenance list.
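The two conditions can be illustrated with a small sketch. The helper below is hypothetical and only mirrors the description above; the scorer's internal representation of provenance lists may differ:

```python
def provenance_credit(gold_prov, pred_prov, partial=False):
    """Credit for a predicted provenance list against a gold one.

    Default (one-is-enough): full credit if any predicted mention
    appears in the gold list. With partial=True (the -p flag),
    credit is the F-score of the predicted list against the gold list.
    """
    gold, pred = set(gold_prov), set(pred_prov)
    overlap = len(gold & pred)
    if not partial:
        return 1.0 if overlap else 0.0
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

For example, predicting one of two gold mentions plus one spurious mention yields full credit by default, but an F-score of 0.5 under -p.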
Scoring Belief and Sentiment Separately
By default, the script scores both belief and sentiment annotations and reports a single result. The parameter -b limits scoring to belief annotations (sentiment annotations in gold and prediction are ignored). The parameter -s limits scoring to sentiment annotations.
Belief and Sentiment Annotations with Missing Sources
The script evaluates sentiment and belief annotations without an explicit source. The predicted output is expected to contain such annotations without a 'source' attribute. The --no-null-sources parameter deactivates evaluation of annotations with missing sources, so that such annotations are ignored.
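As an illustration, the effect of --no-null-sources is to drop annotations lacking a 'source' attribute. The sketch below uses a made-up element layout (the actual BeSt XML schema may differ) to show the filtering idea:

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature example, not the real BeSt schema:
SAMPLE = """
<annotations>
  <sentiment source="ent-1" target="ent-2" polarity="pos"/>
  <sentiment target="ent-3" polarity="neg"/>
</annotations>
"""

def drop_null_sources(xml_text):
    """Keep only annotations that carry an explicit 'source' attribute."""
    root = ET.fromstring(xml_text)
    return [el for el in root if el.get("source") is not None]
```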
Diagnostics Script for DEFT Belief and Sentiment (BeSt) Annotations
The diagnostics script reports how well a system detects the correct source/target pairs (unlike the evaluation script, it never gives partial credit). It also reports scores broken down by target mention type (entity, relation, and event). Finally, it reports the accuracy of predicting the right belief/sentiment type when the source/target pair is correct (shown as RTS in the output).
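The RTS figure can be thought of as a conditional accuracy: among correctly detected source/target pairs, how often is the predicted label right? A minimal sketch, assuming a hypothetical representation of each annotation as a (source, target, label) triple:

```python
def rts_accuracy(gold, predicted):
    """Accuracy of the predicted belief/sentiment label, computed only
    over source/target pairs that were detected correctly."""
    gold_by_pair = {(s, t): label for s, t, label in gold}
    matched = correct = 0
    for s, t, label in predicted:
        if (s, t) in gold_by_pair:
            matched += 1  # source/target pair detected correctly
            if label == gold_by_pair[(s, t)]:
                correct += 1  # and the label is right too
    return correct / matched if matched else 0.0
```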
Running the Diagnostics Script
usage: diagnostics.py [-s] [-b] gold_file_dir predict_file_dir

Diagnostics for DEFT belief and sentiment annotations, based only on the
percentage of correct source and target pairs.

positional arguments:
  gold_file_dir         gold belief and sentiment XML directory
  predict_file_dir      predicted belief and sentiment XML directory

optional arguments:
  -s, --sentiment-only  diagnose only sentiment annotations
  -b, --belief-only     diagnose only belief annotations