Evaluation Script for DEFT Belief and Sentiment (BeSt) Annotations

This repository contains data structures and an evaluation tool for DEFT belief and sentiment annotations.

Downloading the Software

  • Using Git (preferred): $ git clone git@bitbucket.org:dbauer/best_evaluation.git. This requires a Bitbucket account with your public SSH key uploaded.
  • Downloading the repository as a zip file: Go to https://bitbucket.org/dbauer/best_evaluation, click "Downloads" in the "Navigation" panel on the left, then "Download repository".

Compatibility and Dependencies

The code is compatible with Python 2.7 and 3.x and has no dependencies beyond the Python standard library. The scoring script best_evaluator.py requires that the best_evaluator/ directory be in the working directory or that its location be listed in the PYTHONPATH environment variable.
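
For example, assuming the repository was cloned to ~/best_evaluation (a hypothetical path), the scorer can be run from any folder with:

    $ PYTHONPATH=~/best_evaluation python ~/best_evaluation/best_evaluator.py -h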

Installation

If you use the scorer frequently, you may want to install it by running pip install . from the main folder. Installation makes the scripts best_evaluator and best_diagnostics globally available from any folder and allows importing parts of the code as a library into your own code:

>>> from best_evaluator import read_best_xml
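
As a minimal sketch of library use (the exact signature and return type of read_best_xml are an assumption here; check the module itself):

    >>> # Hypothetical call: parse a gold BeSt XML file into annotation objects.
    >>> annotations = read_best_xml("gold_path/4f7eedf44076ea050d7db3715f9333fa.best.xml")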

Running the Scoring Script

usage: best_evaluator.py [-h] [-p] [-s] [--no-null-sources] [-b] [-f] [-v]
                         ere_file gold_file source_file predict_file

Scorer for DEFT belief and sentiment annotations.

positional arguments:
  ere_file              rich ERE XML file or directory
  gold_file             gold belief and sentiment XML file or directory
  source_file           source file or directory
  predict_file          predicted belief and sentiment XML file or directory

optional arguments:
  -h, --help            show this help message and exit
  -p, --partial-provenance
                        give partial credit for provenance lists. If this flag
                        is not set full credit is given if a single mention in
                        the provenance list matches
  -s, --sentiment-only  score only sentiment annotations
  --no-null-sources     ignore belief and sentiment annotations with null
                        sources
  -b, --belief-only     score only belief annotations
  -f, --per-file        print per-file scores (batch mode only)
  -v, --verbose         show debugging output

The script can be run in single-file or in batch processing mode.

  • In single-file mode, the positional arguments must specify an ERE XML file, a gold BeSt XML file, a source file, and a predicted BeSt XML file, for example:

    $ python best_evaluator.py ere_path/4f7eedf44076ea050d7db3715f9333fa.rich_ere.xml gold_path/4f7eedf44076ea050d7db3715f9333fa.best.xml source_path/4f7eedf44076ea050d7db3715f9333fa.xml predict_path/4f7eedf44076ea050d7db3715f9333fa.best.xml
  • In batch mode, the positional arguments specify four directories containing a set of ERE files, gold BeSt files, source files, and predicted BeSt files, for example:

    $ python best_evaluator.py ere_path gold_path source_path predict_path

    For each file in the gold directory that ends in .best.xml, the script finds the corresponding ERE file in the ERE directory (same prefix, but ending in .rich_ere.xml) and the corresponding predicted BeSt file in the directory of predicted files (identical filename as the gold annotation); a sketch of this matching appears after this list. In batch mode, the scoring script reports both micro- and macro-averaged results. The -f parameter can be specified to print per-file scores.

    Important: Batch evaluation assumes that the references (entities/events/relations/mentions) in each gold and predicted file exist in the (single) corresponding ERE file. In the LDC belief and sentiment data, some source files are split into multiple sections (for example, ENG_DF_... files). The evaluation script is not aware of these splits: it treats each split separately as its own ERE/gold/prediction file triple.
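
A minimal sketch of this file-matching convention, assuming the directory layout above (the names below are illustrative, not the script's actual internals):

    import os

    ere_dir, gold_dir, predict_dir = "ere_path", "gold_path", "predict_path"

    for name in sorted(os.listdir(gold_dir)):
        if not name.endswith(".best.xml"):
            continue  # only gold BeSt files are considered
        prefix = name[:-len(".best.xml")]
        # Corresponding ERE file: same prefix, .rich_ere.xml suffix.
        ere_file = os.path.join(ere_dir, prefix + ".rich_ere.xml")
        # Corresponding prediction: identical filename as the gold annotation.
        predict_file = os.path.join(predict_dir, name)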

Evaluation Conditions (Provenance List)

There are two evaluation conditions:

  • By default, the provenance list of each private state tuple is scored as one-is-enough: a single matching mention in the provenance list earns the full score.
  • When the -p flag is set, the scorer weights the full score by the F-score of the provenance list (see the sketch after this list).
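
The following sketch illustrates the two conditions, assuming provenance lists can be compared as sets of mentions (the function is illustrative, not the scorer's actual internals):

    def provenance_credit(gold_mentions, predicted_mentions, partial=False):
        """Credit for the provenance list of one private state tuple."""
        matched = set(gold_mentions) & set(predicted_mentions)
        if not matched:
            return 0.0
        if not partial:
            # Default condition: one matching mention is enough for full credit.
            return 1.0
        # -p condition: weight the credit by the provenance list's F-score.
        precision = len(matched) / len(set(predicted_mentions))
        recall = len(matched) / len(set(gold_mentions))
        return 2 * precision * recall / (precision + recall)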

Scoring Belief and Sentiment Separately

By default, the script scores both belief and sentiment annotations and reports a single result. The parameter -b limits scoring to belief annotations (sentiment annotations in gold and prediction are ignored). The parameter -s limits scoring to sentiment annotations.
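
For example, to score only belief annotations in batch mode (paths as above):

    $ python best_evaluator.py -b ere_path gold_path source_path predict_path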

Belief and Sentiment Annotations with Missing Sources

By default, the script also evaluates sentiment and belief annotations without an explicit source; the predicted output is expected to contain such annotations without a 'source' attribute. The --no-null-sources parameter deactivates evaluation of annotations with missing sources, so that they are ignored.
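
For example:

    $ python best_evaluator.py --no-null-sources ere_path gold_path source_path predict_path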

Diagnostics Script for DEFT Belief and Sentiment (BeSt) Annotations

The diagnostics script reports how well you do at detecting the correct source/target pairs (unlike the evaluation script, it never gives partial credit). It also gives a separate score for each mention type (entity, relation, and event). Finally, it reports your accuracy at predicting the right belief/sentiment type on those pairs you got right (RTS in the printed output); a sketch of this measure follows below.
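
A minimal sketch of the RTS measure, assuming the matched source/target pairs have been collected as (gold label, predicted label) tuples (illustrative names, not the script's internals):

    def rts_accuracy(matched_pairs):
        """Accuracy of the belief/sentiment label, restricted to tuples
        whose source/target pair was predicted correctly."""
        if not matched_pairs:
            return 0.0
        correct = sum(1 for gold_label, predicted_label in matched_pairs
                      if gold_label == predicted_label)
        return correct / len(matched_pairs)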

Running the Diagnostics Script

usage: diagnostics.py [-s] [-b] gold_file_dir predict_file_dir

Diagnostics for DEFT belief and sentiment annotations, based only on the percentage of correct source and target pairs

positional arguments:
  gold_file_dir            gold belief and sentiment XML directory
  predict_file_dir         predicted belief and sentiment XML directory

optional arguments:
  -s, --sentiment-only  diagnose only sentiment annotations
  -b, --belief-only     diagnose only belief annotations
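
For example, to diagnose only sentiment annotations:

    $ python diagnostics.py -s gold_path predict_path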