Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations
Pantel & Pennacchiotti, 2006
"Literally Espresso" project of the "Formal Semantics" course by Prof. Dr. Annette Frank at the Institute of Computational Linguistics, University of Heidelberg.
This project aims to extract new, reliable semantic relations from Wikipedia using the Espresso bootstrapping algorithm.
Dorothea Hoff & Maximilian Müller-Eberstein
Only the Python 3 standard libraries are required.
main class implementing the Espresso bootstrapping algorithm
container class for patterns and their instances
textual pattern for semantic relation extraction
cooccurence of two entities with a specific relational pattern
helper class for corpus data processing
CLI for running espresso
CLI for manual evaluation and espresso tasting
extracts documents and corresponding metadata from the TC-corpus
extracts a subcorpus from the full dataset
Extended documentation can be found in the code's doctrings.
- Pattern Induction
To specify the types of semntic relations to be extracted from the corpus, a few seed patterns and instances are provided. The reliability score of these is set to 1.
Using the seed-instances, new patterns are bootstrapped from the corpus. Should the two entities in an instance be found in a plausible structure close to each other, this inbetween structure is converted into a more general regular expression and stored as a new candidate pattern.
- Instance Acquisition
The next step involves the bootstrapping of new instances using patterns currently available to the algorithm. if two entities are found according to the pattern, they are stored as a new candidate instance.
Pattern- and Instance Ranking
To find the most reliable patterns, they are pruned according to their ranking. The reliability scores of instances and patterns are strongly correlated as it is calculated using a combination of the scores of the patterns used to bootstrap the candidate instances and vice versa. This follows the intuition that reliable patterns bootstrap reliable instances and that reliable instances are associated with many different reliable patterns.
To account for rare observations a discounting factor is also used. We furthermore altered the original algorithm to normalize the PMI-value to [0, 1] as to improve computational stability. Overall, the components making up the score are:
- rpi (pattern reliability) = (sum(instance_reliabilities * PMI with instances)/max(PMI))/count(extracted_instances)
- rt (instance reliability) = (sum(pattern_reliabilities * PMI with patterns)/max(PMI)/count(associated_patterns)
- PMI(pattern, instance) = log(count(pattern, instance)/(count(pattern) * count(instance))) * discount
- discount = ( count(pattern, instance) / (count(pattern, instance) + 1) ) * ( min(count(pattern, instance)) / (min(count(pattern, instance)) + 1) )
After ranking patterns by their reliability, their are pruned by keeping only the best k.
- Iteration Evaluation
The algorithm terminates when either no more than 5 new patterns are extracted or if the average reliability score of all patterns does not increase by more than 50%.
A more detailed description of the algorithm and the experiments can be found in
The corpus used for these experiments is the NYU's Wikipedia Corpus.
The semantic relations to be examined are 'cause of death' and 'notoriety'. Seeds used included the following:
'known\/VBN\ for' 'Christie/NNP', 'detective/NN novels/NN' 'Chaucer/NNP', 'metrical/JJ innovation/NN' 'Tolkien/NNP', 'The/DT Hobbit/NN' [...] 'appeared\/VBD\ as/IN\' 'Suchet/NNP', 'Inspector/NNP Japp/NNP' 'Spacey/NNP', 'Frank/NNP Underwood/NNP' 'Hodgman/NNP', 'The/DT Deranged/JJ Millionaire/NN' [...] 'died\/VBD\ of\/IN\' 'Chrysippus/NNP', 'laughter/NN' 'Houdini/NNP', 'peritonitis/NN' 'Whitfield/NNP', 'non-Hodgkin/JJ lymphoma/NN' [...] 'murdered\/VBN\ by\/IN\' 'Clytemnestra/NNP', 'Orestes/NNP' 'Agamemnon/NNP', 'Clytemnestra/NNP' 'Duncan/NNP', 'Macbeth/NNP'
The default parameters used in the experiments were:
- gain_threshold: 5
- improv_threshold: 1.5
- k_threshold: 42
Run the algorithm on a small sample dataset with default parameters:
$ python3 brew.py ../data/tst_corpus.txt ../data/tst_corpus_seeds.txt tst_corpus_res
To limit the number of iterations, simply add the corresponding flag:
$ python3 brew.py -iter 3 tst_corpus.txt tst_corpus_seeds.txt tst_corpus_res
To evaluate the extracted patterns directly afterwards, supply the
$ python3 brew.py -iter 3 -s tst_corpus.txt tst_corpus_seeds.txt tst_corpus_res
For help, simply run:
$ python3 brew.py -h
- Pantel, Patrick, and Marco Pennacchiotti. "Espresso: Leveraging generic patterns for automatically harvesting semantic relations." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
- Ravichandran, Deepak, and Eduard Hovy. "Learning surface text patterns for a question answering system." Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002.
- Cover, Thomas M., and Joy A. Thomas. "Entropy, relative entropy and mutual information." Elements of Information Theory (1991): 12-49.
- Pantel, Patrick, and Deepak Ravichandran. "Automatically Labeling Semantic Classes." HLT-NAACL. Vol. 4. 2004.
- Bouma, Gosse, and John Nerbonne. "Applying the espresso-algorithm to large parsed corpora." (2010).
- Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.
- Bouma, Gerlof. "Normalized (pointwise) mutual information in collocation extraction." Proceedings of GSCL (2009): 31-40.