
Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations

Pantel & Pennacchiotti, 2006

"Literally Espresso" project of the "Formal Semantics" course by Prof. Dr. Annette Frank at the Institute of Computational Linguistics, University of Heidelberg. This project aims to extract new, reliable semantic relations from Wikipedia using the Espresso bootstrapping algorithm.

Dorothea Hoff & Maximilian Müller-Eberstein

Requirements

Only the Python 3 standard library is required.

Modules

  • espresso
    main class implementing the Espresso bootstrapping algorithm
  • pattern_collection
    container class for patterns and their instances
  • pattern
    textual pattern for semantic relation extraction
  • instance
    co-occurrence of two entities with a specific relational pattern
  • corpus_reader
    helper class for corpus data processing
  • brew
    CLI for running espresso
  • sommelier
    CLI for manual evaluation and espresso tasting
  • tools
    • ext_docs
      extracts documents and corresponding metadata from the TC-corpus
    • ext_subcorpus
      extracts a subcorpus from the full dataset

Extended documentation can be found in the code's docstrings.
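
The brew CLI wires these modules together. For orientation, here is a rough sketch of the equivalent programmatic flow; every class and method name below is a hypothetical stand-in, not the documented API:

# All names below are assumptions about the internal API -- consult the
# docstrings in espresso.py and corpus_reader.py for the real interface.
from corpus_reader import CorpusReader
from espresso import Espresso

reader = CorpusReader("../data/tst_corpus.txt")               # POS-tagged corpus
algo = Espresso(reader, seeds="../data/tst_corpus_seeds.txt") # seed patterns/instances
algo.run()                                                    # bootstrap until the stopping criteria hit
algo.save("tst_corpus_res")                                   # write extracted patterns and instances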

Espresso Algorithm

  1. Pattern Induction
    To specify the types of semantic relations to be extracted from the corpus, a few seed patterns and instances are provided. Their reliability scores are initialized to 1.
  2. Pattern Acquisition
    Using the seed instances, new patterns are bootstrapped from the corpus. If the two entities of an instance are found close to each other in a plausible structure, the structure between them is converted into a more general regular expression and stored as a new candidate pattern (see the first sketch after this list).
  3. Instance Acquisition
    The next step bootstraps new instances using the patterns currently available to the algorithm. If two entities are found matching a pattern, they are stored as a new candidate instance.
  4. Pattern and Instance Ranking
    To find the most reliable patterns, patterns and instances are ranked by a reliability score. The scores of instances and patterns are strongly correlated, since each is calculated from the other: an instance's score combines the scores of the patterns used to bootstrap it, and vice versa. This follows the intuition that reliable patterns bootstrap reliable instances and that reliable instances are associated with many different reliable patterns.
    To account for rare observations, a discounting factor is also applied. We furthermore altered the original algorithm to normalize the PMI values to [0, 1] so as to improve computational stability. Overall, the components making up the score are (see the second sketch after this list):

    • r_p (pattern reliability) = ( sum(instance_reliabilities * PMI with instances) / max(PMI) ) / count(extracted_instances)
    • r_i (instance reliability) = ( sum(pattern_reliabilities * PMI with patterns) / max(PMI) ) / count(associated_patterns)
    • PMI(pattern, instance) = log( count(pattern, instance) / (count(pattern) * count(instance)) ) * discount
    • discount = ( count(pattern, instance) / (count(pattern, instance) + 1) ) * ( min(count(pattern), count(instance)) / (min(count(pattern), count(instance)) + 1) )
  5. Pattern Pruning
    After ranking the patterns by their reliability, they are pruned by keeping only the best k.

  6. Iteration Evaluation
    The algorithm terminates either when no more than 5 new patterns are extracted in an iteration or when the average reliability score of all patterns does not increase by more than 50%.
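
As a rough illustration of step 2, the span found between two entities can be widened token by token into a candidate regular expression. The sketch below is an assumption about how such a generalization could look (the actual rules live in pattern.py and may differ): closed-class function words are kept as literal anchors, while open-class words are widened to any word carrying the same POS tag.

import re

def generalize(between):
    # Sketch of step 2: widen the token span found between two entities
    # into a candidate regex pattern. Function words stay literal anchors;
    # content words are widened to any word with the same POS tag.
    parts = []
    for token in between:
        word, tag = token.rsplit("/", 1)
        if tag in ("IN", "DT", "TO", "CC"):   # keep function words literal
            parts.append(re.escape(word) + "/" + tag)
        else:                                 # generalize content words
            parts.append(r"\S+/" + tag)
    return r"\ ".join(parts)

print(generalize(["died/VBD", "of/IN"]))      # -> \S+/VBD\ of/IN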
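
The reliability formulas in step 4 translate almost directly into code. A minimal sketch, transcribing the bullets above (counts are assumed to be precomputed corpus frequencies, and max_pmi the largest PMI observed in the current iteration):

from math import log

def discounted_pmi(c_pi, c_p, c_i):
    # PMI of a pattern/instance pair, multiplied by the discount factor
    # that dampens scores derived from rare observations.
    discount = (c_pi / (c_pi + 1)) * (min(c_p, c_i) / (min(c_p, c_i) + 1))
    return log(c_pi / (c_p * c_i)) * discount

def pattern_reliability(c_p, supports, max_pmi):
    # r_p for one pattern: the average of its instances' reliabilities,
    # each weighted by the pair's PMI normalized by max_pmi. supports is
    # a list of (r_i, c_i, c_pi) triples; the instance reliability r_i
    # is computed symmetrically over an instance's associated patterns.
    weighted = sum(r_i * discounted_pmi(c_pi, c_p, c_i) / max_pmi
                   for r_i, c_i, c_pi in supports)
    return weighted / len(supports)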

A more detailed description of the algorithm and the experiments can be found in presentation/espresento.pdf.

Usage

The corpus used for these experiments is NYU's Wikipedia Corpus.

The semantic relations examined are 'cause of death' and 'notoriety'. The seeds used include the following:

'known\/VBN\ for'
  'Christie/NNP', 'detective/NN novels/NN'
  'Chaucer/NNP', 'metrical/JJ innovation/NN'
  'Tolkien/NNP', 'The/DT Hobbit/NN'
  [...]
'appeared\/VBD\ as\/IN\'
  'Suchet/NNP', 'Inspector/NNP Japp/NNP'
  'Spacey/NNP', 'Frank/NNP Underwood/NNP'
  'Hodgman/NNP', 'The/DT Deranged/JJ Millionaire/NN'
  [...]
'died\/VBD\ of\/IN\'
  'Chrysippus/NNP', 'laughter/NN'
  'Houdini/NNP', 'peritonitis/NN'
  'Whitfield/NNP', 'non-Hodgkin/JJ lymphoma/NN'
  [...]
'murdered\/VBN\ by\/IN\'
  'Clytemnestra/NNP', 'Orestes/NNP'
  'Agamemnon/NNP', 'Clytemnestra/NNP'
  'Duncan/NNP', 'Macbeth/NNP'

The default parameters used in the experiments were:

  • gain_threshold: 5
  • improv_threshold: 1.5
  • k_threshold: 42
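
These parameters map directly onto the pruning and termination logic in steps 5 and 6. A sketch under the assumption of straightforward implementations (the actual names live in espresso.py):

def prune(scored_patterns, k_threshold=42):
    # Step 5: keep only the k most reliable (pattern, reliability) pairs.
    return sorted(scored_patterns, key=lambda p: p[1], reverse=True)[:k_threshold]

def should_stop(new_patterns, avg_rel, prev_avg_rel,
                gain_threshold=5, improv_threshold=1.5):
    # Step 6: terminate once an iteration yields no more than gain_threshold
    # new patterns, or once the average pattern reliability does not grow
    # beyond improv_threshold times the previous iteration's average.
    return (new_patterns <= gain_threshold
            or avg_rel <= improv_threshold * prev_avg_rel)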

Run the algorithm on a small sample dataset with default parameters:

$ python3 brew.py ../data/tst_corpus.txt ../data/tst_corpus_seeds.txt tst_corpus_res

To limit the number of iterations, simply add the corresponding flag:

$ python3 brew.py -iter 3 tst_corpus.txt tst_corpus_seeds.txt tst_corpus_res

To evaluate the extracted patterns directly afterwards, supply the -s flag:

$ python3 brew.py -iter 3 -s tst_corpus.txt tst_corpus_seeds.txt tst_corpus_res

For help, simply run:

$ python3 brew.py -h

References

  1. Pantel, Patrick, and Marco Pennacchiotti. "Espresso: Leveraging generic patterns for automatically harvesting semantic relations." Proceedings of the 21st International Conference on Computational Linguistics and the 44th annual meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2006.
  2. Ravichandran, Deepak, and Eduard Hovy. "Learning surface text patterns for a question answering system." Proceedings of the 40th Annual Meeting on Association for Computational Linguistics. Association for Computational Linguistics, 2002.
  3. Cover, Thomas M., and Joy A. Thomas. "Entropy, relative entropy and mutual information." Elements of Information Theory (1991): 12-49.
  4. Pantel, Patrick, and Deepak Ravichandran. "Automatically Labeling Semantic Classes." HLT-NAACL. Vol. 4. 2004.
  5. Bouma, Gosse, and John Nerbonne. "Applying the espresso-algorithm to large parsed corpora." (2010).
  6. Hearst, Marti A. "Automatic acquisition of hyponyms from large text corpora." Proceedings of the 14th conference on Computational linguistics-Volume 2. Association for Computational Linguistics, 1992.
  7. Bouma, Gerlof. "Normalized (pointwise) mutual information in collocation extraction." Proceedings of GSCL (2009): 31-40.

Happy brewing!