Wiktionary Idiom Classifier & Detector

Grace Muzny and Luke Zettlemoyer. Automatic Idiom Identification in Wiktionary. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

The Classifier

The Detector

Building

The project includes an Ant build file. From the WiktionaryIdioms directory, run ant or ant dist to build the distribution jar.

To build the runnable jars for classifier.experiments.RunClassifierExperimentFromFiles and detector.experiments.RunDetectorExperimentsFromFiles, run ant runnables. The jars are written to your WiktionaryIdioms/dist/ directory.

Alternatively, the jar files are all available in the Downloads section.

Running

Most of the classes with a main method come with descriptions of all the parameters they expect. The two key classes, and how to run them, are described below.

All main classes will work with 4g of memory allocated (-Xmx4g). Most require at least this much memory.

RunClassifierExperimentFromFiles

This experiment takes two arguments from the command line:

  • <type> : The type of classifier experiment to run. Choose from "basic", "grid", "compare", or "comparegroups".

  • <config_path> : The path to config/classifierconfig.xml. Different fields are read from the config file depending on the type of experiment you are running.

Example:

$ java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar basic ./config/classifierconfig.xml
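
Because <type> must be one of four fixed strings, a small wrapper can catch a typo before the JVM starts. The following is a minimal sketch and not part of the project: the function name classifier_cmd is hypothetical, the default config path is simply the one used in the example above, and the function only prints the command instead of running it.

```shell
# Hypothetical helper (not part of WiktionaryIdioms): checks <type> against
# the four values listed above and prints the matching java command.
classifier_cmd() {
  exp_type="$1"
  config_path="${2:-./config/classifierconfig.xml}"
  case "$exp_type" in
    basic|grid|compare|comparegroups) ;;
    *)
      echo "unknown classifier experiment type: $exp_type" >&2
      return 1
      ;;
  esac
  # Printed rather than executed, so the sketch runs without the jar present.
  echo "java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar $exp_type $config_path"
}

classifier_cmd grid ./config/classifierconfig.xml
```

To launch the experiment instead of printing the command, replace the final echo line inside the function with the java invocation itself.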

RunDetectorExperimentFromFiles

This experiment takes four (optionally five) arguments from the command line:

  • <detector_experiment> : The type of detector experiment to run. Choose from "dummy" or "goldenlabels". "goldenlabels" experiments do not currently work without MySQL database support. The "dummy" experiment is most likely what you want: it applies the <detector_method> to disambiguate the senses, then runs the classifier model at <classifier_model_path> on the disambiguated senses.

  • <detector_method> : The method to disambiguate with. Choose from "baseline", "baselinefirst", "baselinerandom", "lesk", or "elesk" (requires access to WordNet).

  • <config_path> : The path to config/nodbconfig.xml. Different fields are read from the config file depending on the type of experiment you are running.

  • <classifier_model_path> : The path to the classifier model that you wish to use. This file should be produced by RunClassifierExperimentFromFiles.

Example:

$ java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy lesk ./config/nodbconfig.xml ./models-dir/basicperceptron.model 
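
The four documented arguments can be checked the same way before launching the detector. This is a hedged sketch only: detector_cmd is a hypothetical name, the accepted values are copied from the list above, and the optional fifth argument is ignored because it is not described here. The function prints the command rather than running it.

```shell
# Hypothetical helper (not part of WiktionaryIdioms): validates the experiment
# and method names listed above, then prints the matching java command.
detector_cmd() {
  if [ "$#" -lt 4 ]; then
    echo "usage: detector_cmd <detector_experiment> <detector_method> <config_path> <classifier_model_path>" >&2
    return 2
  fi
  case "$1" in
    dummy|goldenlabels) ;;
    *) echo "unknown detector experiment: $1" >&2; return 1 ;;
  esac
  case "$2" in
    baseline|baselinefirst|baselinerandom|lesk|elesk) ;;
    *) echo "unknown detector method: $2" >&2; return 1 ;;
  esac
  # Printed rather than executed, so the sketch runs without the jar or model.
  echo "java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar $1 $2 $3 $4"
}

detector_cmd dummy lesk ./config/nodbconfig.xml ./models-dir/basicperceptron.model
```

The model path passed as the fourth argument should point at a file produced by an earlier RunClassifierExperimentFromFiles run.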

Data

MySQL

This work was originally conducted using a series of MySQL databases.

Eclipse

To work on the project in Eclipse, simply download and import the project WiktionaryIdioms into it.