Wiktionary Idiom Classifier & Detector

Grace Muzny and Luke Zettlemoyer. Automatic Idiom Identification in Wiktionary. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.

The above paper describes the design of the classifier and detector, how they interact, and the experimental results.

If you use the code or data released here, you should cite the above paper.

Building

The project ships with an Ant build file. From the WiktionaryIdioms directory, run ant or ant dist to build the distribution jar.

To build the runnable jars for classifier.experiments.RunClassifierExperimentFromFiles and detector.experiments.RunDetectorExperimentsFromFiles, run ant runnables. The jars are written to your WiktionaryIdioms/dist/ directory.

Alternatively, the jar files are all available in the Downloads section.

Running

Most classes with a main method document the parameters they expect. The two key classes and how to run them are described below.

All main classes will run with 4g of memory allocated (-Xmx4g); most require at least that much.

RunClassifierExperimentFromFiles

This experiment takes two arguments from the command line:

  • <type> : The type of classifier experiment to run. Choose from "basic", "grid", "compare", or "comparegroups".

  • <config_path> : The path to classifierconfig.xml. Different fields are read from the config file depending on the type of experiment you are running. Descriptions of the fields are located in config/classifierconfig.xml.

Example:

$ java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar basic ./config/classifierconfig.xml
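
The experiment type is easy to mistype, so it can help to check it before starting the JVM. The following is a minimal sketch of such a wrapper and is not part of the released code: the function names is_classifier_type and run_classifier are hypothetical, and the jar and config paths are the ones from the example above.

```shell
# Hypothetical helper (not part of WiktionaryIdioms): succeed only if $1 is
# one of the experiment types listed in this README.
is_classifier_type() {
  case "$1" in
    basic|grid|compare|comparegroups) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical wrapper: validate the type, then launch the experiment.
# $1 = experiment type, $2 = config path (defaults to the path used above).
run_classifier() {
  if ! is_classifier_type "$1"; then
    echo "unknown classifier experiment type: $1" >&2
    return 2
  fi
  java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar \
    "$1" "${2:-./config/classifierconfig.xml}"
}
```

With these functions defined, run_classifier grid would start a "grid" experiment with the default config path, while a misspelled type such as run_classifier gird fails immediately with exit status 2 instead of launching Java.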

RunDetectorExperimentFromFiles

This experiment takes four (optionally five) arguments from the command line:

  • <detector_experiment> : The type of detector experiment to run. Choose from "dummy" or "goldenlabels". "goldenlabels" experiments do not currently work without MySQL database support. The "dummy" experiment is most likely what you want. It applies the <detector_method> to disambiguate the senses, then uses the classifier model at <classifier_model_path> on the disambiguated senses.

  • <detector_method> : The method to disambiguate with. Choose from "baseline", "baselinefirst", "baselinerandom", "lesk", or "elesk" (needs to have access to WordNet).

  • <config_path> : The path to nodbconfig.xml. Different fields are read from the config file depending on the type of experiment you are running. Descriptions of the fields are located in config/nodbconfig.xml.

  • <classifier_model_path> : The path to the classifier model that you wish to use. This file should be produced by RunClassifierExperimentFromFiles.

  • <number_of_times> : (optional) The number of times to run this experiment (default is 1). Useful if you are using the "baselinerandom" detector method.

Example:

$ java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy lesk ./config/nodbconfig.xml ./models/filename.model
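
The example above omits the optional fifth argument. As a sketch of how the full argument list fits together, the hypothetical function below (detector_command is not part of the released code) checks the detector method against the names listed above and prints the resulting command line for a "dummy" experiment, appending <number_of_times> only when one is given. The config and model paths are simply those from the example.

```shell
# Hypothetical helper (not part of WiktionaryIdioms): print the command line
# for a "dummy" detector experiment.
# $1 = detector method, $2 = optional number of times to run.
detector_command() {
  method="$1"
  times="${2:-}"
  case "$method" in
    baseline|baselinefirst|baselinerandom|lesk|elesk) ;;
    *) echo "unknown detector method: $method" >&2; return 2 ;;
  esac
  cmd="java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy $method ./config/nodbconfig.xml ./models/filename.model"
  # Append the optional fifth argument only when the caller supplied one.
  if [ -n "$times" ]; then
    cmd="$cmd $times"
  fi
  echo "$cmd"
}

# Ten runs of the randomized baseline:
detector_command baselinerandom 10
# prints: java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy baselinerandom ./config/nodbconfig.xml ./models/filename.model 10
```

Printing the command rather than executing it keeps the sketch safe to try before the jars and model exist; the printed line can be copied and run as-is.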

Data

The data is available in the Downloads section. Statistics on it are described in the paper referenced at the top of this file.

The data download holds the following files:

  • allsenses.txt - Holds all sense data gathered via JWKTL from the November 13th, 2012 English Wiktionary dump, along with computed values for the features described in classifier.features.numeric.Feature.

  • [test|dev]_unannotated.txt - The same as allsenses.txt, but only for the senses in the test or development data set.

  • [test|dev]_annotated.txt - The same as [test|dev]_unannotated.txt, but with annotated labels. All data were annotated in accordance with the AnnotationGuidlines.pdf file that is also in the data download zip.

  • [test|dev]_[un]annotated_nofeatures.txt - The same as the corresponding files, but with no computed feature values.
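
The bracket patterns above expand to eight split-specific files plus allsenses.txt. The sketch below spells that expansion out and checks a directory for the resulting names. It rests on two assumptions: that the file names follow the patterns literally, and that the files sit directly inside one directory (data by default, as in the reproduction steps). Both function names are hypothetical.

```shell
# Hypothetical helper: print the data file names implied by the patterns above.
expected_data_files() {
  echo "allsenses.txt"
  for split in test dev; do
    for label in unannotated annotated; do
      echo "${split}_${label}.txt"
      echo "${split}_${label}_nofeatures.txt"
    done
  done
}

# Hypothetical helper: report every expected file missing from a directory.
# $1 = directory to check (defaults to "data"). Returns 1 if anything is missing.
check_data_dir() {
  dir="${1:-data}"
  missing=0
  for f in $(expected_data_files); do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f"
      missing=1
    fi
  done
  return "$missing"
}
```

Running check_data_dir data after unpacking the download lists any file that did not end up where expected. If the archive uses different names or a nested layout, adjust expected_data_files accordingly.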

Reproducing Paper Results

The configuration files are set up such that all you have to do to reproduce the paper results for the Annotated Lexical+Graph classifier is run the following commands from the WiktionaryIdioms directory:

$ mkdir data   # move the data sets you get from the download into this directory
$ ant runnables
$ mkdir models
$ java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar basic ./config/classifierconfig.xml  # will produce two files - "filename" and "./models/filename.model"
$ java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy lesk ./config/nodbconfig.xml ./models/filename.model
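
The detector step depends on the model written by the classifier step, so a missing jar or model only shows up as a failure inside the JVM. A minimal sketch of a fail-fast wrapper around the two java commands above follows. It assumes ant runnables has already been run and the data is in place; require_file and reproduce are hypothetical names, not part of the released code.

```shell
# Hypothetical helper: fail with a message if a required file is missing.
require_file() {
  if [ ! -f "$1" ]; then
    echo "required file not found: $1" >&2
    return 1
  fi
}

# Hypothetical wrapper around the two java commands above. It stops at the
# first missing jar or model instead of failing later inside the JVM.
reproduce() {
  require_file dist/RunClassifierExperimentFromFiles-1.0.jar || return 1
  require_file dist/RunDetectorExperimentFromFiles-1.0.jar || return 1
  mkdir -p models
  java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar \
    basic ./config/classifierconfig.xml || return 1
  # The classifier run is expected to have written this model file.
  require_file ./models/filename.model || return 1
  java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar \
    dummy lesk ./config/nodbconfig.xml ./models/filename.model
}
```

Calling reproduce from the WiktionaryIdioms directory runs both experiments in order and returns a non-zero status as soon as a prerequisite is missing.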

MySQL

This work was originally conducted using a series of MySQL databases. Instructions for setting up the databases are coming soon!

Eclipse

To work on the project in Eclipse, simply download and import the project into your workspace.