Wiktionary Idiom Classifier & Detector
Grace Muzny and Luke Zettlemoyer. Automatic Idiom Identification in Wiktionary. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), 2013.
An Ant build file is included. From the WiktionaryIdioms directory, run ant dist to build the distribution jar.
If you would like the runnable jars corresponding to
detector.experiments.RunDetectorExperimentsFromFiles, these can be built by running the command
ant runnables, and will be made in your
Alternatively, the jar files are all available in the Downloads section.
Most of the classes with main methods come with descriptions of all the parameters that they need to be passed. Here is a description of two key classes and how to run them.
All main classes will work with 4 GB of memory allocated (-Xmx4g). Most require at least this much memory.
This experiment takes two arguments from the command line:
<type> : The type of classifier experiment to run. Choose from "basic", "grid", "compare", or "comparegroups".
<config_path> : The path to classifierconfig.xml. Depending on the type of experiment you are running, different fields will be drawn from the config file. Descriptions of fields are located in
$ java -Xmx4g -jar dist/RunClassifierExperimentFromFiles-1.0.jar basic ./config/classifierconfig.xml
This experiment takes four (optionally five) arguments from the command line:
<detector_experiment> : The type of detector experiment to run. Choose from "dummy" or "goldenlabels". "goldenlabels" experiments will not currently work without MySQL database support. The "dummy" experiment is most likely what you want. It simply applies the <detector_method> to disambiguate the senses, then uses the classifier model at <classifier_model_path> on the disambiguated senses.
<detector_method> : The method to disambiguate with. Choose from "baseline", "baselinefirst", "baselinerandom", "lesk", or "elesk" (needs to have access to WordNet).
<config_path> : The path to nodbconfig.xml. Depending on the type of experiment you are running, different fields will be drawn from the config file. Descriptions of fields are located in
<classifier_model_path> : The path to the classifier model that you wish to use. This file should be produced by RunClassifierExperimentFromFiles.
<number_of_times> : (optional) A number of times that you would like to run this experiment (default is 1). Useful if you are using the "baselinerandom" detector method.
$ java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar dummy lesk ./config/nodbconfig.xml ./models-dir/basicperceptron.model
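The optional fifth argument is easy to forget, so here is a minimal dry-run sketch of how the full command line is assembled. The detector_cmd function is a hypothetical helper, not part of the project: it only checks the first two arguments against the choices listed above and prints the command, without running Java. The jar path is the one used in the example above.

```shell
#!/bin/sh
# Hypothetical dry-run helper (not shipped with the project): validates the
# detector arguments described above and prints the resulting command line.
detector_cmd() {
    experiment=$1
    method=$2
    config=$3
    model=$4
    times=${5:-1}   # <number_of_times> is optional; the documented default is 1

    case $experiment in
        dummy|goldenlabels) ;;
        *) echo "unknown detector experiment: $experiment" >&2; return 1 ;;
    esac
    case $method in
        baseline|baselinefirst|baselinerandom|lesk|elesk) ;;
        *) echo "unknown detector method: $method" >&2; return 1 ;;
    esac

    echo "java -Xmx4g -jar dist/RunDetectorExperimentFromFiles-1.0.jar $experiment $method $config $model $times"
}

# Example: print the command for ten runs of the random baseline.
detector_cmd dummy baselinerandom ./config/nodbconfig.xml ./models-dir/basicperceptron.model 10
```

Copy the printed line into a terminal to run the experiment once the jar has been built.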
The data is located in the downloads section. Statistics on it are described in the paper referenced at the top of this file.
The data download holds the following files:
allsenses.txt - Holds all sense data gathered via JWKTL from the November 12th, 2012 English Wiktionary dump and computed values for the features described in classifier.features.numeric.Feature.
[test|dev]_unannotated.txt - The same as allsenses.txt, but only for the senses in the test or development data set.
[test|dev]_annotated.txt - The same as [test|dev]_unannotated.txt, but with annotated labels. All data were annotated in accordance with the AnnotationGuidlines.pdf file that is also in the data download zip.
[test|dev]_[un]annotated_nofeatures.txt - The same as the corresponding files, but with no computed feature values.
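The bracketed patterns above expand to eight text files plus allsenses.txt. As a quick sanity check after unzipping, the sketch below spells out those names and reports which ones are present. The expected_data_files function and the DATA_DIR variable (defaulting to ./data) are assumptions made for illustration; point DATA_DIR at wherever you unpacked the download.

```shell
#!/bin/sh
# Prints the text file names that the list above describes, one per line.
expected_data_files() {
    echo "allsenses.txt"
    for split in test dev; do
        for label in unannotated annotated; do
            echo "${split}_${label}.txt"
            echo "${split}_${label}_nofeatures.txt"
        done
    done
}

# DATA_DIR is an assumed location for the unzipped data download.
DATA_DIR=${DATA_DIR:-./data}

# Report which of the expected files exist in DATA_DIR.
expected_data_files | while read -r name; do
    if [ -f "$DATA_DIR/$name" ]; then
        echo "found    $name"
    else
        echo "missing  $name"
    fi
done
```

Run it as, for example, DATA_DIR=/path/to/unzipped/data sh check_data.sh, where check_data.sh is whatever name you saved the script under.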
This work was originally conducted using a series of MySQL databases. Instructions on the setup of databases are coming soon!
To work on the project in Eclipse, simply download and import the project into your workspace.