Codeine implements models for retrieving Java methods using English queries. The implemented models are a term-matching model, a Polylingual Latent Dirichlet Allocation (PLDA) model, and IBM Model 1.


The directory data/jel contains machine-readable XML documentation of several packages of the Java standard library.

The directory data/extract contains data that has been extracted from the XML files, pre-processed, and split into train/validation/test sets. These files are in plain text format. Refer to the README.rst file in data/extract for details.


The directory bin contains script files for running each model.

In order to run an experiment, type the following commands in the console, from the codeine directory:


Replace "MODEL_NAME" with the name of the model. Three models are implemented:

  • term-matching: baseline model
  • plda: Polylingual Latent Dirichlet Allocation
  • ibm: IBM model 1

(Note: to run the experiments for the plda model, the "Mallet" folder needs to be located in the same parent directory as codeine, e.g. /user/home/codeine and /user/home/mallet.)

After each successful run, an output folder is created and named after the model and its parameters. The output folder contains the generated intermediate files, while the corresponding Mean Reciprocal Rank scores can be found in the directory data/run.
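
For readers unfamiliar with the metric, the following is a minimal sketch of how Mean Reciprocal Rank is conventionally computed. It is not taken from the Codeine source: the function name, the argument names, and the input layout (one ranked list of candidate methods per query, plus the single correct method for that query) are assumptions made for illustration.

```python
def mean_reciprocal_rank(ranked_lists, targets):
    """Average of 1/rank of the correct item over all queries.

    ranked_lists: one list of candidates per query, best candidate first
                  (hypothetical layout, not Codeine's file format).
    targets:      the correct candidate for each query, in the same order.
    A query whose target is missing from its list contributes 0.
    """
    if not ranked_lists:
        return 0.0
    total = 0.0
    for candidates, target in zip(ranked_lists, targets):
        for rank, candidate in enumerate(candidates, start=1):
            if candidate == target:
                total += 1.0 / rank
                break
    return total / len(ranked_lists)


# Illustrative data only: three queries with made-up candidate names.
rankings = [["a", "b"], ["x", "y", "z"], ["p"]]
correct = ["a", "z", "q"]
score = mean_reciprocal_rank(rankings, correct)
```

In this toy example the correct method is ranked first for the first query, third for the second, and is absent for the third, so the score is (1 + 1/3 + 0) / 3.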


  • Huijing Deng and Grzegorz Chrupała. 2014. Semantic approaches to software components retrieval using English queries. To appear in LREC 2014.