Author: Grzegorz Chrupała
Date: 2013-03-04

This project contains the code used to carry out the experiments described in [EACL_2012]. You may also want to obtain the Ladybug dataset on which the experiments were run.


The package consists of the following executables:

  1. ladybug-features - Feature extractor for bug reports.
  2. ladybug-run - Learning and prediction of labels for bug reports.

You need the Vowpal Wabbit machine learning toolkit to use ladybug. You can compile and install it from source, or, on Ubuntu or Debian, install the vowpal-wabbit package. Either way, make sure the Vowpal Wabbit executable (called vw) is installed somewhere in your PATH.

To compile ladybug-features and ladybug-run, first install the Haskell Platform. Once you have it, run the following:

cabal update
cabal install --prefix=DIRECTORY

Replace DIRECTORY with the directory where you want to install the executable. Make sure ladybug-features and ladybug-run are in your PATH.


ladybug works by interleaving learning and prediction. You can run it on a series of labeled bug reports, and for each report it outputs the predicted labels, given only the labels seen earlier in the data stream.
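The predict-then-learn loop can be illustrated with a minimal Python sketch. It is not ladybug's implementation: the frequency counter is a toy stand-in for the real learner, and the function name, the (text, label) pair format, and the sample reports are all assumptions made for illustration.

```python
from collections import Counter

def progressive_predictions(stream):
    """Yield (ranking, gold) for each report: predict first, then learn.

    `stream` is an iterable of (text, label) pairs. The counter below is a
    toy stand-in for the real learner, not ladybug's model.
    """
    seen = Counter()
    for _text, gold in stream:
        # Predict using only the labels observed in earlier reports.
        ranking = [label for label, _count in seen.most_common()]
        yield ranking, gold
        # Only now is the gold label revealed to the model.
        seen[gold] += 1

# Hypothetical sample stream.
reports = [
    ("crash on start", "UI"),
    ("slow render", "GPU"),
    ("menu glitch", "UI"),
    ("tab freeze", "UI"),
]
rankings = [ranking for ranking, _gold in progressive_predictions(reports)]
```

The point of the ordering is that every prediction is made before the model sees the corresponding gold label, so each report serves as test data first and as training data afterwards.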

Pass the data stream on standard input to ladybug-features and then pipe the output to ladybug-run:

cat data/example.json \
    | ladybug-features areas \
    | ladybug-run progressive 26 model > output


Each line of the output corresponds to one input bug report.

On this line the possible labels are separated by commas and ranked from most probable to least probable.

There will be as many labels on each line as the model has seen in the input up to the previous report, plus one "dummy" label which the model is initialized with. The dummy label is output as "1".
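To consume this output, a small parser along the following lines may be enough. It is a sketch under assumptions: that labels contain no commas, and that a genuine label is never literally "1" (otherwise it would be dropped together with the dummy). The function names and the sample lines are hypothetical.

```python
DUMMY_LABEL = "1"  # the placeholder label the model is initialized with

def parse_ranking(line, drop_dummy=True):
    """Split one output line into labels, most probable first."""
    labels = [label for label in line.strip().split(",") if label]
    if drop_dummy:
        labels = [label for label in labels if label != DUMMY_LABEL]
    return labels

def rank_of(gold, line):
    """Return the 1-based rank of `gold` on this line, or None if absent."""
    ranking = parse_ranking(line)
    return ranking.index(gold) + 1 if gold in ranking else None

# Hypothetical output line, made up for illustration.
example = "UI,1,GPU\n"
top_labels = parse_ranking(example)
```

Comparing `rank_of` against the gold label of each report gives per-report ranks from which accuracy or rank-based scores can be computed.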


Overview of options to control the programs.


The first argument to ladybug-features controls which field of the bug reports is used as a label. There are three main cases:

  • areas: This option will look for tags prefixed with Area- in the "labels" field of a bug report and use the concatenation of these areas as an atomic label. This option is useful for the Chromium dataset.
  • component: This option will look for tags prefixed with Component- in the "labels" field of a bug report and use all of them as labels.
  • assignedTo: This option will use the value of the field "assignedTo" as a label. If the value of this field starts with "nobody" or "all-bugs-test", the bug report will be treated as unlabeled.
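The three cases can be sketched in Python as follows. This only mirrors the description above and is not the code of ladybug-features: the report is assumed to be a parsed JSON object whose "labels" field is a list of strings, the "+" separator used to concatenate areas is a hypothetical choice, and whether the real tool keeps the Area- and Component- prefixes is not stated here.

```python
def extract_labels(report, mode):
    """Return the list of labels for one bug report; [] means unlabeled."""
    tags = report.get("labels", [])
    if mode == "areas":
        areas = [tag for tag in tags if tag.startswith("Area-")]
        # All areas are joined into a single atomic label.
        # The "+" separator is an assumption for illustration.
        return ["+".join(areas)] if areas else []
    if mode == "component":
        # Every Component- tag is kept as a separate label.
        return [tag for tag in tags if tag.startswith("Component-")]
    if mode == "assignedTo":
        owner = report.get("assignedTo", "")
        # Placeholder owners mean the report has no usable label.
        if not owner or owner.startswith(("nobody", "all-bugs-test")):
            return []
        return [owner]
    raise ValueError("unknown mode: " + mode)

# Hypothetical report, made up for illustration.
report = {
    "labels": ["Area-UI", "Area-Internals", "Component-Tabs", "Pri-2"],
    "assignedTo": "nobody@example.org",
}
```

With this report, the areas mode yields one combined label, the component mode yields one label per Component- tag, and the assignedTo mode treats the report as unlabeled.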

For example, if you are working with the Chromium dataset and would like to use the "Area-" tags as labels, you'd run this pipeline:

cat INPUT | ladybug-features areas \
   | ladybug-run progressive 26 model > OUTPUT

Whereas if you wanted to use the assignedTo field as labels for the same dataset, you'd use:

cat INPUT | ladybug-features assignedTo \
   | ladybug-run progressive 26 model > OUTPUT


ladybug-run can run in progressive mode, which interleaves learning and prediction, or in pure prediction mode, where it simply uses a previously learned model to predict labels for new data. The mode is controlled by the first argument to ladybug-run:

ladybug-run progressive SIZE MODEL-PATH

which runs in progressive mode, with model size set to SIZE bits, and saves the model to MODEL-PATH. In contrast:

ladybug-run predict MODEL-PATH

uses the model in MODEL-PATH to predict new labels, and does not perform any learning.
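If you drive the tools from a script, the two invocations can be assembled as in the Python sketch below. The helper names are hypothetical; only the command-line forms shown above are taken from this document, and `run_pipeline` assumes both executables are in your PATH.

```python
import subprocess

def ladybug_run_args(mode, model_path, size=None):
    """Build the argument list for ladybug-run in either mode."""
    if mode == "progressive":
        if size is None:
            raise ValueError("progressive mode needs a model size in bits")
        return ["ladybug-run", "progressive", str(size), model_path]
    if mode == "predict":
        return ["ladybug-run", "predict", model_path]
    raise ValueError("unknown mode: " + mode)

def run_pipeline(input_path, label_field, run_args, output_path):
    """Equivalent of: cat INPUT | ladybug-features FIELD | ladybug-run ... > OUTPUT"""
    with open(input_path, "rb") as src, open(output_path, "wb") as dst:
        features = subprocess.Popen(
            ["ladybug-features", label_field],
            stdin=src, stdout=subprocess.PIPE)
        runner = subprocess.Popen(run_args, stdin=features.stdout, stdout=dst)
        # Close our copy so ladybug-features notices if ladybug-run exits early.
        features.stdout.close()
        runner.wait()
        features.wait()
    return runner.returncode

train_args = ladybug_run_args("progressive", "model", size=26)
predict_args = ladybug_run_args("predict", "model")
```

A call such as `run_pipeline("data/example.json", "areas", train_args, "output")` would then reproduce the progressive pipeline shown earlier.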

For optimum results, use the maximum size of the model allowed: 29 bits. You may need to set it to a lower value if you don't have enough RAM.
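As a rough guide to the RAM involved, the Python arithmetic below estimates the size of a hashed weight table. It rests on two assumptions that this document does not confirm: that SIZE is the bit width of Vowpal Wabbit's weight table, and that each entry takes 4 bytes. Actual memory use may be higher.

```python
def weight_table_bytes(bits, bytes_per_weight=4):
    """Estimated size of a table with 2**bits entries.

    bytes_per_weight=4 is an assumption (one single-precision float per
    entry); the real per-entry cost may be larger.
    """
    return (1 << bits) * bytes_per_weight

for bits in (26, 29):
    mib = weight_table_bytes(bits) / 2**20
    print(f"{bits} bits: about {mib:.0f} MiB")
```

Under these assumptions each extra bit doubles the table, so going from 26 to 29 bits multiplies the estimate by eight.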


If you want to experiment with the data I used in [EACL_2012] you can download the dataset from <>. A small data sample is also included in the data directory.

[EACL_2012] Grzegorz Chrupała. 2012. Learning from evolving data streams: online triage of bug reports. EACL.