Zhaopeng Tu

1. Preprocess

You should extract all Java files from the project and do the necessary preprocessing: remove comments and tokenize. Go to the annotation folder and run './':

./ input_dir output_dir

The input_dir is the original project folder, and output_dir is the folder that stores all tokenized code. The output files are indexed by number, and for each input file they include:
        // the un-tokenized code without comments
        // the tokenized code stored in lines
        // the tokenized code written in a single line (for training the LM)
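If you just want to see what this step produces, the sketch below is a minimal plain-Python stand-in, assuming a naive comment stripper and tokenizer; the real tool lexes Java properly via LexJava-1.0.jar, and the .code/.tokens/.line suffixes here are illustrative names only:

    import os
    import re
    import sys

    def strip_comments(source):
        # Naive removal of /* ... */ and // comments; unlike a real lexer,
        # this ignores comment-like text inside string literals.
        source = re.sub(r'/\*.*?\*/', '', source, flags=re.DOTALL)
        return re.sub(r'//[^\n]*', '', source)

    def tokenize_line(line):
        # Crude lexer: words/numbers form one token, each punctuation mark its own.
        return re.findall(r'\w+|[^\w\s]', line)

    def preprocess(input_dir, output_dir):
        os.makedirs(output_dir, exist_ok=True)
        index = 0
        for root, _, names in os.walk(input_dir):
            for name in sorted(names):
                if not name.endswith('.java'):
                    continue
                with open(os.path.join(root, name), errors='ignore') as f:
                    code = strip_comments(f.read())
                lines = [tokenize_line(l) for l in code.splitlines()]
                base = os.path.join(output_dir, str(index))
                with open(base + '.code', 'w') as f:    # un-tokenized code, no comments
                    f.write(code)
                with open(base + '.tokens', 'w') as f:  # tokenized code stored in lines
                    f.write('\n'.join(' '.join(l) for l in lines if l))
                with open(base + '.line', 'w') as f:    # all tokens on a single line
                    f.write(' '.join(t for l in lines for t in l))
                index += 1

    if __name__ == '__main__':
        preprocess(sys.argv[1], sys.argv[2])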

(If you want to handle another programming language, say C, replace the lexer (LexJava-1.0.jar) invoked at line 19 of the script:

 19                 os.system('java -jar bin/LexJava-1.0.jar %s' % output)

with a lexer for that language.)
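For instance, for C the change could be as small as pointing that line at your own lexer jar (LexC-1.0.jar below is a hypothetical name; any lexer that takes a file path in the same way works):

    os.system('java -jar bin/LexC-1.0.jar %s' % output)  # hypothetical C lexer jar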

2. Evaluation

After preprocessing, rename the output folder to "files" and put it in the "evaluation" folder.

For example, make a folder named "data" in "evaluation", so that the structure is:
 | evaluation
     | data
         | sample_project
             | files    // this contains all processed files obtained from Step1
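For instance, assuming the Step 1 output folder is named output_dir, this layout can be created as follows:

    import os
    import shutil

    os.makedirs('evaluation/data/sample_project', exist_ok=True)
    # 'output_dir' is the tokenized-output folder produced in Step 1.
    shutil.move('output_dir', 'evaluation/data/sample_project/files')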

2.1 Training

The default setting is 10-fold cross-validation for training: we split the files into 10 parts; then, for each part, we train the language model on the other 9 parts and test on the given part.
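The sketch below shows the split purely for exposition (the bundled script performs it for you):

    import random

    def ten_fold(files, k=10):
        # Shuffle once, then deal the files into k roughly equal folds.
        files = list(files)
        random.shuffle(files)
        folds = [files[i::k] for i in range(k)]
        for i in range(k):
            test = folds[i]
            train = [f for j in range(k) if j != i for f in folds[j]]
            yield train, test  # train the LM on `train`, evaluate on `test`

To start training, run: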


./bin/ data/sample_project 3

Here "3" denotes the order of n-grams trained from the "sample_project". Here we use 3-grams.

You will find all the trained files in "sample_project", whose structure is as follows:
 | evaluation
     | data
         | sample_project
             | files    // this contains all processed files obtained from Step1
             fold0.test              // the test file list for fold0
             fold0.train.3grams      // the trained language model for fold0
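For reference, a fold's model can also be rebuilt by hand with the SRILM tool in "bin" (the flags are SRILM's; 'fold0.train' is an assumed name for the concatenated training text of fold 0):

    import os
    os.system('bin/ngram_count -text fold0.train -order 3 -lm fold0.train.3grams')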

Possible Troubles:
1. If the LM tools 'ngram' and 'ngram_count' do not work on your machine, download the latest version of SRILM (the toolkit that provides them) from http://www.speech.sri.com/projects/srilm/, compile it on your own machine, and replace the bundled binaries with your compiled ones.

2.2 Test

You can see the full command list by typing "./completion".

    the necessary parameters: 
    -INPUT_FILE         the input file
    -NGRAM_FILE         the ngrams file
    -NGRAM_ORDER        the value of N (order of lm)

    the optional parameters:
    -ENTROPY            calculate the cross entropy of the test file
                        rather than providing suggestions (see the note
                        after this list)
    -TEST               test mode, no output, no debug information
    -FILES              test on multiple files; the default is a single file
    -DEBUG              output debug information
    -OUTPUT_FILE        the output file
    -BACKOFF            use the back-off technique
    -CACHE              use the cache technique 
    -CACHE_ONLY         only use the cache technique without ngrams
    -CACHE_ORDER        the maximum order of ngrams used in the cache (default: 3)
    -CACHE_DYNAMIC_LAMBDA   dynamic interpolation weight for -CACHE (H/(H+1)), default option
    -CACHE_LAMBDA       interpolation weight for -CACHE
    -WINDOW_CACHE       build the cache on a window of n tokens (default n=1000)
    -WINDOW_SIZE        the size of cache, default: 1000 tokens
    -FILE_CACHE         build the cache on a file or related files
    -SCOPE_FILE         the scope file for scope cache on CLASS or METHOD
    -RELATED_FILE       when using the cache on file scope, build the cache
                        on the related files; -FILE_DIR must be given
    -FILE_DIR           the directory that stores all files
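Note on -ENTROPY: cross entropy here is, by the usual definition, the average negative log-probability the model assigns to each token of the test file. A minimal sketch of that computation (the tool's exact bookkeeping may differ):

    import math

    def cross_entropy(token_probs):
        # token_probs: the probability P(t_i | h_i) the model assigns to each
        # token of the test file; returns bits per token (lower is better).
        return -sum(math.log2(p) for p in token_probs) / len(token_probs)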

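A hedged example invocation (the file names are assumed for illustration, and the flag/value syntax is inferred from the option list above; check the .bat scripts below for the authoritative form):

    import os
    os.system('./completion -INPUT_FILE data/sample_project/files/0 '
              '-NGRAM_FILE data/sample_project/fold0.train.3grams '
              '-NGRAM_ORDER 3 -ENTROPY -BACKOFF -CACHE')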
See "entropy.bat" and "suggestion.bat" for examples of calculating entropy and generating code suggestions, respectively.