FLUCCS Data Artifact

FLUCCS (Fault Localization Using Code and Change Metrics) is a fault localisation approach which essentially extends SBFL techniques with code and change metrics. FLUCCS's main argument is that, by including code and change metrics, fault localisation performance can be improved. This artifact, defects4j-fluccs, contains the implementation of FLUCCS as well as the dataset used to evaluate it in the accompanying paper. Data sets generated by FLUCCS consist of suspiciousness scores from existing SBFL formulas as well as code and change metric values (age, churn, and complexity).


defects4j-fluccs

defects4j-fluccs is implemented on top of defects4j, a collection of reproducible Java faults, and is used in much the same way. Because it is implemented as an extension of defects4j, the basic tasks provided by defects4j can also be used in defects4j-fluccs.

The tasks specific to defects4j-fluccs are as follows.

| Command | Description | Arguments |
|---|---|---|
| fluccs-prepare | Prepare for executing defects4j-fluccs operations | -p: project_name, -b: bug_number, -w: working_directory, -c: use_cobertura |
| fluccs-coverage | Calculate coverage per method | -p: project_name, -b: bug_number, -w: working_directory, -c: use_cobertura |
| fluccs-codeAndchange | Calculate code and change metrics per method | -p: project_name, -b: bug_number, -w: working_directory |
| fluccs-complexity | Calculate complexity per method | -p: project_name, -b: bug_number, -w: working_directory |
| fluccs-gather | Gather generated metrics and combine them to final data file | -p: project_name, -b: bug_number, -w: working_directory, -g: use_gpu, -n: use_norm, -d: del_intermediate_fs |
| fluccs-stmt_mth_pair | Generate information showing where the statement comes from (this command is called internally during the execution of fluccs-prepare) | -p: project_name, -b: bug_number, -w: working_directory, -c: use_cobertura |
| fluccs-gp | Generate evolved ranking model using Genetic Programming | -dest(dst): destination, -datadir(dd): data_directory, -resultId(i): result_id, -pairId(n): pair_id, -pairFile(pf): name_of_pair_file, -numFold(nf): the_number_of_folds, -minDepth(minD): minimum tree depth for gp, -maxDepth(maxD): maximum tree depth for gp, -initMaxDepth(initMaxD): maximum tree depth for initialization of trees, -usegpu(ug): 1 if using GPU else 0 |

To execute these FLUCCS-specific commands:

defects4j-fluccs 1 FLUCCS command [required arguments]

For basic defects4j commands:

defects4j-fluccs 0 defects4j-command [required arguments]

The FLUCCS-specific commands are described in detail in the sections that follow.
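
The two invocation forms differ only in the leading mode argument (1 for FLUCCS-specific commands, 0 for basic defects4j commands). As a minimal sketch, assuming Python 3 and a hypothetical helper named build_command that is not part of the artifact, the argument list for either form could be assembled like this:

```python
import shlex

FLUCCS_MODE = "1"     # mode argument for FLUCCS-specific commands
DEFECTS4J_MODE = "0"  # mode argument for basic defects4j commands


def build_command(command, fluccs=True, **options):
    """Return the argv list for one defects4j-fluccs invocation.

    build_command is a hypothetical helper for illustration only.
    Keyword arguments become single-dash flags, e.g. p="Lang" -> -p Lang.
    """
    mode = FLUCCS_MODE if fluccs else DEFECTS4J_MODE
    argv = ["defects4j-fluccs", mode, command]
    for flag, value in options.items():
        argv += ["-" + flag, str(value)]
    return argv


if __name__ == "__main__":
    # Mirrors the Lang 2 examples given in this README.
    checkout = build_command("checkout", fluccs=False,
                             p="Lang", v="2b", w="lang_2_b", c=1)
    prepare = build_command("fluccs-prepare",
                            p="Lang", b=2, w="lang_2_b", c=1)
    for argv in (checkout, prepare):
        print(" ".join(shlex.quote(arg) for arg in argv))
```

The resulting list can be passed to subprocess.run once defects4j-fluccs is on the PATH; the sketch itself only builds and prints the command lines.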

Supported Faults

defects4j-fluccs supports 386 of the 395 faults from the projects Lang, Time, Math, Closure, Chart, and Mockito. The following table lists the faults from these six projects that we excluded.

| Project | Number of supported faults | Excluded faults |
|---|---|---|
| Commons Lang | 63 | 23, 56 |
| Joda Time | 26 | 11 |
| Commons Math | 105 | 104 |
| Closure Compiler | 131 | 83, 105 |
| JFreeChart | 25 | 10 |
| Mockito | 36 | 16, 26 |
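
When scripting over all supported faults, the table can be encoded directly. The sketch below is illustrative only: it assumes Python 3, that the fault counts in the table exclude the excluded faults (which matches the stated total of 386), and that fault ids in each project run contiguously from 1.

```python
# Counts and excluded fault ids copied from the table above.
SUPPORTED_COUNT = {
    "Lang": 63, "Time": 26, "Math": 105,
    "Closure": 131, "Chart": 25, "Mockito": 36,
}
EXCLUDED = {
    "Lang": {23, 56}, "Time": {11}, "Math": {104},
    "Closure": {83, 105}, "Chart": {10}, "Mockito": {16, 26},
}


def total_faults(project):
    """Supported plus excluded faults for one project."""
    return SUPPORTED_COUNT[project] + len(EXCLUDED[project])


def supported_faults(project):
    """Fault ids of one project, assuming ids run from 1 to total_faults."""
    return [bug for bug in range(1, total_faults(project) + 1)
            if bug not in EXCLUDED[project]]


if __name__ == "__main__":
    for project in SUPPORTED_COUNT:
        ids = supported_faults(project)
        print(project, len(ids), "supported faults")
```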

Dependencies and Requirements

  • defects4j version >= 1.1.0
    • if you use Ubuntu 16.04 (Xenial),
      • sudo apt-add-repository ppa:justinludwig/tzdata
      • sudo apt-get update
      • sudo apt-get install tzdata-java
      • make sure that the downloaded tzdata-java is used for Java 7: /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/zi
        • if it is a symbolic link (e.g. /usr/lib/jvm/java-7-openjdk-amd64/jre/lib/zi -> /usr/share/javazi), check that the link target exists
    • these steps prevent unexpected errors caused by java.util.TimeZone
      • This instruction is directly related to using Defects4J
  • Java version 1.7 (from Defects4J) & 1.8
  • Perl version >= 5.0.10

    • For the Try::Tiny, Switch, and File::Spec modules,
      • sudo cpan (if cpan is not installed, typing cpan will start its automatic installation)
      • install Try::Tiny
      • install Switch
      • install File::Spec
    • For XML::LibXML module,
      • sudo apt-get install libxml-libxml-perl
  • python2 version >= 2.7.6 with the following packages: deap, numpy, scipy, sklearn, and pycuda (optional)

    • pip install deap
    • pip install numpy scipy
    • pip install sklearn
    • pip install pycuda (if a compatible GPU is available)
  • git version >= 2.5.0
  • git-svn version >= 2.7.4
  • svn >= 1.9.3
  • ant version >= 1.9.3
  • maven version >= 3.0
    • write export JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64 in ~/.mavenrc
  • p7zip Version >= 9.20
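
Before installing, it can help to confirm that the command-line tools listed above are reachable. The following is a minimal sketch assuming Python 3; the executable names (for example mvn for maven and 7z for p7zip) are assumptions about a typical Linux installation, and the script does not check versions or the Perl and Python modules.

```python
import shutil

# Executable names are assumptions about a typical Linux installation;
# adjust them if your system differs (e.g. 7za instead of 7z).
REQUIRED_TOOLS = ["java", "perl", "python2", "git", "svn", "ant", "mvn", "7z"]


def missing_tools(tools=REQUIRED_TOOLS):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Not found on PATH: " + ", ".join(missing))
    else:
        print("All listed tools were found on PATH.")
```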

Getting started

  1. Clone defects4j and check out commit ce90ad5a29353f57aa4e555aad0af01daba6c8f8 (compatibility with defects4j 1.2.0 has not been checked yet).

    git clone https://github.com/rjust/defects4j
    cd defects4j
    git checkout ce90ad5a29353f57aa4e555aad0af01daba6c8f8

  2. Set defects4j's path as the D4J_HOME environment variable.

    export D4J_HOME=path_to_defects4j

  3. Install defects4j (see the defects4j Installation section for instructions).

  4. Clone FLUCCS's repository under path_to_defects4j/framework/bin.

    git clone https://kasio555@bitbucket.org/teamcoinse/fluccs.git

    • As a result, the directory fluccs will be created under path_to_defects4j/framework/bin.
  5. Execute fluccs/init.sh

defects4j Installation

  1. Move to the root of the defects4j directory (path_to_defects4j) and initialize.

    cd path_to_defects4j
    ./init.sh

  2. Add defects4j's executable path to your PATH.

    export PATH=$PATH:path_to_defects4j/framework/bin

README.md under path_to_defects4j contains information about overall installation and use of defects4j.

Using defects4j-fluccs

Precomputed Data

  • To ease the burden of downloading and configuring this artifact, we have computed some of the data and have included them in the artifact: these are required in order to generate the metric data.
    • bcel/Project_Jar: contains jar files for each faulty version of the source code.
    • bcel/output: contains one 7z file per project (i.e. Project.7z). Executing fluccs/init.sh extracts these files. Each extracted project directory contains one directory per fault, holding csv files with the computed complexity metrics of the methods declared in that faulty code base. The call graph relationships between methods are also written; however, this information (the method call file) is not used yet (not completed).
    • fault_list: contains files where identifiers of faulty methods are written.

Preparing the Dataset

  1. Go to the working directory.
  2. Check out the source code version that contains the target fault.

    defects4j-fluccs 0 checkout -p Project_Name(Lang|Math|Time|Closure) -v bug_number(f(fixed)|b(buggy)) -w working_directory

    e.g. for fault Lang 2 with working directory lang_2_b:

    defects4j-fluccs 0 checkout -p Lang -v 2b -w lang_2_b -c 1

  3. Prepare for the overall defects4j-fluccs operations.

    defects4j-fluccs 1 fluccs-prepare -p project_name(Lang|Math|Time|Closure) -b bug_number -w working_directory -c use_cobertura

    e.g. for fault Lang 2 with working directory lang_2_b, using Cobertura for further processing (i.e. generating coverage):

    defects4j-fluccs 1 fluccs-prepare -p Lang -b 2 -w lang_2_b -c 1

    e.g. for fault Lang 2 with working directory lang_2_b, using Jacoco as the coverage tool:

    defects4j-fluccs 1 fluccs-prepare -p Lang -b 2 -w lang_2_b -c 0

    Using Jacoco as the coverage tool is not recommended. The -c option will be deprecated, leaving Cobertura as the sole coverage tool.
  4. Generate data for specific metrics.

    • Program Spectra: defects4j-fluccs 1 fluccs-coverage -p project_name(Lang|Math|Time|Closure) -b bug_number -w working_directory -c use_cobertura
      • Output file method_spectra.csv will be created under working_directory/output.
      • Data format: method_identifier(class_name$method_name<arguments>),s1_ep,s1_np,s1_ef,s1_nf,s2_ep,s2_np,s2_ef,s2_nf ... (s# denotes statement # of the method identified by method_identifier)
    • Age and churn: defects4j-fluccs 1 fluccs-codeAndchange -p project_name(Lang|Math|Time|Closure) -b bug_number -w working_directory
      • Output file methodAgeAndChurns.csv will be created under working_directory/output.
      • Data format: method_identifier(class_name$method_name<arguments>),churn,max_age,min_age,normalized_churn,normalized_max_age,normalized_min_age
    • Code Complexity: defects4j-fluccs 1 fluccs-complexity -p project_name(Lang|Math|Time|Closure) -b bug_number -w working_directory
      • Output file method_complexity.csv will be created under working_directory/output.
      • Data format: method_identifier(class_name$method_name<arguments>),number_of_arguments,number_of_local_variables,number_of_complied_JavaBytecode,Line_of_Code
  5. Generate the final data file by gathering the produced metrics and calculating suspiciousness scores for SBFL formulas using computed spectra data.

    defects4j-fluccs 1 fluccs-gather -p project_name(Lang|Math|Time|Closure) -b bug_number -w working_directory -g (0|1) -n (0|1) (-d)

    • This assumes that all metric files (method_spectra.csv, methodAgeAndChurns.csv, and method_complexity.csv) are located under working_directory/output.

    • To speed up the computation of suspiciousness scores from SBFL formulas, you can use CUDA with the -g flag:

    • -g 0: use CPU to calculate suspiciousness scores

    • -g 1: use GPU to calculate suspiciousness scores

    • Intermediate output files method_all.csv and Project_faultid.SP.dat will be created under working_directory/output.

      • Data format:
        • for method_all.csv: method_identifier(class_name$method_name<arguments>),spectra,age,complexity,churn; within the spectra, age, churn, and complexity parts, the values keep the order of the corresponding metric file format (e.g. for age and churn: churn, max_age, min_age)
        • for Project_faultid.SP.dat: (index for faults(list), list of (method_identifier(class_name$method_name<arguments>),spectra,age,complexity,churn))
    • If -d is given, the intermediate files method_all.csv and Project_faultid.SP.dat will be discarded.

    • e.g. to delete all intermediate files for fault Lang 1, with normalization and using the GPU: defects4j-fluccs 1 fluccs-gather -p Lang -b 1 -w working_directory -g 1 -n 1 -d

    • Final output file Project_faultid.dat, created using python module pickle, will be generated under working_directory/output (all values are normalized: between 0 and 1)

      • Data format: consists of two parts, the indices of the faulty methods (list type) and the method metric vectors. The method metric vectors are in the following data format: method_identifier(class_name$method_name<arguments>),ochiai,jaccard,gp13,wong1,wong2,wong3,tarantula,ample,RussellRao,SorensenDice,Kulczynski1,SimpleMatching,M1,RogersTanimoto,Hamming,Ochiai2,Hamann,Hamann,Kulczynski2,Sokal,M2,Goodman,Euclid,Anderberg,Zoltar,ER1a,ER1b,ER5a,ER5b,ER5c,gp02,gp03,gp19,churn,max_age,min_age,complexity
    • The RuntimeWarning "divide by zero encountered in double_scalars" is already handled while generating the final data file, Project_faultid.dat.
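
To make the spectra format and the suspiciousness computation concrete, here is a minimal sketch in Python 3. It is not the artifact's implementation: the function names are hypothetical, Ochiai is used as one representative SBFL formula, and taking the maximum statement score as the method score is an assumption about how statement-level values are aggregated. The zero-denominator guard shows one way to avoid the divide-by-zero case mentioned above.

```python
import math


def parse_spectra_line(line):
    """Split one method_spectra.csv row into (method_id, [(ep, np, ef, nf), ...]).

    The identifier ends with '>' and may contain commas in its argument
    list, so the row is split at the last '>' instead of the first comma.
    """
    method_id, sep, rest = line.strip().rpartition(">")
    if not sep:
        raise ValueError("no method identifier found in row")
    values = [float(v) for v in rest.strip(",").split(",") if v]
    if len(values) % 4 != 0:
        raise ValueError("expected four counts per statement")
    statements = [tuple(values[i:i + 4]) for i in range(0, len(values), 4)]
    return method_id + ">", statements


def ochiai(ep, n_p, ef, nf):
    """Ochiai score from the four spectrum counts of one statement."""
    denominator = math.sqrt((ef + nf) * (ef + ep))
    # Guard against statements never executed by any failing or passing test.
    return ef / denominator if denominator > 0 else 0.0


def method_score(statements):
    """Assumed aggregation: highest statement score within the method."""
    return max((ochiai(*counts) for counts in statements), default=0.0)


if __name__ == "__main__":
    # Made-up row for illustration, not taken from the dataset.
    row = "Foo$bar<int,String>,3,1,1,0,0,4,0,1"
    method_id, statements = parse_spectra_line(row)
    print(method_id, method_score(statements))
```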

Evolving Ranking Models

  • Generate a ranking model using Genetic Programming:

    defects4j-fluccs 1 fluccs-gp -dst|dest destination -dd|datadir data_directory -i|resultId result_id -n|pairId pair_id -pf|pairFile name_of_pair_file -nf|numFold the_number_of_folds -minD|minDepth minimum_tree_depth -maxD|maxDepth maximum_tree_depth -initMaxD|initMaxDepth initial_maximum_tree_depth -ug|usegpu (0|1)

    • destination : directory where output files(formula files, result files) will be stored
    • data_directory : data directory for the GP (i.e. the directory that contains the results from previous data generation step).
    • result_id : an id for the ranking model, used to tell results apart when there are several of them.
    • pair_id : FLUCCS uses 10-fold cross validation. Each fold is identified by a number between 0 and 9, called the pair id; the data with the chosen pair id is used as test data, and the data with the other pair ids is used as training data.
    • pairFile : name of the pair file to use. To generate a new pair file, omit this argument.
      • -pairFile pair.txt: use pair.txt as the pair file
      • argument omitted: generate a new pair file (named "pair.txt", under the current working directory)
    • the number of folds : k for k-fold cross validation, e.g. 10 for 10-fold cross validation
    • minDepth: minimum tree depth for tree-based gp. If this argument is omitted, the default value 1 is used.
    • maxDepth: maximum tree depth for tree-based gp. If this argument is omitted, the default value 8 is used.
    • initMaxDepth: maximum tree depth for the initialization of trees for tree-based gp. If this argument is omitted, the default value 6 is used.
    • usegpu : 1 to use the GPU, 0 otherwise
    • Output files ( under destination)
      • dest/formula: formula files (result_id.result.csv files)
      • dest/rank: ranking result files (Project_faultid.postfix.iterid.(faults|ranks).csv: i.e. Lang_1.GP.0.faults.csv)
      • faults.csv: rankings of faulty methods
      • ranks.csv: rankings of all methods (the rankings are in the same order as the corresponding method feature files)
    • e.g. to generate a ranking model from the data in directory Data, with pair file pair.txt, result id 0, 10-fold cross validation, pair (fold) id 1, the GPU enabled, minimum tree depth 8, maximum tree depth 10, initial maximum tree depth 6, and the current directory (.) as the destination: defects4j-fluccs 1 fluccs-gp -dest . -datadir Data -resultId 0 -numFold 10 -pairId 1 -pairFile pair.txt -minD 8 -maxD 10 -initMaxD 6 -ug 1
    • e.g. to generate a ranking model with the same settings as above, except that the default values are used for minD, maxD, and initMaxD and a new pair file is generated: defects4j-fluccs 1 fluccs-gp -dest . -datadir Data -resultId 0 -numFold 10 -pairId 1 -ug 1
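
The role of the pair id in cross validation can be sketched as follows, assuming Python 3. The layout of the pair file is not described here, so this sketch does not read or write pair.txt; the function names and the shuffled round-robin assignment are assumptions made for illustration only.

```python
import random


def assign_folds(fault_ids, num_fold, seed=0):
    """Shuffle the fault ids and deal them round-robin into num_fold folds.

    Returns a dict mapping each fault id to its pair (fold) id.
    """
    shuffled = list(fault_ids)
    random.Random(seed).shuffle(shuffled)
    return {fault: index % num_fold for index, fault in enumerate(shuffled)}


def split_by_pair_id(folds, pair_id):
    """Faults in the chosen fold become test data, all others training data."""
    test = sorted(f for f, fold in folds.items() if fold == pair_id)
    train = sorted(f for f, fold in folds.items() if fold != pair_id)
    return train, test


if __name__ == "__main__":
    # Hypothetical fault names following the Project_faultid pattern.
    faults = ["Lang_%d" % i for i in range(1, 21)]
    folds = assign_folds(faults, num_fold=10)
    train, test = split_by_pair_id(folds, pair_id=1)
    print(len(train), "training faults;", len(test), "test faults")
```

Running the split once per pair id from 0 to 9 covers every fault exactly once as test data, which is the usual k-fold arrangement.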

Directory structure for defects4j-fluccs

Under the defects4j executables directory (D4J_HOME/framework/bin):

|
|--- fluccs                                                 
        |--- checkout       fluccs-prepare
                    |--- Lang   Contains Lang-specific resource files 
                    |--- Math   Contains Math-specific resource files
                    |--- Time   Contains Time-specific resource files
                    |--- Closure    Contains Closure-specific resource files
                    |--- Chart  Contains Chart-specific resource files
                    |--- Mockito    Contains Mockito-specific resource files
                    |--- prepare    Contains files for preparing the FLUCCS environment
        |--- gen_stmt_mth_pair          fluccs-stmt_mth_pair
        |--- coverage               fluccs-coverage
        |--- CodeAndChangeM         fluccs-codeAndchange
        |--- complexity             fluccs-complexity
        |--- bcel               fluccs-complexity: contains precomputed complexity metric files (+ java-callgraph: Java Call Graph Utilities)
        |--- gather             fluccs-gather : gathers all generated data into a single file
        |--- sbfl_metrics           compute SBFL suspiciousness scores 
        |--- method_stmt            statement and method pair files
        |--- fault_list             faulty method files
        |--- gp                 fluccs-gp
        |--- perl               perl modules
        |--- python             python modules
        |--- modifiedD4js           modified *defects4j* files
        |--- feature                feature definition related files