Technologies: Python, SQLite

Status: Complete, No longer maintained.

This is the final project for my data mining class in my master's degree. It uses static code analysis and an implementation of Google PageRank to calculate the risk of performing a code change in a codebase.

What works

It can parse a python library, including modules and packages, and generate a PageRank friendly database that then can be used to create a rank of the risk of each file.

The objective

Create an automated mechanism to calculate a risk factor for a code change in order to guide test efforts towards those areas.

More Information

This is a project that was created as a way of integrating data mining techniques to testing. The basic idea is simple, every time we make a code change there is a probability that other parts of the code may be impacted. However it is usually difficult to calculate exactly how much impact it actually has. We do know a couple of things:

  • The file that changed has a high risk of failing
  • Any file that uses or depends on the affected code also has a risk of failing

Then, we define impact:

  • If a failure only affects the changed file, then it has low impact (isolated).
  • The more files it affects, then the higher impact it has.

From those four statements, we can deduce that a change in a file that is used by multiple files has a higher risk of creating a high impact failure than a change in an isolated file. The question is: how do you know which file is used the most?

This problem is very similar to the 'popularity' problem solved by PageRank, the more a page is 'linked', the more popular it is. In the same way, the more a code file is 'used', then the more 'dependencies' it has. So we can say that if a code file is popular, then a change is risky.

This program uses an static parser in python (implemented using Python Abstract Syntax Trees) to analyze the code in a library or program and identify includes. We say that there is a link from module A to B if module A includes file B (includes in the python sense: 'include B' or 'from B include c'). All this information is saved on a database.

The next step is to use PageRank to calculate and distribute link weights to all registered 'documents' (python modules) and figure out the popularity of each module. Once we have that information, we can identify the risk of change relative to other parts of the program.

Due to the dynamic nature of python a lot of the includes have to be inferred from names which leads to a lot of noise. For example, a common library like pySerial results in just 40% of include matches.

My participation

I did the theoretical design and implemented the static parser. I worked with a 3 person team to implement PageRank, test and document.

What is broken

Nothing that I know of, but I didn't test with a wide range of libraries.