Using SemDiff to Analyze Repositories
SemDiff is a recommendation system that helps clients adapt to API changes made to an evolving framework. SemDiff generates its recommendations by parsing and analyzing the revision history of frameworks. SemDiff allows clients to reuse its repository analysis infrastructure to browse and perform their own analyses on the revision histories of projects.
As a repository analysis framework, SemDiff allows clients to:
- Specify and connect to a CVS or SVN repository
- Download transactions (i.e., change sets or groups of co-committed source files) from the repository
- Manually browse the file pairs within transactions using Eclipse's Compare feature
- Automatically analyze file pairs within transactions using one of several detectors:
  - StructDiff (identifies higher-level structural differences between file pairs)
  - CallDiff (identifies differences in call graphs between method pairs)
  - FieldDiff (identifies differences in field references between method pairs)
  - Your Own Detector (see here for more information)
To get started with SemDiff, refer to its Getting Started page. This page provides additional information to help you use SemDiff's repository analysis framework for long-term investigations.
Use Multiple Profiles and Databases
If you plan on scanning multiple repositories, it is best to set up a new database (SemDiff -> Init Database) and a new profile for each repository. Separating repositories in this fashion will improve the performance of your SemDiff install. Here is what I do for each repository I scan:
- Set up a new DB (SemDiff -> Init Database). I keep all my DBs in one central location and create a new directory to house the DB corresponding to each repository.
- Create a new profile. I call my profile '$Project Profile', where $Project refers to the repository I'm going to scan.
- Within my profile, set up the repository.
This keeps everything nice and clean and reduces runtime overhead.
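The layout described above can be sketched in a few lines of Java. This is only an illustration of the naming and directory convention, not part of SemDiff: the class `SemDiffWorkspaceLayout`, its methods, the `semdiff-dbs` root directory, and the project names are all hypothetical. The databases themselves are still created through SemDiff -> Init Database.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

/** Hypothetical helper illustrating the one-DB-directory-per-repository convention. Not a SemDiff API. */
final class SemDiffWorkspaceLayout {

    /** Builds a profile name following the "$Project Profile" convention from the text. */
    static String profileName(String project) {
        return project + " Profile";
    }

    /** Creates (if needed) and returns dbRoot/project, the directory meant to house that project's DB. */
    static Path databaseDir(Path dbRoot, String project) throws IOException {
        Path dir = dbRoot.resolve(project);
        Files.createDirectories(dir); // no error if the directory already exists
        return dir;
    }

    public static void main(String[] args) throws IOException {
        // Assumed central location for all databases; pick any directory you like.
        Path dbRoot = Paths.get(System.getProperty("java.io.tmpdir"), "semdiff-dbs");
        for (String project : new String[] {"ProjectA", "ProjectB"}) {
            System.out.println(profileName(project) + " -> " + databaseDir(dbRoot, project));
        }
    }
}
```

Point SemDiff -> Init Database at the directory printed for each project, then create the matching profile.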
Setting Up a New Repository
Setting up a repository requires you to provide an appropriate repository connection string to SemDiff. In the case of SVN, this connection string is not the same as the SVN checkout string you would use from the command line. Instead, you only provide the actual URL of the project and any relative paths (equivalent to individual modules) you wish to analyze within that project. It can sometimes be tricky to get the right repository connection strings for a given repository.
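To make the difference between a checkout URL and the two pieces SemDiff asks for more concrete, here is a minimal sketch that splits a full SVN checkout URL into a project URL and a relative path. The class `SvnConnectionParts`, its method, and the `svn.example.org` URLs are hypothetical, and the exact format SemDiff expects for a particular repository may differ, so treat this as a starting point only.

```java
/** Hypothetical helper: derives the module-relative path from a project URL and a checkout URL. */
final class SvnConnectionParts {

    /**
     * Returns the part of checkoutUrl below projectUrl, without leading or trailing slashes.
     * Returns "" when both URLs point at the same location.
     */
    static String relativePath(String projectUrl, String checkoutUrl) {
        String root = stripTrailingSlashes(projectUrl);
        String full = stripTrailingSlashes(checkoutUrl);
        if (full.equals(root)) {
            return "";
        }
        if (!full.startsWith(root + "/")) {
            throw new IllegalArgumentException(checkoutUrl + " is not located under " + projectUrl);
        }
        return full.substring(root.length() + 1);
    }

    private static String stripTrailingSlashes(String s) {
        int end = s.length();
        while (end > 0 && s.charAt(end - 1) == '/') {
            end--;
        }
        return s.substring(0, end);
    }

    public static void main(String[] args) {
        String projectUrl = "http://svn.example.org/repos/project";             // made-up project URL
        String checkoutUrl = "http://svn.example.org/repos/project/trunk/core"; // what you would pass to svn checkout
        System.out.println(relativePath(projectUrl, checkoutUrl)); // prints: trunk/core
    }
}
```

In this made-up example you would give SemDiff the project URL, with `trunk/core` as the relative path to analyze.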
Once you are connected to a repository, you are required to generate a log file (so SemDiff has information about each transaction). You are also required to download the actual raw committed source file pairs that you would like to analyze (these file pairs are then diff'ed by individual detectors). Downloading source files can take a long time, so be patient.
Some of SemDiff's detectors use Partial Program Analysis (PPA) to infer information about transactions. PPA can be quite slow for very large source files. To optimize your SemDiff usage, you can try:
- Increasing the amount of heap space available to your Eclipse session. PPA is memory intensive. I usually run Eclipse with 6 GB of heap space just to be safe.
- Avoiding large files. You can check the lines of code (LOC) and number of calls of the files you will analyze, and set thresholds to skip the largest files. To set limits on the LOC counts for individual detectors, go to Window -> Preferences -> SemDiff -> CallDiff Detector (or FieldDiff Detector) and set the appropriate bounds.
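If you want a rough idea of which downloaded files would exceed a LOC bound before you pick one, a small stand-alone check like the following may help. It is a sketch, not SemDiff code: the class `LargeFileFilter` and its methods are hypothetical, and it counts physical lines (including blank and comment lines), which may not match how SemDiff's detectors count LOC.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Stream;

/** Hypothetical pre-check: keeps only files whose physical line count is within a bound. */
final class LargeFileFilter {

    /** Counts physical lines. ISO-8859-1 is used so that arbitrary bytes never cause a decoding error. */
    static long countLines(Path file) throws IOException {
        try (Stream<String> lines = Files.lines(file, StandardCharsets.ISO_8859_1)) {
            return lines.count();
        }
    }

    /** Returns the files with at most maxLoc lines, in their original order. */
    static List<Path> withinLimit(List<Path> files, long maxLoc) throws IOException {
        List<Path> kept = new ArrayList<>();
        for (Path file : files) {
            if (countLines(file) <= maxLoc) {
                kept.add(file);
            }
        }
        return kept;
    }

    public static void main(String[] args) throws IOException {
        if (args.length < 2) {
            System.err.println("usage: LargeFileFilter <maxLoc> <file>...");
            return;
        }
        long maxLoc = Long.parseLong(args[0]);
        List<Path> files = new ArrayList<>();
        for (int i = 1; i < args.length; i++) {
            files.add(Paths.get(args[i]));
        }
        // Print only the files that stay within the bound.
        for (Path file : withinLimit(files, maxLoc)) {
            System.out.println(file);
        }
    }
}
```

Running this over a sample of source files with a few candidate bounds gives a feel for how many files each bound would exclude. The actual limits are still set in the SemDiff preferences described above.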