WARP is an alignment-free tool for ultra-fast protein homology detection. It measures the similarity between two proteins by computing an approximate Dynamic Time Warping (DTW) score on a compressed numeric representation of each protein. It then estimates the likelihood that the two proteins are homologous using a Random Forest classifier. The corresponding scientific paper is currently under review.
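As a rough illustration of the DTW step, the sketch below computes an exact DTW distance between two short numeric profiles in plain Python. It is not WARP's code: WARP relies on an approximate DTW (the fastdtw library listed among the dependencies), whereas this example uses the classic quadratic-time algorithm. The function name dtw_distance, the absolute-difference cost and the toy profiles are assumptions made only for this example.

```python
def dtw_distance(a, b):
    """Exact DTW distance between two numeric sequences.

    Illustrative only: uses |x - y| as the local cost and the classic
    O(len(a) * len(b)) dynamic programme, not WARP's approximate DTW.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three allowed predecessor cells.
            cost[i][j] = local + min(cost[i - 1][j],      # advance in a only
                                     cost[i][j - 1],      # advance in b only
                                     cost[i - 1][j - 1])  # advance in both
    return cost[n][m]


# Toy profiles (made up): b repeats one value of a, so warping absorbs it.
profile_a = [0.1, 0.5, 0.9, 0.5]
profile_b = [0.1, 0.5, 0.5, 0.9, 0.5]
score = dtw_distance(profile_a, profile_b)  # 0.0
```

A low score means the two profiles can be warped onto each other cheaply. In WARP, a score of this kind is then passed to the Random Forest classifier rather than being interpreted directly.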

What is this repository for?

The code here is meant to show a running example of the concepts described in the paper (such as the iDCT quantization), and you are free to use it and hack it as you wish.
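To give a feel for the kind of compression the iDCT quantization refers to, here is a minimal sketch of one common way to shrink a variable-length numeric profile to a fixed length K: take a DCT, keep the K lowest-frequency coefficients, and apply an inverse DCT on a K-point grid. This is an assumption about the general idea, not a copy of the code in the sources folder; the function name dct_compress, its scaling and its parameters are hypothetical.

```python
import math


def dct_compress(signal, k):
    """Resample `signal` to length `k` with a truncated DCT-II.

    Hypothetical helper for illustration; the repository's own
    implementation may differ in scaling and details.
    """
    n = len(signal)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(signal)")
    # Forward DCT-II, keeping only the k lowest-frequency coefficients.
    coeffs = [
        sum(x * math.cos(math.pi / n * (i + 0.5) * f)
            for i, x in enumerate(signal))
        for f in range(k)
    ]
    # Inverse transform evaluated on a k-point grid. The scaling is chosen
    # so that k == len(signal) reproduces the input.
    out = []
    for m in range(k):
        value = coeffs[0] / n
        for f in range(1, k):
            value += 2.0 / n * coeffs[f] * math.cos(math.pi / k * (m + 0.5) * f)
        out.append(value)
    return out


# Compress a made-up 10-residue profile to K = 4 values.
compressed = dct_compress([0.2, 0.4, 0.6, 0.8, 1.0, 0.8, 0.6, 0.4, 0.2, 0.0], 4)
```

Because every protein ends up with a vector of the same length K, the subsequent comparison cost no longer depends on the sequence lengths.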

Please keep in mind that this is still an early version, provided as a set of scripts. Its sole purpose is to give a rough proof of concept of how WARP works, in relation to the method explained in the article. We will soon provide a comprehensive and fully usable implementation of WARP, with the goal of substantially reducing the time required for homology detection in the daily work of structural bioinformaticians. This task will anyway require a less theoretical approach and a specific infrastructure that we cannot provide at the moment.

How do I get set up?

WARP has some dependencies:

  • python 2.6 or 2.7 must be installed
  • scikit-learn python library
  • fastdtw python library
  • scipy and numpy python libraries
  • a running version of PSIPRED (it is not currently used by the WARP script in the repository)

All the python libraries can be easily installed with pip.
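Before running the scripts, it can help to check that the libraries are importable. The snippet below is a small convenience sketch, not part of the repository: the helper name missing_modules is hypothetical, and the list uses the usual import names (scikit-learn is imported as sklearn).

```python
import importlib

# Import names of the Python dependencies listed above.
REQUIRED = ["sklearn", "fastdtw", "scipy", "numpy"]


def missing_modules(names=REQUIRED):
    """Return the subset of `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    absent = missing_modules()
    if absent:
        print("Missing libraries: " + ", ".join(absent))
    else:
        print("All Python dependencies found.")
```

Any library reported as missing can then be installed with pip under its package name (scikit-learn, fastdtw, scipy, numpy).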

What does this repository contain?

  • The trainedModels folder contains the scikit-learn Random Forest model trained on the PFAM dataset from Saripella et al., 2016.
  • The sources folder contains utility Python source code and the iDCT vector quantization code.
  • The vector_builder folder contains the code of the Dynamine predictor.
  • The reproduceBenchmark folder contains some of the scripts we used to compute the results shown in the paper.

How do I reproduce the results of the paper?

The reproduceBenchmark folder contains some of the scripts we used to compute the results shown in the paper. In particular:

  • the script reproduces the results shown in Tables 3, 4 and 5. The target dataset can be changed using the variable DATASET_BENCHMARK at line 43. K can be changed using the variable LEN_FFT.
  • the script reproduces the results shown in Tables 1 and 2.
  • the script reproduces the results shown in Suppl. Table S3.
  • the script reproduces the plot shown as Suppl. Fig. S18.

All these scripts can be run directly by calling them as: "python". The parameters can be changed by modifying the source code. The scripts are sparsely commented because they were not intended for distribution; we used them for prototyping during the development of WARP. Feel free to hack them, but expect some adventures.

In order to make these scripts work, the reproduceBenchmark/ folder also contains:

  • the Dynamine and single-sequence PSIPRED in the reproduceBenchmark/data folder
  • part of the benchmark from Saripella et al., 2016 in the reproduceBenchmark/Homology_Benchmark folder
  • additional source files in reproduceBenchmark/sources

Who do I talk to?

The main page for the project is .