NADDA - Conserved Region Detection
Author: Armen Abnousi


This file will guide you through the steps to install and run the NADDA conserved region detection program.
The current release includes three modules: two for dataset generation (training and test) and one for
conserved region detection. You will need to provide a fasta file (and the Pfam output, if you want to use
your own training set) to the dataset generation modules, and then feed the generated datasets to the
prediction module to detect conserved regions. Each of these two steps is addressed separately below.

Dataset Generation: (MapReduce)
Run make to generate the profile executable. You will need the MRMPI and BOOST libraries installed.
(Set the path to the MRMPI library in the Makefile.)
There are two script files that use the profiler: one for generating training datasets and the other for
generating test datasets. You must generate your own test dataset, but you do not have to provide a training
set; if you do not, the program will use our pre-trained model. To use the provided model, unzip the
model.zip file and make sure the resulting model.defualt file is in the same directory as nadda.py.
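A typical build-and-setup session might look like the following (this assumes GNU make and unzip are available; the variable to edit in the Makefile depends on the shipped release):

```shell
# Edit the Makefile first so it points at your MRMPI installation.
make             # builds the profile executable
unzip model.zip  # extracts the bundled pre-trained model
# model.defualt and nadda.py must end up in the same directory.
```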

The train_dataset_gen.sh script is used for generating training datasets. You are asked to enter the fasta
file of training sequences and the Pfam output file for these sequences.

The test_dataset_gen.sh script is used for generating test sets. You are asked to enter the fasta file
containing the sequences that you want to run NADDA on.

Set the following parameters in the script file (train_dataset_gen.sh or test_dataset_gen.sh):
*Fasta file name (make sure no sequence name appears more than once in the file).
*Pfam output for the given file (train_dataset_gen.sh only; make sure the names in the Pfam output exactly match the names in the fasta file).
*You can also set the kmer size and the dataset file name here. By default the kmer size is set to 6.
*A default threshold of "1" is set in this script file. This is the minimum number of sequences that must
include a domain in order for that domain to be marked as a domain region (based on Pfam) in the dataset.
We suggest setting this value to a larger number (e.g. 50) when generating a training dataset.
For the test dataset, the threshold only affects the prediction scores when compared to Pfam; it does
not change the prediction results.
*The parameter sep_delta is used by the MRMPI library. Make sure its value is larger than the longest sequence in
the fasta file.
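For intuition, here is a minimal Python sketch (not the actual MapReduce profiler code) of how overlapping k-mers of the default size 6 can be extracted from a sequence:

```python
# Illustrative only: slide a window of size k over a sequence to emit
# every overlapping k-mer, as a dataset generator might.
def kmers(sequence, k=6):
    """Return all overlapping k-mers of the sequence (default k matches NADDA's)."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]

print(kmers("MKVLAETQ", 6))  # → ['MKVLAE', 'KVLAET', 'VLAETQ']
```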

*You will also need to make any changes required to submit a job to your cluster. The current files use mpiexec and are
set to use 32 processors.
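The launch line inside the scripts is along these lines (the exact arguments to profile are whatever the shipped script sets; adjust the rank count and add your scheduler's wrapper as needed):

```shell
# 32 MPI ranks by default; change -n to match your allocation.
mpiexec -n 32 ./profile <arguments set by the script>
```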

Conserved region detection: (NADDA)
You will need Python 3+ installed, along with scikit-learn. (The pickle module used for model files ships
with the Python standard library, so it needs no separate installation.)
Run the nadda.py file in the following format:

"python nadda.py -i (training dataset file) -t (test dataset file) -m (mss) -v (max features) -w (window size) -M (model file) -k (kmer size)"

The test file (-t) is the only mandatory argument.

Make sure that the kmer size is the same as the one used for generating the datasets. A default value of 6 is used here.

A default model is provided with this release. If neither a training file nor a model file is entered,
this default model will be loaded.

If a training file and a model file name are provided, the model trained on the training set will be stored
under the given model name. Otherwise, a default name derived from the training file name will be used.

If a training file is not provided but a model file name is entered, the script will try to load a model from 
the disk stored under that name.
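The load-or-train behavior described above can be sketched with scikit-learn and pickle. This is an illustrative pattern, not NADDA's actual code, and the file name below is hypothetical:

```python
import os
import pickle

from sklearn.ensemble import RandomForestClassifier

# Illustrative pattern only (not NADDA's actual code): persist a trained
# classifier with pickle so a later run can reload it instead of retraining.
MODEL_FILE = "model.example"  # hypothetical file name

def load_or_train(X, y, model_file=MODEL_FILE):
    """Load a pickled model if one exists on disk; otherwise train and save one."""
    if os.path.exists(model_file):
        with open(model_file, "rb") as f:
            return pickle.load(f)  # reuse the stored model
    clf = RandomForestClassifier(n_estimators=10, random_state=0)
    clf.fit(X, y)
    with open(model_file, "wb") as f:
        pickle.dump(clf, f)  # store the model for later runs
    return clf

# Two well-separated toy classes just to exercise the round trip.
X = [[0, 0]] * 5 + [[5, 5]] * 5
y = [0] * 5 + [1] * 5
clf = load_or_train(X, y)
print(clf.predict([[5, 5]]))
```

On a second run the same call would find the stored file and load the model instead of retraining.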

The -m (mss), -v (max features), and -w (window size) options are optional classifier parameters.
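Putting the options together, typical invocations look like this (dataset and model file names below are placeholders):

```shell
# Predict with the bundled default model (only -t is mandatory):
python nadda.py -t test.dataset

# Train on your own dataset and store the model under the given name:
python nadda.py -i train.dataset -t test.dataset -M my.model -k 6

# Reuse a previously stored model:
python nadda.py -t test.dataset -M my.model
```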

For questions and support contact aabnousi@eecs.wsu.edu