DMFS: Discriminatory motif-finding based feature selection
|Authors:||Hao Xiong <email@example.com>|
This program consists of discrete steps that are chained together into a pipeline. We will call the part that does partitioning, motif finding, scoring, and classification, the pipeline, which is also the part that does useful work. The remaining part deals with multiple runs of the pipeline. One running of the program could result in multiple runs of the pipeline with different random partitioning of data. When we say one run of the program, we mean the entire program, while one run of the pipeline is one instance of the pipeline using one random partitioning.
This program requires two sequence files. One file is called positive file, and the other negative file. For example, in the case of nucleosome occupancy, the positive file lists sequences that are known to be occupied by nucleosomes, the negative file has sequences that are not occupied.
Outline of the Pipeline
First, the program takes the positive sequence file and negative sequence file and randomly partition sequences in each file into two sets, one used for motif finding, the rest for classification. The fraction of sequences used for motif finding must be specified on the command line, as well as file names for the sequence files.
Second, a motif discovering program, WordSpy, is run and results are parsed to discover significant motifs. We define significant motifs as top ranked motifs by both Zscore and frequency, two metrics in WordSpy's output. The exact percentage by Zscore that are considered to be top must also be specified on the command line. The same goes for frequency. The intersection of two top ranked sets constitute the significant motifs.
Next, the significant motifs are written to a pattern file, with possible addition of hand-picked motifs, and the pattern file is used as input to fuzznuc/fuzzpro in addition to the remaining sequences from the first step. Fuzznuc/fuzzpro is an alignment program in EMBOSS package. The output is parsed and the number of occurrence of each significant motif for each sequence is recorded as a table in a score file. The number of maximum mismatches allowed while still be counted as an occurrence can be specified on the command line.
Finally, a classifier is run over the score file. In order to obtain AUC, 10-fold cross-validation is needed for SVM, while random forest has out-of-bag errors from which AUC can be calculated. The choice of classifier can be determined by command-line argument. The default is random forest.
The result is written to a file named "auc.txt" in the same directory as the temporary files are in (there is more explanation about output files later). The file "auc.txt" records the AUC value for each run of the pipeline.
This package has been tested only under Linux, and it is likely that only Linux systems will work. The OS-dependent part is the external programs that this package depends upon. The main obstacle is the program WordSpy, which only has binary distribution for Linux and Unix (unknown architecture/company/version), and Cygwin under Windows. However, since EMBOSS must be custom-compiled in Cygwin, which is non-trivial, Windows is a difficult route. MAC is wholly unsupported by WordSpy. It is recommended that the user use a Linux system.
The main issue in installation is the path to external programs, which must be visible to the package. Computer systems have a variable that stores a list of directories that the system will search in order, to see whether a program is in that directory. The list of directories is called search path. One solution to path problem is to install all required external programs in the same directory as the package itself and make sure the current directory is included in the search path. Otherwise, add location to search path, usually represented by environmental variable path under Windows or PATH under Linux/Unix.
This package makes use of other software packages.
This is a motif finding program, which can be found at http://cic.cs.wustl.edu/wordspy/.
This is a sequence alignment program and a part of EMBOSS package. One must make sure that the EMBOSS package is at least version 4.1 or above, as there is a bug in previous versions of fuzznuc. This program can be installed anywhere but make sure the location is known to the package by adding it to search path.
R, random forest
This program requires R, a popular statistics program, and the randomForest package written for R by Breiman and Cutler. This is not required if one uses SVM instead of random forest.
There are also some Python packages that must be installed.
This package provides computational tools useful to biologist. Here it is used for parsing purpose. It is recommended that at least version 1.57 be used.
This package has ability to deal with large, multidimensional arrays.
This package is a bridge between Python and R.
This package implements SVM in Python. This is necessary only if one intends to use SVM for the classification step.
The Python packages can be installed either by system-wide package manager under Linux, or by easy_install which is part of Python install program. To test whether they are already available, fire up Python interpreter and type import Bio for Biopython package, and import numpy for Numpy. If there is no error message, it means those packages are available. PyML will likely have to be custom-installed.
A user only needs to call parameterize.py in order to run the program. A usage help can be generated by using -h or --help option. The output will be similar to
Usage: parameterize.py [options]
-h, --help show this help message and exit -p POS_SEQ, --positive=POS_SEQ the name of positive sequence file -n NEG_SEQ, --negative=NEG_SEQ the name of negative sequence file -f FRAC, --frac=FRAC the fraction of sequences used for motif finding (a number between 0 and 1) -m WORDSPY_MAXLN, --max-len=WORDSPY_MAXLN the maximum length of motifs that wordspy finds (minimum value is 5 for DNAs) -S ZSCOREPERCENT, --top-score=ZSCOREPERCENT the fraction of top scored motifs for later consideration (a number between 0 and 1) -F FREQPERCENT, --top-freq=FREQPERCENT the fraction of top frequent motifs for later consideration (a number between 0 and 1) -t THREADS, --threads=THREADS the number of concurrent threads -r RUNS, --runs=RUNS the number of total runs (the number of random partitions of the data) -H MANUAL, --hand-picked=MANUAL a FASTA file listing hand-picked motifs; the default is no hand-picked motifs -d DEST, --dest=DEST the directory where all the intermediate results will be written; the default is the working directory -b BASE, --base=BASE a name identifying the dataset -c CLASSIFIER, --classifier=CLASSIFIER specify the classifier; the choices are SVM or RF (default RF) -M NMISMATCH, --mismatch=NMISMATCH the number of mismatches allowed in matching motifs against sequences -P, --protein a flag signaling that the input sequences are protein sequences; the default is DNA sequences
Notice that each option has two formats, the short and long ones. They are completely equivalent and can be mixed. The order of options does not matter, as long as arguments immediately follow the options. For example, the name of dataset can be specified by -b option by either -b BASE or --base=BASE where BASE is the name of dataset, so named because it is used as the base of temporary files.
The meaning of most options are already explained in the outline of the pipeline. So we will only explain -H, -d, -t and -r here. -H is an optional, that is, it could be left out, which means no hand-picked motifs. If specified, it takes the name of a FASTA-like file. -d takes the directory that will hold results and temporary files. Note that each run of the program collects all the temporary files and results in a separate directory, so the destination directory will hold directories, each of which represent a single run of the program. -r takes the number of runs of the pipeline, and -t records how many threads to run simultaneously. Recall that each run of the pipeline within single run of the program uses a different random partitioning of data but is otherwise identical.
Because there are two possible classifiers but only one is used at a time, a separate program, "compute.py", will run SVM over existing score files. The results are printed to the screen.
The components of the pipeline can also be used separately for flexibility. Conceptually, there are four steps of the pipeline made into stand-alone programs: random partitioning, motif discovery, scoring, and classifier accessment via AUC. The motif discovery for now is assumed to be WordSpy, which is already a stand-alone program. The other steps' programs are listed below.
partseq.py fasta frac
Random partitioning of sequence files in FASTA format. The two partitions will be stored in files with names "part1.fa" and "part2.fa" appended to the FASTA file.
Scoring remaining sequences using Fuzznuc/Fuzzpro and the motifs found by WordSpy. The scores are stored in file name that replaces the first part of WordSpy file name with "score". For example, if the motifs are in "wordspy.m12.txt", then the scores file is "score.m12.txt".
Scoring remaining sequences using MOODS and PWMs outputted by WordSpy. Note that in order for WordSpy to output PWMs, it is best to add "-p 0." to the command-line to lower the threshold for printing PWMs. The output file is fixed as "pfm_scorefile.txt". The default value for "-pval" is 0.1.
Computing AUCs of scores. It can choose whether to use random forest or SVM and optionally pass tuning parameters to the classifiers. SVMs take Gamma and C values, while random forests have node size, number of trees, and mtry as tuning parameters. The AUC is printed as the last line of the output.
The following is an example of using components:
./partseq ozsolak.positive.fa 0.33 ./partseq.py ozsolak.negative.fa 0.33 wordspy ozsolak.positive.part1.fa 12 wordspy.m10.txt -e ozsolak.negative.part1.fa -r0 -f -s -b -d -c ./score_discrete.py -w wordspy.m12.txt -p ozsolak.positive.part2.fa -n ozsolak.negative.part2.fa -F 0.5 -S 0.15 -M 2 ./compute.py -f score.m12.txt -c 'rf'
The above is equivalent to:
./parameterize.py -p ozsolak.positive.fa -n ozsolak.negative.fa -f 0.33 -m 12 -S 0.5 -F 0.15 -t 1 -r 1 --base dennis -c RF --dest . -M 2
The result is a file, "auc.txt", that stores the AUC values of each run. "auc.txt" is put in the same directory as the temporary files are. Note that the order of AUC values in auc.txt is non-deterministic so one cannot find matching AUC value to its corresponding temporary files.
Expected Input File Format
The inputs are the positive and negative sequence files, with possible addition of hand-picked motifs. The sequence files must be valid FASTA files.
The hand-picked motif file consists of lists of motifs. Each motif has two lines, the first begins with a > which is immediately followed by the name of the motif. The second line is the sequence of the motif.
Within the destination directory where the temporary files are kept, there is one directory for each run of the program through "parameterize.py". Each directory has a name reflecting the parameters used to invoke the program. Each invocation results in one directory within which there are multiple, versioned temporary files that record multiple runs of the pipeline within the program.
The directory name within the destination directory has format:
where the terms within <> represent the command line argument values, and the part within square brackets is optional. For example, for ozsolak dataset, motif of maximum length 7, 72% of sequences used for motif finding, 46% of top motifs by Zscore and 46% of top motifs by Frequency intersected, with manual motifs, the directory name would be ozsolak_m7_f72_S46_F77_Hmanual .
Because directories are named after parameter values, two runs with identical parameter values cannot be placed into same destination directory.
The actual temporary files consist of partitioned sequence files, wordspy output files, motif pattern files as inputs to fuzznuc, fuzznuc output files, and scores that hold counts. They all have suffix ".txt" and before that a version number. All files with same version number are from same run of the pipeline. The wordspy outputs files have a leading dot, followed by parameter values for wordspy.
There are also a log file named _log.txt and an empty file named __init__.py options.py records all the parameter values for the run of the program.
The package consists of the following files:
computes AUCs using either SVM or random forest.
handles parsing of WordSpy output and extracting motifs.
instantiates various parameters and starts execution. The user will only have to deal with this file to run the program.
implements versioned files.
randomly partitions a set of sequences from a file in FASTA format.
acts as a manager of pierces of this program.
runs fuzznuc and parses the output.
is a R program file that runs random forest for classification.
scores remaining sequences using motifs found by WordSpy.
scores remaining sequences using PWMs of motifs generated by WordSpy.
contains various utility functions for files.
makes file names are versioned uniquely.
is essentially this file.