+ DMFS: Discriminatory motif-finding based feature selection
+This program consists of discrete steps that are chained together into
+a pipeline. We will call the part that does partitioning, motif
+finding, and classification, the pipeline, which is also the part that
+does useful work. The remaining part deals with multiple runs of the
+pipeline. One running of the program could result in multiple runs of
+the pipeline with different random partitioning of data. When we say
+one run of the program, we mean the entire program, while one run of
+the pipeline is one instance of the pipeline using one random
+This program requires two sequence files. One file is called positive
+file, and the other negative file. For example, in the case of
+nucleosome occupancy, the positive file lists sequences that are known
+to be occupied by nucleosomes, the negative file has sequences that
+First, the program takes the positive sequence file and negative
+sequence file and randomly partition sequences in each file into two
+sets, one used for motif finding, the rest for classification. The
+fraction of sequences used for motif finding must be specified on the
+command line, as well as file names for the sequence files.
+Second, a motif discovering program, Wordspy, is run and results are
+parsed to discover significant motifs. We define significant motifs
+as top ranked motifs by both Zscore and frequency, two metrics in
+Wordspy's output. The exact percentage by Zscore that are considered
+to be top must also be specified on the command line. The same goes
+for frequency. The intersection of two top ranked sets constitute the
+Next, the significant motifs are written to a pattern file, with
+possible addition of hand-picked motifs, and the pattern file is used
+as input to fuzznuc/fuzzpro in addition to the remaining sequences
+from the first step. Fuzznuc/fuzzpro is an alignment program in
+EMBOSS package. The output is parsed and the number of occurrence of
+each significant motif for each sequence is recorded as a table in a
+score file. The number of maximum mismatches allowed while still be
+counted as an occurrence can be specified on the command line.
+Finally, a classifier is run over the score file. In order to obtain
+AUC, 10-fold cross-validation is needed for SVM, while random forest
+has out-of-bag errors from which AUC can be calculated. The choice of
+classifier can be determined by command line argument. The default is
+The result is written to a file named =auc.txt= in the same directory as
+the temporary files are in (there is more explanation about output
+files later). The file =auc.txt= records the AUC value for each run of
+This package has been tested only under Linux, and it is likely that
+only Linux systems will work. The OS-dependent part is the external
+programs that this package depends upon.The main obstacle is the
+program Wordspy, which only has binary distribution for Linux and Unix
+(unknown architecture/company/version), and Cygwin under Windows.
+However, since EMBOSS must be custom-compiled in Cygwin, which is
+non-trivial, Windows option is a difficult route. MAC is wholly
+unsupported by Wordspy. It is recommended that the user use a Linux
+The main issue in installation is the path to external programs, which
+must be visible to the package. Computer systems have a variable that
+stores a list of directories that the system will search in order, to
+see whether a program is in that directory. The list of directories
+is called search path. One solution to path problem is to install all
+required external programs in the same directory as the package itself
+and make sure the current directory is included in the search path.
+Otherwise, add location to search path, usually represented by
+environmental variable "path" under Windows or "PATH" under
+This package makes use of other software packages.
+ This is a motif finding program, which can be found at
+ This is a sequence alignment program and a part of EMBOSS
+ package. One must make sure that the EMBOSS package is at least
+ version 4.1 or above, as there is a bug in previous versions of
+ fuzznuc. This program can be installed anywhere but make sure the
+ location is known to the package by adding it to search path.
+ This program requires R, a popular statistics program,
+ and the randomForest package written for R by Breiman and Cutler. This is not
+ required if one uses SVM instead of random forest.
+There are also some Python packages that must be installed.
+ This package provides computational tools useful to
+ biologist. Here it is used for parsing purpose.
+ This package has ability to deal with large, multidimensional
+ This package implements SVM in Python. This is necessary only
+ if one intends to use SVM for the classification step.
+The Python packages can be installed either by system-wide package
+manager under Linux, or by easy_install which is part of Python
+install program. To test whether they are already available, fire up
+Python interpreter and type =import Bio= for Biopython package, and
+=import numpy= for Numpy. If there is no error message, it means those
+packages are available. PyML will likely have to be custom-installed.
+A user only needs to call parameterize.py in order to run the program.
+A usage help can be generated by using "-h" or "--help" option. The
+output will be similar to
+Usage: parameterize.py [options]
+ -h, --help show this help message and exit
+ -p POS_SEQ, --positive=POS_SEQ
+ the name of positive sequence file
+ -n NEG_SEQ, --negative=NEG_SEQ
+ the name of negative sequence file
+ -f FRAC, --frac=FRAC the fraction of sequences used for motif finding (a
+ number between 0 and 1)
+ -m WORDSPY_MAXLN, --max-len=WORDSPY_MAXLN
+ the maximum length of motifs that wordspy finds
+ (minimum value is 5 for DNAs)
+ -S ZSCOREPERCENT, --top-score=ZSCOREPERCENT
+ the fraction of top scored motifs for later
+ consideration (a number between 0 and 1)
+ -F FREQPERCENT, --top-freq=FREQPERCENT
+ the fraction of top frequent motifs for later
+ consideration (a number between 0 and 1)
+ -t THREADS, --threads=THREADS
+ the number of concurrent threads
+ -r RUNS, --runs=RUNS the number of total runs (the number of random
+ partitions of the data)
+ -H MANUAL, --hand-picked=MANUAL
+ a FASTA file listing hand-picked motifs; the default
+ is no hand-picked motifs
+ -d DEST, --dest=DEST the directory where all the intermediate results will
+ be written; the default is the working directory
+ -b BASE, --base=BASE a name identifying the dataset
+ -c CLASSIFIER, --classifier=CLASSIFIER
+ specify the classifier; the choices are SVM or RF
+ -M NMISMATCH, --mismatch=NMISMATCH
+ the number of mismatches allowed in matching motifs
+ -P, --protein a flag signaling that the input sequences are protein
+ sequences; the default is DNA sequences
+Notice that each option has two formats, the short and long ones.
+They are completely equivalent and can be mixed. The order of options
+do not matter, as long as arguments immediately follow their options.
+For example, the name of dataset can be specified by ``-b`` option by
+either ``-b BASE`` or <code> --base=BASE </code> where ``BASE`` is the name of dataset,
+so named because it is used as the base of temporary files.
+The meaning of most options are already explained in the outline of
+the pipeline. So we will only explain ``-H``, ``-d``, ``-t`` and ``-r`` here.
+``-H`` is an optional, that is, it could be left out, which means no
+hand-picked motifs. If specified, it takes a name of a FASTA-like
+file. ``-d`` takes the directory that will hold results and temporary
+files. Note that each run of the program collects all the temporary
+files and results in a separate directory, so the destination
+directory will hold directories, each of which represent a single run
+of the program. ``-r`` takes the number of runs of the pipeline, and
+``-t`` records how many threads to run simultaneously. Recall that each
+run of the pipeline within single run of the program uses a different
+random partitioning of data but is otherwise identical.
+Because there are two possible classifiers but only one is used at a time, a
+separate program, svm.py, will run SVM over existing score files. The
+results are printed to the screen.
+The result is a file, auc.txt, that stores the AUC values of each run.
+auc.txt is put in the same directory as the temporary files are. Note
+that the order of AUC values in auc.txt is non-deterministic so one
+cannot find matching AUC value to its corresponding temporary files.
+Expected Input File Format
+The inputs are the positive and negative sequence files, with possible
+addition of hand-picked motifs. The sequence files must be valid
+The hand-picked motif file consists of lists of motifs. Each motif
+has two lines, the first begins with a ``>`` which is immediately followed
+by the name of the motif. The second line is the sequence of the
+Within the destination directory where the temporary files are kept,
+there is one directory for each run of the program through
+parameterize.py. Each directory has a name reflecting the parameters
+used to invoke the program. Each invocation results in one directory
+within which there are multiple, versioned temporary files that record
+multiple runs of the pipeline within the program.
+The directory name within the destination directory has format
+where the terms within =<>= represent the command line argument
+values, and the part within square brackets is optional. For example,
+for ozsolak dataset, motif of maximum length 7, 72% of sequences used
+for motif finding, 46% of top motifs by Zscore and 46% of top motifs
+by Frequency intersected, with manual motifs, the directory name would
+Because directories are named after parameter values, two runs with
+identical parameter values cannot be placed into same destination
+The actual temporary files consist of partitioned sequence files,
+wordspy output files, motif pattern files as inputs to ``fuzznuc``,
+fuzznuc output files, and scores that hold counts. They all
+have suffix ``.txt`` and before that a version number. All files with same
+version number are from same run of the pipeline. The wordspy outputs
+files have a leading dot, followed by parameter values for wordspy.
+There are also a log file named ``_log.txt`` and an empty file named
+``__init__.py`` "options.py" records all the parameter values for the run of
+the program, and it could result in a compiled version called
+``options.pyc`` if ``svm.py`` is called on the directory.
+The package consists of the following files:
+ instantiates various parameters and starts
+ execution. The user will only have to deal with this file to run the
+ implements versioned files.
+ randomly partitions a set of sequences from a file in FASTA format.
+ acts as a manager of pierces of this program.
+ runs fuzznuc and parses the output.
+ is a R program file that runs random forest for classification.
+ contains various utility functions for files.
+ is essentially this file.