Commits

Anonymous committed 1bdd710

Initial import of source files and documentation.

Files changed (10)

+============================================================
+ DMFS: Discriminatory motif-finding based feature selection
+============================================================
+
+:Authors:
+  Hao Xiong
+  <hao@biostat.ucsf.edu>
+
+Introduction
+============
+
+This program consists of discrete steps that are chained together into
+a pipeline.  We call the part that does partitioning, motif finding,
+and classification the pipeline; it is the part that does the useful
+work.  The remaining part manages multiple runs of the pipeline: one
+run of the program can produce multiple runs of the pipeline, each
+with a different random partitioning of the data.  When we say one run
+of the program, we mean the entire program, while one run of the
+pipeline is one instance of the pipeline using one random
+partitioning.
+
+This program requires two sequence files: one called the positive
+file, the other the negative file.  For example, in the case of
+nucleosome occupancy, the positive file lists sequences known to be
+occupied by nucleosomes, while the negative file lists sequences that
+are not occupied.
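Each of these is an ordinary FASTA file; a minimal, made-up positive file might look like the following (names and sequences are purely illustrative):

```
>occupied_1 hypothetical nucleosome-occupied region
ACGTGGCTAATCGGATCCAGT
>occupied_2
TTGACCATGGTTACGATCCAA
```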
+
+Outline of the Pipeline
+-----------------------
+
+First, the program takes the positive and negative sequence files and
+randomly partitions the sequences in each file into two sets: one used
+for motif finding, the rest for classification.  The fraction of
+sequences used for motif finding must be specified on the command
+line, as must the names of the sequence files.
+
+Second, a motif-discovery program, Wordspy, is run and its results are
+parsed to find significant motifs.  We define significant motifs as
+those top-ranked by both Zscore and frequency, two metrics in
+Wordspy's output.  The fraction of motifs counted as top-ranked by
+Zscore must be specified on the command line, and likewise for
+frequency.  The intersection of the two top-ranked sets constitutes
+the significant motifs.
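As a sketch of this selection rule (not the package's own code; the motif records and fractions below are made up, with key names mirroring the two Wordspy metrics):

```python
# Significant motifs = intersection of the top fraction ranked by
# Zscore with the top fraction ranked by frequency.  Records and
# fractions are illustrative only.

def find_top(motifs, frac, key):
    """Return the top `frac` fraction of motifs ranked by `key`."""
    ranked = sorted(motifs, key=key, reverse=True)
    return ranked[:int(len(ranked) * frac)]

def significant_motifs(motifs, zscore_frac, freq_frac):
    top_z = set(m['motif'] for m in find_top(motifs, zscore_frac, lambda m: m['Zscore']))
    top_f = set(m['motif'] for m in find_top(motifs, freq_frac, lambda m: m['Freq']))
    return sorted(top_z & top_f)

motifs = [
    {'motif': 'ACGTA', 'Zscore': 9.1, 'Freq': 55},
    {'motif': 'TTAGG', 'Zscore': 7.5, 'Freq': 40},
    {'motif': 'GGCCA', 'Zscore': 3.2, 'Freq': 60},
    {'motif': 'CCAAT', 'Zscore': 8.0, 'Freq': 10},
]
print(significant_motifs(motifs, 0.5, 0.5))  # ['ACGTA']
```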
+
+Next, the significant motifs are written to a pattern file, possibly
+with the addition of hand-picked motifs, and the pattern file is used
+as input to fuzznuc/fuzzpro together with the remaining sequences
+from the first step.  Fuzznuc/fuzzpro is a pattern-matching program in
+the EMBOSS package.  Its output is parsed, and the number of
+occurrences of each significant motif in each sequence is recorded as
+a table in a score file.  The maximum number of mismatches allowed
+while still counting as an occurrence can be specified on the command
+line.
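A pure-Python illustration of this scoring step (the pipeline itself delegates the matching to fuzznuc/fuzzpro and only parses its output; the sequences and motifs below are made up):

```python
# Count, for each sequence, how many windows match each significant
# motif with at most `max_mismatch` mismatches, producing one row per
# sequence and one column per motif, as in the score file.

def count_occurrences(seq, motif, max_mismatch=0):
    """Count windows of `seq` matching `motif` with at most `max_mismatch` mismatches."""
    k = len(motif)
    hits = 0
    for i in range(len(seq) - k + 1):
        mismatches = sum(1 for a, b in zip(seq[i:i + k], motif) if a != b)
        if mismatches <= max_mismatch:
            hits += 1
    return hits

def score_table(seqs, motifs, max_mismatch=0):
    """One row per sequence, one column per motif."""
    return [[count_occurrences(s, m, max_mismatch) for m in motifs] for s in seqs]

print(score_table(['ACGTACGT', 'TTTTACGA'], ['ACGT', 'TTTT'], max_mismatch=1))
# [[2, 0], [1, 2]]
```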
+
+Finally, a classifier is run over the score file.  To obtain an AUC,
+10-fold cross-validation is needed for SVM, while random forest
+provides out-of-bag errors from which the AUC can be calculated.  The
+classifier is chosen by a command line argument; the default is
+random forest.
+
+The result is written to a file named ``auc.txt`` in the same directory
+as the temporary files (there is more explanation of the output files
+later).  The file ``auc.txt`` records the AUC value for each run of
+the pipeline.
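For reference, the AUC recorded here equals the probability that a randomly chosen positive example receives a higher classifier score than a randomly chosen negative one (ties count half); a minimal, illustrative computation, not the package's own code:

```python
# Rank-based AUC (Mann-Whitney statistic) computed directly from
# classifier scores; the score lists below are made up.

def auc(pos_scores, neg_scores):
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))

print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))  # 8/9, about 0.889
```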
+
+Installation
+------------
+
+This package has been tested only under Linux, and it is likely that
+only Linux systems will work.  The OS-dependent part is the set of
+external programs that this package depends upon.  The main obstacle
+is Wordspy, which is distributed only as binaries for Linux, Unix
+(unknown architecture/vendor/version), and Cygwin under Windows.
+However, since EMBOSS must be custom-compiled under Cygwin, which is
+non-trivial, Windows is a difficult route.  Mac is wholly unsupported
+by Wordspy.  It is recommended that users run the package on a Linux
+system.
+
+The main installation issue is making the external programs visible to
+the package.  Operating systems keep a variable that stores a list of
+directories searched, in order, when looking for a program; this list
+is called the search path.  One solution to the path problem is to
+install all required external programs in the same directory as the
+package itself and make sure the current directory is included in the
+search path.  Otherwise, add their location to the search path,
+usually represented by the environment variable "path" under Windows
+or "PATH" under Linux/Unix.
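For example, under Linux one could append the directory holding the external programs to the search path in the shell (the directory below is hypothetical):

```shell
# Hypothetical install location; adjust to wherever wordspy and the
# EMBOSS binaries actually live.
export PATH="$PATH:$HOME/dmfs-tools"
# The shell will now also search $HOME/dmfs-tools when resolving program names.
```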
+
+Software Dependencies
+~~~~~~~~~~~~~~~~~~~~~
+
+This package makes use of other software packages.
+
+Wordspy
+
+  This is a motif finding program, which can be found at
+  http://cic.cs.wustl.edu/wordspy/.
+
+fuzznuc/fuzzpro
+
+  This is a pattern-matching program and part of the EMBOSS
+  package.  Make sure the EMBOSS package is at least version 4.1,
+  as there is a bug in earlier versions of fuzznuc.  The program can
+  be installed anywhere, as long as its location is made known to the
+  package by adding it to the search path.
+
+R, random forest
+
+  This program requires R, a popular statistics environment, and the
+  randomForest package for R, based on Breiman and Cutler's original
+  implementation.  It is not required if one uses SVM instead of
+  random forest.
+
+There are also some Python packages that must be installed.
+
+Biopython
+
+  This package provides computational tools useful to
+  biologists.  Here it is used for parsing.
+
+Numpy
+
+  This package provides support for large, multidimensional
+  arrays.
+
+PyML
+
+  This package implements SVM in Python.  It is necessary only
+  if one intends to use SVM for the classification step.
+
+The Python packages can be installed either through the system package
+manager under Linux or with easy_install, which comes with setuptools.
+To test whether they are already available, start a Python interpreter
+and type ``import Bio`` for the Biopython package and ``import numpy``
+for Numpy.  If there is no error message, the packages are available.
+PyML will likely have to be installed by hand.
+
+Usage
+~~~~~
+
+A user only needs to call parameterize.py in order to run the program.
+A usage message can be generated with the ``-h`` or ``--help`` option.
+The output will be similar to:
+
+Usage: parameterize.py [options]
+
+Options:
+  -h, --help            show this help message and exit
+  -p POS_SEQ, --positive=POS_SEQ
+                        the name of positive sequence file
+  -n NEG_SEQ, --negative=NEG_SEQ
+                        the name of negative sequence file
+  -f FRAC, --frac=FRAC  the fraction of sequences used for motif finding (a
+                        number between 0 and 1)
+  -m WORDSPY_MAXLN, --max-len=WORDSPY_MAXLN
+                        the maximum length of motifs that wordspy finds
+                        (minimum value is 5 for DNAs)
+  -S ZSCOREPERCENT, --top-score=ZSCOREPERCENT
+                        the fraction of top scored motifs for later
+                        consideration (a number between 0 and 1)
+  -F FREQPERCENT, --top-freq=FREQPERCENT
+                        the fraction of top frequent motifs for later
+                        consideration (a number between 0 and 1)
+  -t THREADS, --threads=THREADS
+                        the number of concurrent threads
+  -r RUNS, --runs=RUNS  the number of total runs (the number of random
+                        partitions of the data)
+  -H MANUAL, --hand-picked=MANUAL
+                        a FASTA file listing hand-picked motifs; the default
+                        is no hand-picked motifs
+  -d DEST, --dest=DEST  the directory where all the intermediate results will
+                        be written; the default is the working directory
+  -b BASE, --base=BASE  a name identifying the dataset
+  -c CLASSIFIER, --classifier=CLASSIFIER
+                        specify the classifier; the choices are SVM or RF
+                        (default RF)
+  -M NMISMATCH, --mismatch=NMISMATCH
+                        the number of mismatches allowed in matching motifs
+                        against sequences
+  -P, --protein         a flag signaling that the input sequences are protein
+                        sequences; the default is DNA sequences
+
+
+Notice that each option has two forms, short and long.  They are
+completely equivalent and can be mixed.  The order of options does not
+matter, as long as each argument immediately follows its option.  For
+example, the name of the dataset can be specified with the ``-b``
+option as either ``-b BASE`` or ``--base=BASE``, where ``BASE`` is the
+name of the dataset, so named because it is used as the base of
+temporary file names.
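This equivalence can be checked with optparse, the library parameterize.py itself uses for parsing; the snippet below mirrors only the ``-b``/``--base`` option:

```python
# Demonstration that the short and long option forms are
# interchangeable; only the -b/--base option is reproduced here.
from optparse import OptionParser

parser = OptionParser()
parser.add_option('-b', '--base', dest='base', type='str')

opt_short, _ = parser.parse_args(['-b', 'ozsolak'])
opt_long, _ = parser.parse_args(['--base=ozsolak'])
print(opt_short.base == opt_long.base)  # True: both forms set the same value
```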
+
+The meaning of most options is already explained in the outline of
+the pipeline, so we explain only ``-H``, ``-d``, ``-t`` and ``-r`` here.
+``-H`` is optional; if left out, no hand-picked motifs are used.  If
+specified, it takes the name of a FASTA-like file.  ``-d`` takes the
+directory that will hold results and temporary files.  Note that each
+run of the program collects all its temporary files and results in a
+separate directory, so the destination directory will hold
+directories, each of which represents a single run of the program.
+``-r`` takes the number of runs of the pipeline, and ``-t`` sets how
+many threads run simultaneously.  Recall that each run of the pipeline
+within a single run of the program uses a different random
+partitioning of the data but is otherwise identical.
+
+Because there are two possible classifiers but only one is used at a
+time, a separate program, svm.py, can run SVM over existing score
+files.  The results are printed to the screen.
+
+Result
+++++++
+
+The result is a file, ``auc.txt``, that stores the AUC value of each
+run.  It is placed in the same directory as the temporary files.  Note
+that the order of AUC values in ``auc.txt`` is non-deterministic, so an
+AUC value cannot be matched to its corresponding temporary files.
+
+Expected Input File Format
+++++++++++++++++++++++++++
+
+The inputs are the positive and negative sequence files, with the
+possible addition of hand-picked motifs.  The sequence files must be
+valid FASTA files.
+
+The hand-picked motif file consists of a list of motifs.  Each motif
+takes two lines: the first begins with a ``>`` immediately followed by
+the name of the motif; the second is the sequence of the motif.
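A minimal reader for such a file might look like this (illustrative only; the package itself parses the file with Biopython in ``get_motifs``, and the motif names and sequences below are made up):

```python
# Read a hand-picked motif file: two lines per motif, a '>NAME' line
# followed by the motif sequence.

def read_motifs(lines):
    names, seqs = [], []
    it = iter(lines)
    for header in it:
        if header.startswith('>'):
            names.append(header[1:].strip())
            seqs.append(next(it).strip())
    return names, seqs

example = ['>TATA-box', 'TATAAA', '>CAAT-box', 'GGCCAATCT']
print(read_motifs(example))  # (['TATA-box', 'CAAT-box'], ['TATAAA', 'GGCCAATCT'])
```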
+
+Temporary Files
++++++++++++++++
+
+Within the destination directory where the temporary files are kept,
+there is one directory for each run of the program through
+parameterize.py.  Each directory has a name reflecting the parameters
+used to invoke the program.  Each invocation results in one directory
+within which there are multiple, versioned temporary files that record
+multiple runs of the pipeline within the program.
+
+The directory name within the destination directory has the format::
+
+  <BASE>_m<WORDSPY-MAXLN>_f<FRAC>_S<ZSCOREPERCENT>_F<FREQPERCENT>[_H<MANUAL>]
+
+where the terms within ``<>`` represent the command line argument
+values, and the part within square brackets is optional.  For example,
+for the ozsolak dataset, motifs of maximum length 7, 72% of sequences
+used for motif finding, the top 46% of motifs by Zscore intersected
+with the top 77% by frequency, and manual motifs supplied, the
+directory name would be ``ozsolak_m7_f72_S46_F77_Hmanual``.
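The name can be assembled from the option values as sketched below; the percent-style formatting (0.72 becoming ``f72``) is inferred from the example name above, not taken from the package's code:

```python
# Assemble a run directory name from option values.  The rounding of
# fractions to whole percents is an assumption based on the example.

def dir_name(base, maxln, frac, zscore, freq, manual=''):
    name = '%s_m%d_f%d_S%d_F%d' % (base, maxln, round(frac * 100),
                                   round(zscore * 100), round(freq * 100))
    if manual:
        name += '_H%s' % manual
    return name

print(dir_name('ozsolak', 7, 0.72, 0.46, 0.77, manual='manual'))
# ozsolak_m7_f72_S46_F77_Hmanual
```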
+
+Because directories are named after parameter values, two runs with
+identical parameter values cannot be placed into the same destination
+directory.
+
+The actual temporary files consist of partitioned sequence files,
+wordspy output files, motif pattern files used as input to ``fuzznuc``,
+fuzznuc output files, and score files that hold the counts.  They all
+have the suffix ``.txt``, preceded by a version number.  All files with
+the same version number come from the same run of the pipeline.  The
+wordspy output files have a leading dot, followed by the parameter
+values passed to wordspy.
+
+There are also a log file named ``_log.txt`` and an empty file named
+``__init__.py``.  The file ``options.py`` records all the parameter
+values for the run of the program, and it may leave behind a compiled
+version called ``options.pyc`` if ``svm.py`` is called on the directory.
+
+Files
+~~~~~
+
+The package consists of the following files:
+
+parameterize.py
+
+  instantiates various parameters and starts
+  execution.  The user will only have to deal with this file to run the
+  program.
+
+versionedFile.py
+ 
+  implements versioned files.
+
+partseq.py
+ 
+  randomly partitions a set of sequences from a file in FASTA format.
+
+predict.py
+
+  acts as a manager of the pieces of this program.
+
+runFuzz.py
+
+  runs fuzznuc and parses the output.
+
+randomForest.R
+
+  is an R program that runs random forest for classification.
+
+file_utils.py
+
+  contains various utility functions for files.
+
+README.txt
+
+  is essentially this file.
+

src/parameterize.py

+#!/usr/bin/env python
+"""
+The parameterize module takes care of changing options and keeping the results of
+the same options together by putting them into directories named with option values.
+"""
+
+from os import path, link, unlink, mkdir
+from math import floor
+from Queue import Queue
+import threading
+import sys
+from predict import predict
+
+
+class Options(object):
+    """
+    Class Options handles the various options used in running this pipeline.
+    """
+    def __init__(self,comln=None):
+        """
+        Parse command-line options and create the corresponding directory,
+        named after the changeable option values.
+        """
+        opt = parseOpt(comln)
+        
+        from versionedFile import ver_files, opt_dir
+        self.odir = opt_dir(opt)
+        self.files_seq = ver_files(self.odir.dir_name,opt=opt, start=1, stop=opt.runs+1)
+        # save for displaying
+        self.opt = opt
+        
+    def __str__(self):
+        opt = self.opt
+        return "%-10s\t %-24d\t %-35.2f\t %-20.2f\t %-25.2f\t %s" % (opt.base, opt.wordspy_maxln, opt.frac, opt.Zscorepercent, opt.Freqpercent, True if opt.manual else False)
+        
+def parseOpt(cml=None):
+        """
+        Parse all the command line options and translate them into values that
+        populate the option file.
+        The return value is the parsed options.  The contents of the option file are saved.
+        """
+        from optparse import OptionParser
+        parser = OptionParser(usage='%prog [options]')
+        parser.add_option('-p', '--positive', dest='pos_seq', help='the name of positive sequence file')
+        parser.add_option('-n', '--negative', dest='neg_seq', help='the name of negative sequence file')
+        parser.add_option('-f', '--frac', dest='frac', type='float', help='the fraction of sequences used for motif finding (a number between 0 and 1)')
+        parser.add_option('-m', '--max-len', dest='wordspy_maxln', type='int', help='the maximum length of motifs that wordspy finds (minimum value is 5 for DNAs and 2 for proteins)')
+        parser.add_option('-S', '--top-score', dest='Zscorepercent', type='float', help='the fraction of top scored motifs for later consideration (a number between 0 and 1)')
+        parser.add_option('-F', '--top-freq', dest='Freqpercent', type='float', help='the fraction of top frequent motifs for later consideration (a number between 0 and 1)')
+        parser.add_option('-t', '--threads', dest='threads', type='int', help='the number of concurrent threads')
+        parser.add_option('-r', '--runs', dest='runs', type='int', help='the number of total runs (the number of random partitions of the data)')
+        parser.add_option('-H', '--hand-picked', dest='manual', type='str', help='a FASTA file listing hand-picked motifs; the default is no hand-picked motifs', default='')
+        parser.add_option('-d', '--dest', dest='dest', type='str', help='the directory where all the intermediate results will be written; the default is the working directory', default='.')
+        parser.add_option('-b', '--base', dest='base', type='str', help='a name identifying the dataset')
+        parser.add_option('-c', '--classifier', dest='classifier', type='str', default='RF',help='specify the classifier; the choices are SVM or RF (default RF)')
+        parser.add_option('-M', '--mismatch', dest='nmismatch', type='int', help='the number of mismatches allowed in matching motifs against sequences', default=2)
+        parser.add_option('-P', '--protein', dest='protein', action='store_true', help='a flag signaling that the input sequences are protein sequences; the default is DNA sequences', default=False)
+        # this enables calling this function through Python, instead of the shell
+        if cml is None:
+            opt, args = parser.parse_args()
+        else:
+            opt, args = parser.parse_args(cml)
+        # there should be nothing left after all the options are assigned
+        if len(args) != 0:
+            parser.error("incorrect number of arguments")
+        # guard against impossible values
+        elif opt.runs < 1 or opt.threads < 1:
+            parser.error("The number of concurrent threads must be greater than or equal to 1, and the number of total runs must be at least 1.")
+        elif opt.classifier not in ('SVM', 'RF'):
+            parser.error("The classifier is incorrectly specified.")
+        if not path.isdir(opt.dest):    # create the destination directory if it does not already exist
+            mkdir(opt.dest)
+        opt.wordspy_opt = '-r 0 -n 1000' if opt.protein else '-r 0 -n 300'
+        opt.score = 'Zscore'
+        opt.extraNam, opt.extraMot = get_motifs(opt.manual)
+        if not opt.protein:
+            opt.motminlen = 5              # minimum motif length for DNA
+        else:
+            opt.motminlen = 2              # for proteins, the minimum length is much shorter
+        opt.string = """base = '%s'
+pos_seq = '%s'         # the file holding sequences labeled as positive
+neg_seq = '%s'         # the file holding negative sequences
+frac = %f                             # fraction of sequences for motif finding
+wordspy_opt = '-r 0 -n 100'             # tunable options of wordspy
+wordspy_maxln = %i                       # maximum word length  for wordspy
+
+Zscorepercent = %f                     # the top percentage of motifs considered to be significant
+Freqpercent = %f
+score = 'Zscore'                        # which score to use in picking top motifs
+
+nmismatch = %d                               # the number of mismatches allowed in finding similar motifs
+motminlen = %d
+
+# extra motifs
+extraMot = %s
+extraNam = %s
+""" % (opt.base, opt.pos_seq, opt.neg_seq, opt.frac, opt.wordspy_maxln, opt.Zscorepercent, opt.Freqpercent, opt.nmismatch, opt.motminlen, opt.extraMot.__repr__(), opt.extraNam.__repr__())
+        
+        return opt
+
+
+def get_motifs(filename):
+    if filename == '':
+        return ([],[])
+    from Bio import SeqIO
+    mot_dict = SeqIO.to_dict( SeqIO.parse( open(filename), 'fasta' ) )
+    ids = mot_dict.keys()
+    seqs = [mot_dict[x].seq.tostring() for x in ids]
+    return ids, seqs
+
+def get_logger(base):
+    # set up a logger
+    import logging
+    #name = os.path.basename(base[:-1] if base[-1]=='/' else base)
+    logger = logging.getLogger(base)
+    fh = logging.FileHandler(path.join(base,"_log.txt"))
+    format = logging.Formatter(" %(message)s -- %(asctime)s ") # pre-pending name is not necessary because log already resides in that directory
+    fh.setFormatter(format)
+    logger.addHandler(fh)
+    logger.setLevel(logging.DEBUG)
+    return logger
+
+def threadRunner(nth, nrun, flSeq, protein):
+    """
+    This function manages parallel execution of different runs.
+
+    nth : the number of threads
+    nrun : the number of runs
+    flSeq : a sequence of different runs
+    """
+    results = list()
+    # set up events and queues
+    all_assigned = threading.Event()
+    task_que = Queue(nth)       # the length of task queue is set to be the number of threads
+    result_que = Queue(-1)      # the results queue is unlimited
+    thread_lst = list()
+    # constitute the worker threads
+    log = get_logger(flSeq.base)
+    for i in range(nth):
+        thread = predict(task_que, result_que, log, all_assigned, protein)
+        thread.start()
+        thread_lst.append(thread)
+    log.debug("All work threads started")
+    # put tasks into task queue
+    for task in flSeq:
+        task_que.put(task, block=True)
+    # wait for everything to finish
+    task_que.join()
+    all_assigned.set()                      # signaling all tasks are assigned
+    log.debug("All tasks assigned.")
+    for thread in thread_lst:
+        thread.join()
+    # gather the results
+    while not result_que.empty():
+        results.append(result_que.get())
+    return results
+    
+    
+    ## # write auc to an output file
+    ## predict._predict__writeFile(flnm=flSeq.base+flSeq.sep+'auc.txt', string='\n'.join(auc)+'\n') # the AUC filename could be generated by versionedFile, but for now NO.
+    ## return motifs
+def write_auc(results, fileNm):
+    string =  '\n'.join([r.auc for r in results]) + '\n'
+    predict._predict__writeFile(fileNm, string)
+        
+def _main(cmln=None):
+    tmp = None
+    try:
+        tmp = Options(cmln)
+        tmp.odir.makedir()
+        results = threadRunner(tmp.opt.threads, tmp.opt.runs, tmp.files_seq, tmp.opt.protein)
+        write_auc(results, path.join(tmp.files_seq.base, 'auc.txt'))
+    except SystemExit:                  # only print a help message
+        pass
+    except:
+        if tmp is not None:             # clean up the partially created directory
+            from shutil import rmtree
+            rmtree(tmp.odir.dir_name)   # maybe not necessary
+        raise
+    
+if __name__ == '__main__':
+    #command = '-p ozsolak.positive.fa -n ozsolak.negative.fa -f 0.66 -m 7 -S 1 -F 1 -t 2 -r 2 -H manual_motifs.txt --base ozsolak'.split()
+    _main()

src/partseq.py

+#! /usr/bin/env python
+"""This module partitions a set of sequences in FASTA format into two collections,
+the relative sizes of which are given by a parameter.
+
+The input is a stream conforming to FASTA format.
+For the purpose of partitioning, FASTA format is viewed as repeated two-liners, where the first
+line starts with '>' and contains gene names, ids, and other information not germane
+to the current purpose.  The second line is a sequence made of the letters A, C, G, or T.
+
+The parser is a line-oriented FSM.
+"""
+
+from Bio import SeqIO
+
+class shuffledList(list):
+    """A list that can be randomly permuted.
+
+    shuffledList is a subclass of the built-in type list that randomly permutes its elements.
+    """
+    def __init__(self,*lst, **dct):
+        list.__init__(self, *lst, **dct) # this makes a copy if a list is passed in.
+        import random
+        random.shuffle(self)
+   
+TOK_DESC, TOK_SEQ, TOK_EOF = range(3)
+STAT_DESC, STAT_SEQ, STAT_EOF = range(3)
+
+class Parser(object):
+    """A simple, line-oriented parser of FASTA format.
+    A very simple view of FASTA format is taken here, for the sole purpose of partitioning
+    the sequences.  The benefit of this more complex version is more detailed error messages.
+
+    The constructor takes a stream.
+
+    Function get_rec returns a list of records where one record consists of two lines, the first
+    is the description line and the second is the sequence line.  The lines are not further parsed.
+
+    This parser is no longer used.  Set for removal.
+    """
+
+    import re
+    seq_rec = re.compile('[ACGT]+$',re.I)
+
+    def __init__(self, stream):
+        self.stream = stream
+        self.state = STAT_SEQ
+
+    def get_token(self):
+        """Return the next token."""
+        line = self.stream.readline()
+        if line == '':  # EOF
+            return TOK_EOF, line
+        elif line.startswith('>'):  # description line
+            return TOK_DESC, line
+        elif self.seq_rec.match(line):   # sequence line
+            return TOK_SEQ, line
+        else:
+            self.error('Parsing error: unrecognized line')
+    def error(self,msg):
+        print msg
+        import sys
+        sys.exit(1)
+
+    def transit(self):
+        token,line = self.get_token()
+        if self.state == STAT_DESC:
+            if token == TOK_SEQ:
+                self.record[1] = line
+                self.state = STAT_SEQ
+                return ''.join(self.record)
+            elif token == TOK_DESC:
+                self.error('Parsing error : Two consecutive description lines.\n %s' % line)
+            elif token == TOK_EOF:
+                self.error('Parsing error : Premature EOF')
+        elif self.state == STAT_SEQ:
+            if token == TOK_DESC:
+                self.record = [[]]*2  # a record has two parts, description and sequence
+                self.record[0] = line
+                self.state = STAT_DESC
+            elif token == TOK_SEQ:
+                self.error('Parsing error : Two consecutive sequence lines')
+            elif token == TOK_EOF:
+                self.state = STAT_EOF
+        else:
+            self.error('Invalid states')
+
+    def get_record(self):
+        allrecord = list()
+        while self.state != STAT_EOF:
+            currec = self.transit()
+            if currec is not None and self.state == STAT_SEQ:
+                allrecord.append(currec)  # using append now because sequence records are now merged.
+        return allrecord
+
+class partSeq(object):
+    """Partition a sequence file conforming to FASTA format.
+
+    The partitioning can be done in original order or shuffled order.
+
+    The partitioned sequences are written to basename + '.part1.' + ver + '.txt' and to another
+    file with 'part2' in place of 'part1'.  The file with 'part1' in its name holds
+    'frac' percent of the original file; the one with 'part2' holds the rest.
+    """
+    def _part_utl(lst, frac):
+        """A partition utility function that does the actual partition.
+        Parameters:
+        lst  :  a list of strings conforming to FASTA format
+        frac :  a fraction
+
+        Return values:
+        two values, the first partition and the second
+        """
+        assert 1 > frac > 0, 'fraction is not a valid percentage'
+        tol = len(lst)
+        import math
+        bound = int(math.floor(tol*frac))
+        fpart = lst[:bound]
+        spart = lst[bound:]
+        return fpart, spart
+    _part_utl = staticmethod(_part_utl)
+    def __init__(self, fasta, flname, frac):
+        """Given the file name of a FASTA file and a fraction, partition the sequences
+        in the file into two sets, one holding 'frac' of the sequences and the other the rest.
+        """
+        self.flname = flname                # save for writing
+        fl = open(fasta, 'rU')
+        #parser = Parser(fl)
+        #orilst = parser.get_record()
+        orilst = list(SeqIO.parse(fl, 'fasta')) # using Biopython's parser
+        self.prist = self._part_utl(orilst, frac)
+        shuffled = shuffledList(orilst)
+        self.shuffled = self._part_utl(shuffled, frac)
+        fl.close()
+                        
+    def write(self, shuffled):
+        """Write the two partitions to their output files.
+        The partition holding 'frac' percent of the original file is always written
+        first, followed by the rest of the original file."""
+        def helper(fullnm,tmp):
+            output = open(fullnm, 'w')
+            SeqIO.write(tmp, output, 'fasta')
+            output.close()
+            
+        if shuffled:
+            fst, sec = self.shuffled # two partitions are passed as members of a list
+        else:
+            fst, sec = self.prist
+        name1,name2 = self.flname
+        helper(name1,fst)
+        helper(name2,sec)
+        return name1, name2
+
+if __name__ == '__main__':
+    import sys
+    mesg = """Usage:
+    partseq FASTA_file_name frac
+    
+    The "frac" of the file "FASTA_file_name" will be written to "FASTA_file_name".part1.fa and the rest to
+    "FASTA_file_name".part2.fa."""
+    if 3 != len(sys.argv):
+        print(mesg)
+        sys.exit(1)
+    try:
+        frac = float(sys.argv[2])
+        if not (0. < frac < 1.):
+            print("The fraction must be between 0 and 1.")
+            sys.exit(1)
+        flname = sys.argv[1]
+        nmlst = flname.split('.')
+        p = partSeq(flname, ('.'.join(nmlst[:-1]+["part1"]+nmlst[-1:]), '.'.join(nmlst[:-1]+["part2"]+nmlst[-1:])), frac)
+        p.write(shuffled=True)
+    except ValueError:
+        print("The fraction must be a real number.")
+        sys.exit(1)
+    except IOError:
+        print("The input FASTA file does not exist.")

src/predict.py

+#!/usr/bin/env python
+"""The main program that brings together all the pieces.
+"""
+import re, os, math, threading, sys, subprocess
+import numpy as np
+
+class Result(object):
+    """
+    A stable interface to the result of one run.
+    """
+    def __init__(self, mot, auc, opt):
+        self.mot = mot
+        self.auc = auc
+        self.opt = opt
+def add_ver(meth,ver):
+    def fcn(messg):
+        meth("Run%i "%ver + messg)
+    return fcn
+
+from Queue import Empty
+from runFuzz import fuzzError
+class predict(threading.Thread):
+    """
+    This class groups all the operations into a single place and a single thread and enables multithreading.
+    """
+    def __init__(self, t_que, r_que, logger, all_done, protein, **kwargs):
+        """
+        t_que is the task queue; r_que is the result queue;
+        log is the logger;
+        all_done is an event that signals that all available tasks are done;
+        protein is a flag signaling whether the input sequences are from protein.
+        """
+        threading.Thread.__init__(self, **kwargs)
+        self.r_que = r_que             # for saving results
+        self.t_que = t_que             # task queue
+        #self.filenames = filenames
+        self.logger = logger
+        self.all_done = all_done
+        self.protein = protein
+                
+    def run(self):
+        """
+        The main function of a thread.  This method pulls filenames (task
+        assignments) from the task queue and runs them one by one.
+        """
+        all_done = self.all_done
+        while True:
+            if all_done.isSet() :
+                break
+            try:
+                filenames = self.t_que.get(block=True, timeout = 4)                
+            except Empty:
+                continue
+            self.debug = add_ver(self.logger.debug, filenames.ver)
+            self.exception = add_ver(self.logger.exception, filenames.ver)
+            try:
+                self.r_que.put( self.one_run(filenames) )
+            except fuzzError, e:
+                self.exception("fuzznuc error: %s"%e)
+            except Exception, e:
+                self.exception("An uncaught exception has occurred!!! %s" % e)
+                #raise                   # comment this out to allow graceful failing
+            finally:
+                self.t_que.task_done()
+    def one_run(self, filenames):
+        """
+        The main steps of one classification.
+
+        The steps are shown in the comments.
+        """
+        debug = self.debug
+
+        # 1. partition positive sequences and negative sequences into two sets
+        options = filenames.options
+        pos1, pos2 = part_seq(options.pos_seq,filenames.pos_seq, options)
+        neg1, neg2 = part_seq(options.neg_seq,filenames.neg_seq, options)
+        debug( 'pos1 = %s, pos2 = %s.'% (pos1, pos2))
+        debug('neg1 = %s, neg2 = %s' % (neg1, neg2))
+
+        # 2. one set is used in discovering significant motifs
+        motifs = motif_fcn(filenames, pos1, neg1,debug, self.protein) # IT SHOULD BE pos1, neg1
+        # find out the top motifs by score and frequency
+        top_score = set( find_top(motifs, options.Zscorepercent, akey=lambda x: x[options.score]) )
+        top_freq = set( find_top(motifs,options.Freqpercent,akey=lambda x: x['Freq']) )
+        motifSeq = [x['motif'] for x in (top_score & top_freq) if len(x['motif']) >= options.motminlen] # significant motifs are the intersection (not the union) of the two top sets
+        motifSeq.sort()
+        self.motifSeq = motifSeq
+        debug('Finished generating top motifs.')
+
+        # 3. score the withheld data set with the significant motifs
+        pat = writePattern(options.extraMot+motifSeq,options.extraNam +[None]*len(motifSeq),  filenames.patFile, options.nmismatch, self.protein) # any manually supplied extra motifs are prepended
+        debug("Finished writing pattern file to %s." % filenames.patFile)
+        # reusing filenames.fuzz is safe here because it is a temporary file consumed within the function call
+        posScores = scoreSeqs(pat.patterns, filenames.patFile, seqFile=pos2, outFlNm=filenames.fuzz, protein=self.protein)
+        debug("Finished scoring positive patterns using positive sequence file %s." % pos2)
+        negScores = scoreSeqs(pat.patterns, filenames.patFile, seqFile=neg2, outFlNm=filenames.fuzz, protein=self.protein)
+        debug("Finished scoring patterns using negative sequence file %s." % neg2)
+
+        # 4. classify the withheld data set
+        if options.classifier == 'RF':
+            debug("Using Random Forest for classification.")
+            tmp =  self.useRF(posScores, negScores,filenames)
+        elif options.classifier == 'SVM':
+            debug("Using SVM for classification.")
+            tmp = self.useSVM(posScores, negScores,filenames)
+        debug("Finished classification using score file %s" % filenames.score)
+        return tmp
+    def useSVM(self, posScores, negScores,filenames):
+        self.writeScore(posScores, negScores, filenames)
+        from PyML import VectorDataSet,SVM, ker
+        data = VectorDataSet(filenames.score,idColumn=0, labelsColumn=-1, headerRow=True)
+        data.normalize(2)
+        svm = SVM(ker.Gaussian(0.01), C=1)            # kernel width and C chosen ad hoc; may need tuning
+        r = svm.cv(data, numFolds=10)   # 10-fold cross-validation
+        return Result(mot=self.motifSeq, auc = str(r.getROC()), opt=filenames.options)
+    def useRF(self, posScores, negScores,filenames):
+        self.writeScore(posScores,negScores,filenames)
+        randFor = subprocess.Popen('./randomForest.R %s'%filenames.score,stdout=subprocess.PIPE, shell=True )
+        randFor.wait()                  # to avoid race condition; it won't block because randomForest does not take input from this program
+        output = randFor.communicate()[0]
+        if randFor.returncode != 0:
+            raise Exception("Random forest crashed.")
+        return Result(mot=self.motifSeq, auc = output.split()[1], opt=filenames.options)
+    def writeScore(self, posScores, negScores,filenames):
+        tostring = lambda array, cat: '\n'.join(['\t'.join(map(str,t))+'\t%s'%cat for t in array.tolist()])        
+        header = list(posScores.dtype.names) + ['class']
+        header.remove('seq')
+        header = '\t'.join(header)+'\n'
+        string = header + tostring(posScores, 'pos') + '\n' + tostring(negScores, 'neg') + '\n'
+        self.__writeFile(filenames.score, string)
+        self.debug("Finished writing score file : %s" %filenames.score)
+        
+    @staticmethod
+    def __writeFile(flnm, string):
+        """Write string to flnm."""
+        f = open(flnm, 'w')
+        try:
+            f.write(string)
+        finally:
+            f.close()
+
+    
+def writePattern(patterns, patName, patFlNm, mismat, protein):
+    """Take patterns, pattern names, a pattern file name, and the number of
+    mismatches; write the pattern file and return the Pattern object.
+    """
+    def adapt(k):
+        if k < 2:
+            return 0
+        elif k < 5:
+            return 1
+        else:
+            return 2
+    from runFuzz import Pattern
+    # adaptive mismatches
+    if protein:
+        mismat = [adapt(len(x)) for x in patterns]
+    pat = Pattern(patterns,names = patName, mismatch=mismat)
+    pat.writeFile(patFlNm)
+    return pat
+def scoreSeqs(pat, patFlNm, seqFile, outFlNm, protein):
+    """Run fuzznuc (or fuzzpro) with the given pattern file on a sequence file
+    and return a table of per-sequence motif counts.
+    """
+    from runFuzz import  Fuzznuc, parseFuzz, count, build_table
+    # run fuzznuc and count the number of occurrences of each pattern
+    fuzz = Fuzznuc(patFlNm,seqFile, outFlNm, protein)
+    outfl = fuzz.run()
+    fuzzout = parseFuzz(outfl)
+    record = fuzzout.get_record()
+    allcounts = count(record, pat)
+    data = build_table(seqFile, allcounts)
+    return data
+    
+def find_top(seq, per, akey):
+    """Given a finite sequence (a Python sequence, nothing to do with genes)
+    and a fraction 'per', return the top 'per' fraction of the list under the
+    given key.
+    """
+    tot = len(seq)
+    top_per = int(math.floor(tot * per))
+    return sorted(seq, key=akey)[tot - top_per:]  # avoids [-0:], which would return the whole list
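+# Illustration (toy values, not pipeline data): with per=0.5 the upper half of
+# the list survives, e.g. find_top([1, 4, 2, 3], 0.5, akey=lambda x: x)
+# returns [3, 4].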
+
+
+def part_seq(fasta, filenames,opt):
+    from partseq import partSeq
+    part = partSeq(fasta,filenames, opt.frac)
+    return part.write(shuffled=True)   # set shuffled to True for random partitioning
+
+rn = re.compile(r'-r\s*(\d+)[ \w\d-]*-n\s*(\d+)')
+def motif_fcn(filenames, pos_nm, neg_nm,debug, protein):
+    options = filenames.options
+    # the output name for wordspy is the basename plus r#.n# and then the suffix
+    t1, t2 = rn.search(options.wordspy_opt).group(1,2)
+    # call wordspy program to generate motifs
+    filenames.wordspy_opts('r'+t1 + '.n'+t2+'.m'+str(options.wordspy_maxln))
+    ws_oup = filenames.wordspy
+    ws_comm = ' '.join(['wordspy',pos_nm, str(options.wordspy_maxln), ws_oup, '-e', neg_nm, options.wordspy_opt, '-f -s -b -d -c'])
+    if protein :
+        ws_comm = ws_comm + ' -a protein'
+    debug( ws_comm )
+    from subprocess import call 
+    if call(ws_comm, shell=True) != 0:
+        raise Exception("Wordspy crashed")
+        
+    from motifs import motifRec as mr
+    motifs = mr(ws_oup, protein)
+    return motifs.uniMotifs
+
+if __name__ == '__main__':
+    from parameterize import get_logger, Options
+    command = "-p SOLP.sol.fasta -n SOLP.insol.fasta -f 0.10 -m 5 -S 1 -F 0.5 -t 1 -r 1 --base SOLP --dest tmp -M 2 -P"
+    opt = Options(command.split())
+    opt.odir.makedir()                  # create the folder to hold the intermediate files
+    logger = get_logger(opt.files_seq.base)
+    flnm = opt.files_seq.next()
+    p = predict([],[], logger, False, True)
+    p.debug = add_ver(logger.debug, flnm.ver)
+    result = p.one_run(flnm)
+    

src/randomForest.R

+#!/usr/bin/env Rscript
+suppressPackageStartupMessages(library(randomForest))
+
+argv <- commandArgs( TRUE )
+score.file <- argv[1]
+scores <- read.table(score.file)
+scores.rf <- randomForest(class ~ ., data = scores, importance = TRUE, ntree = 500)
+if (length(argv) == 2) {
+  print(scores.rf)
+}
+
+# making sure that orders match between labels and votes
+tmp <- scores.rf$votes
+ord <- match(rownames(scores), rownames(tmp))   # 'order' would shadow base::order
+votes <- tmp[ord, ]
+# take positive vote counts as the score for ROC
+pos.votes <- votes[, 2]
+labels <- ordered(scores$class)
+suppressPackageStartupMessages(library(ROCR))
+pred <- prediction(pos.votes, labels)
+auc <- performance(pred, "auc")
+print(attr(auc, "y.values")[[1]])
+#!/usr/bin/env python
+
+"""Functions and classes that deal with running fuzznuc and parsing its output."""
+
+import re
+import numpy as np
+
+class fuzzError(Exception):
+    pass
+
+header = 'SeqName\tStart\tEnd\tScore\tStrand\tPattern_name\tMismatch'
+matln = re.compile(r'(\S+)\t(\d+)\t(\d+)\t\d+\t(\+|-)\t(\w+)\t(\d|\.)')
+empty = re.compile(r'\s*$')
+#seqlen = -1
+TOK_HEADER, TOK_MATCH, TOK_EOF = range(3)
+STAT_START, STAT_HEADER, STAT_MATCH = range(3)
+
+class parseFuzz(object):
+    """Parse the output of fuzznuc in excel format.
+    """
+    def __init__(self,stream):
+        self.obj = None                          # obj will hold the match object for match lines
+        self.state = STAT_START
+        self.currec = []
+        self.curseq = ''
+        self.rawlst = []
+        #self.count = dict()
+
+        #stream = string.split('\n')
+        #stream = open(fuzzfl,'r')
+        for line in stream:
+            if empty.match(line):
+                #print "skipped line: %s" % line
+                continue
+            tok = self.get_token(line.rstrip('\n'))
+            self.state = self.transit(tok)
+        self.transit(TOK_EOF)
+        stream.close()
+    def get_record(self):
+        mx = max([len(x[0]) for x in self.rawlst])
+        nx = max([len(x[4]) for x in self.rawlst])
+        dtype = np.dtype( [('SeqName', 'S%i'%mx), ('start', int), ('end', int),('strand', 'S1'), ('pattern', 'S%i'%nx), ('nmis', int)] )
+        return np.array(self.rawlst, dtype=dtype)
+    
+    def get_token(self,line):
+        """There are three kinds of tokens: header line, match line, and EOF.
+        """
+        self.obj = matln.search(line)
+        if line == header:
+            return TOK_HEADER
+        elif self.obj:
+            return TOK_MATCH
+        else:
+            raise fuzzError('Unrecognized token.\nLine: %s' % line)
+    def transit(self,token):
+        if self.state == STAT_HEADER:
+            if token == TOK_MATCH:      # start of a record
+                lst = self.obj.groups()
+                if lst[-1]=='.' :
+                    lst = lst[:-1]+(0,)
+                self.curseq = lst[0]
+                self.currec = [lst]
+                return STAT_MATCH
+            elif token == TOK_HEADER:   # empty record
+                return STAT_HEADER
+            elif token == TOK_EOF:
+                return                  # empty record at the end
+        elif self.state == STAT_MATCH:
+            if token == TOK_HEADER:          # end of one record
+                #self.count[self.curseq] = len(self.currec)
+                self.rawlst.extend(self.currec)                
+                return STAT_HEADER
+            elif token == TOK_MATCH:   # middle of a record
+                lst = self.obj.groups()
+                if lst[-1]=='.' :
+                    lst = lst[:-1]+(0,)
+                assert self.curseq == lst[0], 'different sequence names within one record.'
+                self.currec.append(lst)
+                return STAT_MATCH
+            elif token == TOK_EOF:
+                self.rawlst.extend(self.currec)
+        elif self.state == STAT_START:
+            if token == TOK_HEADER:
+                return STAT_HEADER
+            else:
+                raise fuzzError('Unknown token at the beginning.')
+    
+        
+class Fuzznuc(object):
+    """
+    Construct commandline and pass it to fuzznuc, the result of which is returned as a stream.
+
+    A misnomer because it could also launch fuzzpro.
+    """
+    def __init__(self,patFl, seqFl, outFl, protein=False):
+        if protein:
+            self.cmdln = 'fuzzpro -auto -supper -rformat excel -pattern @%s -sequence %s -outfile %s' % (patFl, seqFl, outFl)
+        else :
+            self.cmdln = 'fuzznuc -auto -complement -supper -rformat excel -pattern @%s -sequence %s -outfile %s' % (patFl, seqFl, outFl)
+        self.outFl = outFl
+    def run(self):
+        from subprocess import call
+        proc = call(self.cmdln, shell=True)
+        return open(self.outFl)         # TO DO: close this file somewhere else
+class Pattern(object):
+    """Keep track of patterns that could be fed to fuzznuc.
+    """
+    def __init__(self, patterns,names=None,mismatch=0):
+        """
+        Record patterns along with their names and the number of mismatches allowed.
+        
+        Names:   If no name is given, then the default name is 'pat' appended with an integer.
+                 Names can also be partially specified, with unspecified names left as None.
+                 The unspecified names will become default name.
+        mismatch:   The default number of mismatches is 0.  Mismatch can be specified either
+                    as an integer, in which case it will be the mismatch for all patterns, or
+                    a list of integers.
+        
+        All three parameters must be of same length if provided as sequences.
+        """
+        npat = len(patterns)
+        if isinstance(mismatch, int):
+            mismatch = [mismatch]* npat
+        if names is None:
+            names = ['pattern%i' % t for t in range(npat)]
+        else:
+            # it is assumed that names is a list if specified
+            for t, x in enumerate(names):
+                if x is None:
+                    names[t] = 'pattern%i' % t
+        assert len(mismatch) == npat and len(names) == npat, 'pattern, names, and mismatch lists must be same length.'
+        patlst = ['>%s <mismatch=%i>\n%s' % (n,m,p) for n,p,m in zip(names, patterns, mismatch)]
+        self.string = '\n'.join(patlst)
+        self.patterns = names
+
+    def writeFile(self, flname):
+        f = open(flname, 'w')
+        try:
+            f.write(self.string)
+        finally:
+            f.close()
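+# Example of the pattern file format this produces (toy patterns; an entry
+# whose name is None gets the default 'pattern<i>' name):
+#
+#     >RAP1 <mismatch=1>
+#     CACCCATACAT
+#     >pattern1 <mismatch=0>
+#     RTCAYT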
+def count(record, patterns):
+    """Count the number of occurrences of each motif in each sequence.
+
+    The inputs are the parsed fuzznuc output and the list of patterns it was
+    run with.  The counts include both the patterns and their complements.
+
+    Note that this function depends on the implicit ordering of matches and on
+    matches for any one sequence being consecutive.
+    """
+    def onepat(rec):
+        # count the number of occurrences of one pattern
+        def onestrand(one):
+            curseq = ''
+            prvend = 0
+            for y in one:
+                if y['SeqName'] == curseq:  # still looking at the same sequence
+                    if prvend < y['start']:
+                        counts[curseq] = counts[curseq] + 1
+                        prvend = y['end']
+                else:                       # starting a new sequence
+                    curseq = y['SeqName']
+                    counts[curseq] = counts[curseq] + 1 if curseq in counts else 1
+                    prvend = y['end']
+        
+        counts = dict()
+        onestrand(rec[ rec['strand'] == '+' ])
+        onestrand(rec[ rec['strand'] == '-' ])
+        return counts
+    allrec = dict()      
+    for x in patterns:
+        currec = record[record['pattern']==x]
+        allrec[x] = onepat(currec)
+    return allrec
+
+def build_table(fasta, allcounts):
+    """Build the table that has the counts of all sequences for all motifs.
+    
+    It expects a file in FASTA format which had been passed to fuzznuc as sequence
+    file and the output of the count function above.
+
+    It returns a table in form of
+    seq_id   motif_1   motif_2   ...   motif_n
+    """
+    # obtain all sequences
+    from Bio import SeqIO
+    fl = open(fasta, 'r')
+    ids = [cur.id for cur in SeqIO.parse(fl, 'fasta')] # id and name both can be used here
+    nseq = len(ids); nmot = len(allcounts)
+    seqlen = max( [len(x) for x in ids] )
+    # initialize a table with zero counts and all the sequence ids
+    tmp = [(x,) + (0,)*nmot for x in ids]
+    ltype = [('seq', 'S%i'%seqlen)]+[(m, int) for m in allcounts]
+    dtype = np.dtype(ltype)
+    table = np.array(tmp, dtype=dtype)
+    # fill in the counts
+    lookup = dict(zip([ x[0] for x in tmp], range(len(tmp)))) # a table mapping sequence ids to index
+    for mot,t in ltype[1:]:
+        counts = allcounts[mot]
+        tmp = [0]*nseq
+        for seq in counts:
+            tmp[lookup[seq]] = counts[seq]
+        table[mot] = np.array(tmp,dtype=int)
+    return table
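+# The table returned by build_table looks like (toy data):
+#
+#     seq        RAP1   pattern1
+#     chr1_001      2          0
+#     chr1_002      1          3
+#
+# one row per input sequence, one integer count column per motif.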
+    
+   
+if __name__ == '__main__':
+    patFile = 'pattern.txt'
+    pat = Pattern(['CACCCATACAT','RTCAYTNNNNACG'], names=['RAP1',None], mismatch=[7,5])
+    pat.writeFile(patFile)
+    
+    fuzz = Fuzznuc(patFile, 'fuzz/pos_factor13.fa', 'fuzz/fuzz.out')
+    outfl = fuzz.run()
+    fuzzout = parseFuzz(outfl)
+    record = fuzzout.get_record()
+    allcounts = count(record, pat.patterns)
+    data = build_table('fuzz/pos_factor13.fa', allcounts)
+    
+    

src/score_discrete.py

+#! /usr/bin/env python
+"""Scoring sequences using discrete motifs generated by WordSpy.
+"""
+import math
+
+def writePattern(patterns, patName, patFlNm, mismat, protein):
+    """Take patterns, pattern names, a pattern file name, and the number of
+    mismatches; write the pattern file and return the Pattern object.
+    """
+    def adapt(k):
+        if k < 2:
+            return 0
+        elif k < 5:
+            return 1
+        else:
+            return 2
+    from runFuzz import Pattern
+    # adaptive mismatches
+    if protein:
+        mismat = [adapt(len(x)) for x in patterns]
+    pat = Pattern(patterns,names = patName, mismatch=mismat)
+    pat.writeFile(patFlNm)
+    return pat
+def scoreSeqs(pat, patFlNm, seqFile, outFlNm, protein):
+    """Run fuzznuc (or fuzzpro) with the given pattern file on a sequence file
+    and return a table of per-sequence motif counts.
+    """
+    from runFuzz import  Fuzznuc, parseFuzz, count, build_table
+    # run fuzznuc and count the number of occurrences of each pattern
+    fuzz = Fuzznuc(patFlNm,seqFile, outFlNm, protein)
+    outfl = fuzz.run()
+    fuzzout = parseFuzz(outfl)
+    record = fuzzout.get_record()
+    allcounts = count(record, pat)
+    data = build_table(seqFile, allcounts)
+    return data
+    
+def find_top(seq, per, akey):
+    """Given a finite sequence (a Python sequence, nothing to do with genes)
+    and a fraction 'per', return the top 'per' fraction of the list under the
+    given key.
+    """
+    tot = len(seq)
+    top_per = int(math.floor(tot * per))
+    return sorted(seq, key=akey)[tot - top_per:]  # avoids [-0:], which would return the whole list
+
+def parseOpt(cml=None):
+    """
+    Parse the command-line options.
+
+    The return value is the parsed options object.
+    """
+    from optparse import OptionParser
+    parser = OptionParser(usage='%prog [options]')
+    parser.add_option('-w', '--wordspy', dest='wordspy', help='the name of the WordSpy output file')
+    parser.add_option('-p', '--positive', dest='pos_seq', help='the name of the positive sequence file')
+    parser.add_option('-n', '--negative', dest='neg_seq', help='the name of the negative sequence file')
+    parser.add_option('-S', '--top-score', dest='Zscorepercent', type='float', help='the fraction of top-scored motifs for later consideration (a number between 0 and 1)')
+    parser.add_option('-F', '--top-freq', dest='Freqpercent', type='float', help='the fraction of top frequent motifs for later consideration (a number between 0 and 1)')
+    parser.add_option('-M', '--mismatch', dest='nmismatch', type='int', help='the number of mismatches allowed in matching motifs against sequences', default=2)
+    parser.add_option('-P', '--protein', dest='protein', action='store_true', help='a flag signaling that the input sequences are protein sequences; the default is DNA sequences', default=False)
+    # optparse falls back to sys.argv[1:] when cml is None
+    opt, args = parser.parse_args(cml)
+    # there should be nothing left after all the options are assigned
+    if len(args) != 0:
+        parser.error("incorrect number of arguments")
+    return opt
+ 
+def writeScore(posScores, negScores, filename):
+    tostring = lambda array, cat: '\n'.join(['\t'.join(map(str,t))+'\t%s'%cat for t in array.tolist()])
+    header = list(posScores.dtype.names) + ['class']
+    header.remove('seq')
+    header = '\t'.join(header)+'\n'
+    string = header + tostring(posScores, 'pos') + '\n' + tostring(negScores, 'neg') + '\n'
+    f = open(filename, 'w')
+    try:
+        f.write(string)
+    finally:
+        f.close()
+
+if __name__ == '__main__':
+    opt = parseOpt()
+    motminlen = 2 if opt.protein else 5
+    # parse WordSpy output
+    from motifs import motifRec as mr
+    motifs = mr(opt.wordspy, opt.protein).uniMotifs
+    # select top motifs
+    top_score = set( find_top(motifs, opt.Zscorepercent, akey=lambda x: x['Zscore']) )
+    top_freq = set( find_top(motifs,opt.Freqpercent,akey=lambda x: x['Freq']) )
+    motifSeq = [x['motif'] for x in (top_score & top_freq) if len(x['motif']) >= motminlen] # NOTE whether it is intersection or union
+    motifSeq.sort()
+    # write out patterns
+    tmp = opt.wordspy.split('.')
+    patFile = ".".join( ['pattern'] + tmp[1:] )
+    pat = writePattern(motifSeq,[None]*len(motifSeq), patFile, opt.nmismatch, opt.protein)
+    # scoring sequences using Fuzznuc/Fuzzpro
+    fuzzFile = ".".join(['fuzz'] + tmp[1:])
+    posScores = scoreSeqs(pat.patterns, patFile, seqFile=opt.pos_seq, outFlNm=fuzzFile, protein=opt.protein)
+    negScores = scoreSeqs(pat.patterns, patFile, seqFile=opt.neg_seq, outFlNm=fuzzFile, protein=opt.protein)
+    # write out the score file
+    scoreFile = ".".join(['score'] + tmp[1:])
+    writeScore(posScores, negScores, scoreFile)

src/score_pfms.py

+#! /usr/bin/env python
+from __future__ import with_statement
+
+import sys, getopt
+import MOODS
+from Bio import SeqIO
+from Bio.Seq import Seq
+from Bio.SeqRecord import SeqRecord
+from Bio.Alphabet import generic_dna
+import math
+
+def revcomp(seq):
+#================
+	rc = []
+	for base in seq:
+		if base in ['A','a']:
+			rc.append('t')
+		elif base in ['T','t']:
+			rc.append('a')
+		elif base in ['C','c']:
+			rc.append('g')
+		elif base in ['G','g']:
+			rc.append('c')
+		else:
+			rc.append('n')
+	rc.reverse()
+	return ''.join(rc)
+
+def parse_wordspy_pfms(wordspyfile):
+#====================================
+	'''Parse Wordspy PFMs. Create a matrix of PFMs and a list of
+	motif names (with corresponding indices).'''
+	pfms = []
+	motifs = []
+	
+	f = open(wordspyfile, 'r')
+	for line in f:
+		entry = line.strip().split()
+		
+		if len(entry) < 1:
+			continue
+		
+		if entry[0] == 'Weight':
+			name = previous
+		elif entry[0] in ('a:', 'c:', 'g:', 't:'):
+			# one base row of the 4 x L count matrix
+			if entry[0] == 'a:':
+				matrix = []
+			matrix.append([float(v) for v in entry[1:]])
+			if entry[0] == 't:' and name.lower() not in motifs:
+				pfms.append(matrix)
+				motifs.append(name.lower())
+		else:
+			previous = entry[0]
+	f.close()
+	
+	return pfms, motifs
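+# The parser above implies Wordspy PFM blocks of roughly this shape (a sketch;
+# the exact surrounding Wordspy output may differ):
+#
+#     <motif name>
+#     Weight ...
+#     a: 0.30 0.10 ...
+#     c: 0.20 0.40 ...
+#     g: 0.25 0.25 ...
+#     t: 0.25 0.25 ...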
+
+def read_fasta(fastafile):
+#==========================
+	'''Read input sequences.'''
+	fasta = []
+	f = open(fastafile, 'rU')
+	for record in SeqIO.parse(f,'fasta'):
+		fasta.append(record)
+	f.close()
+	for seq in fasta:
+		seq.seq = seq.seq.lower()
+	return fasta
+
+def score_sequences(seqs,motifs,pfms,pval):
+#==========================================
+	''' NOTE: MOODS.search() method automatically converts count 
+	matrix into log-odds matrix and then finds matches. However, 
+	this conversion ONLY OCCURS IF THE THRESHOLD IS GIVEN AS A PVALUE. 
+	Calling MOODS with an absolute threshold is not appropriate here.
+	Default pval: 0.01.'''
+	
+	res = []
+	for entry in seqs:
+		fwd_matches = MOODS.search(entry.seq, pfms, pval)
+		rev_matches = MOODS.search(revcomp(entry.seq),pfms,pval)
+		# Code for calculating counts of matches
+		#for i in range(len(motifs)):
+		#	counts.append(str(len(fwd_matches[i])+len(rev_matches[i])))
+		
+		# Calculate the sum of scores over all matches for a given
+		# PFM (instead of counts), on both strands.
+		scores = []
+		for i in range(len(fwd_matches)):
+			total = 0          # 'sum' would shadow the builtin
+			for m in fwd_matches[i]:
+				total += m[1]
+			for m in rev_matches[i]:
+				total += m[1]
+			scores.append(str(total))
+		res.append(scores)
+		
+	return res
+
+def write_scores(res,motifs,seqs, cat):
+#======================================
+	out = open('pfm_scorefile.txt','w')
+	out.write('\t'.join(motifs+['Class']))
+	out.write('\n')
+	for i in range(len(res)):
+		out.write('%s\t' %seqs[i].name)
+		out.write('\t'.join(res[i]))
+		out.write('\t' + cat)
+		out.write('\n')
+	out.close()
+
+def scores2str(res, seqs, cat):
+#==============================
+	"""
+	Convert results and sequences to strings
+	"""
+	from itertools import izip
+	for seq, r in izip(seqs, res):
+		yield str(seq.name) + '\t' + '\t'.join(r) + '\t' + cat + '\n'
+	#return "\n".join([str(seqs[i].name)+'\t'+'\t'.join(res[i]) for i in range(len(res))])
+		
+def usage():
+#===========
+	print '''
+==================================================================
+USAGE: python score_wordspy_pfms.py [options]
+
+  Options:
+  -w	--wordspyfile
+  -f	--fastafile
+  -n	--neg
+  -p	--pvalue	
+ 
+==================================================================
+'''
+
+
+if __name__ == '__main__':
+#=========================
+	# Extract command line arguments.
+	try:
+		opts, args = getopt.getopt(sys.argv[1:], 'hw:f:n:p:', 
+		["help","wordspyfile=","fastafile=","neg=","pvalue="]) 
+	except getopt.GetoptError, err:
+		print str(err)
+		usage()
+		sys.exit()
+	
+	# Initialize variables.
+	wordspyfile = None
+	fastafile = None
+	negfile = None
+	pval = 0.01
+	
+	# Assign arguments to variables.
+	for o,a in opts:
+		if o in ('-h','--help'):
+			usage()
+			sys.exit()
+		elif o in ('-w','--wordspyfile'):
+			wordspyfile = a
+		elif o in ('-f','--fastafile'):
+			fastafile = a
+		elif o in ('-n', '--neg'): # file holding sequences of negative class
+			negfile = a
+		elif o in ('-p','--pvalue'):
+			pval = float(a)
+		else:
+			usage()
+			sys.exit()
+	
+	# Check variables
+	if wordspyfile is None or fastafile is None or not isinstance(pval, float):
+		usage()
+		sys.exit()
+		
+	# Execute
+	pfms, motifs = parse_wordspy_pfms(wordspyfile)
+	
+	with open('pfm_scorefile.txt','w') as outf:
+		## write the header first
+		outf.write('\t'.join(motifs+['Class'])+'\n')
+		## write scores positive and negative classes
+		seqs = read_fasta(fastafile)
+		res = score_sequences(seqs,motifs,pfms,pval)
+		outf.writelines( scores2str(res, seqs, 'pos') )
+		seqs = read_fasta(negfile)
+		res = score_sequences(seqs, motifs, pfms, pval)
+		outf.writelines( scores2str(res, seqs, 'neg') )
+	#write_scores(res,motifs,seqs, fastafile[:3])
+
+#!/usr/bin/env python
+"""
+Various utilities.
+"""
+
+from Bio import SeqIO
+import math
+from itertools import islice
+
+def extract_seq(seq_file, perc=0, count=0):
+    """
+    Extract some sequences from the front by counts or percentage.
+
+    Parameters:
+    -----------------
+    seq_file : The sequence file in FASTA format.
+    perc : The percentage of sequences that will be extracted.
+    count : The number of sequences that will be extracted.
+
+    Returns:
+    -----------------
+    A list of sequences
+    """
+
+    def by_perc(f):
+        assert perc <= 100, "The percentage cannot exceed 100%."
+        seq_lst = list(f)
+        cnt = len(seq_lst)
+        num = int((perc/100.)*cnt)
+        return seq_lst[:num]
+
+    def by_count(f):
+        return list(islice(f, count))
+    assert perc==0 or count==0, "Percentage and count cannot both be set."
+    with open(seq_file) as fl:
+        if perc > 0:
+            return by_perc(SeqIO.parse(fl, 'fasta'))
+        elif count > 0:
+            return by_count(SeqIO.parse(fl, 'fasta'))
+        else:
+            raise Exception("One of count and percentage must be positive.")
+            
+def write_seq(seqs, flnm):
+    """
+    Write sequences to a file in FASTA format.
+    """
+    with open(flnm, 'w') as fl:
+        SeqIO.write(seqs, fl, 'fasta')
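+# Usage sketch (hypothetical file names):
+#
+#     seqs = extract_seq('pos.fasta', perc=10)   # first 10% of the records
+#     write_seq(seqs, 'pos.subset.fasta')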

src/versionedFile.py

+#! /usr/bin/env python
+"""
+Classes dealing with versioned directories and file names for intermediate results.
+
+ver_files sets up a sequence of versioned file names, each of which is a fileName object.
+"""
+
+import os.path
+
+class opt_dir(object):
+    """
+    This class is the link between command-line options and setting up of the directory holding
+    intermediate results.
+
+    Given an option object, this class constructs a directory name reflecting the values of the
+    tunable parameters, creates that directory, and writes an empty __init__.py and options.py into it.
+    """
+    def __init__(self, opt):
+        from math import floor
+        self.dir_name = os.path.join(
+            os.path.abspath(opt.dest),
+            opt.base + '_m%d' % opt.wordspy_maxln + '_f%d' % floor(100*opt.frac)
+            + '_S%d' % floor(100*opt.Zscorepercent) + '_F%d' % floor(100*opt.Freqpercent)
+            + ('_Hmanual' if opt.manual else ''))
+        self.opt = opt
+    def makedir(self):
+        # check if the directory exists
+        if os.path.isdir(self.dir_name):
+            raise OSError("directory already exists:%s" % self.dir_name)
+        else:
+            os.mkdir(self.dir_name)
+        # put an empty init file and options.py to facilitate imports
+        from predict import predict
+        predict._predict__writeFile( os.path.join(self.dir_name,'__init__.py'),'' )
+        predict._predict__writeFile( os.path.join(self.dir_name, 'options.py'),self.opt.string )
+        
+    @staticmethod
+    def parse(dname):
+        """
+        parse a directory name constructed by opt_dir.
+
+        This function is put within opt_dir to centralize option directory handlings.
+        Obsolete because command-line options are now directly parsed to obtain options.
+        """
+        
+        dname = (dname[:-1] if dname[-1] == '/' else dname) # we don't want path.split to think everything is a directory
+        head, name = os.path.split(dname)
+        things = name.split('_')
+        opt = dict()
+        opt['base'] = things[0]
+        opt['manual'] = True if things[-1][0] == 'H' else False
+        return opt['base'], opt['manual']
+        
+        
+class fileName(object):
+    """An object of this class holds the basename and version for various derived file names.
+    For now only sequence files and wordspy.
+    """
+    def __init__(self, base, ver, opt):
+        """Taking basename and version as parameters.
+        The basename can have directory name as well.
+        """
+        
+        self.pos_seq = [os.path.join(base, 'pos' + x + '.%i'%ver + '.txt') for x in ('.part1','.part2')] # two file-names for partitioning positive sequence file
+        self.neg_seq = [os.path.join(base, 'neg' + x + '.%i'%ver + '.txt') for x in ('.part1','.part2')] # two file-names for partitioning negative sequence file
+        self.patFile = os.path.join(base, 'pattern.' + '%i' %ver +'.txt') # pattern file name for running fuznuc
+        self.score = os.path.join(base, 'score' + '.%i' % ver + '.txt')
+        self.fuzz = os.path.join(base, 'fuzz' + '.%i' % ver + '.txt')
+        self.base = base
+        self.ver = ver
+        self.options = opt
+
+    def wordspy_opts(self, opts):
+        """For wordspy, this method incorporates commandline options as part of the file name.
+        """
+        self.wordspy = os.path.join(self.base,'wordspy.%s' % opts + '.%i'%self.ver + '.txt')
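+# For example, fileName('tmp', 3, opt).score is 'tmp/score.3.txt', and after
+# calling wordspy_opts('r1.n1.m5') the wordspy attribute is
+# 'tmp/wordspy.r1.n1.m5.3.txt'.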
+
+class ver_files(object):
+    """This class generates a sequence of versioned file names.
+    """
+    def __init__(self, base,opt,start=0, stop = 1):
+        """The constructor takes basename and the start and end value of file version, so that
+        the file-name looks like 'basename.ver' where 'ver' is the version of the file-name.
+        
+        Note that this class follows the Python convention that the stop value
+        is not returned as a valid version number.
+        """
+        assert stop > start, 'the stop value must be greater than the start value'
+        assert isinstance(start, int) and start >= 0, 'the starting value must be a non-negative integer'
+        self.base = base
+        self.options = opt
+        self.ver = start
+        self.ter = stop
+        
+        
+    def next(self):
+        ver = self.ver
+        if ver < self.ter:
+            self.ver = self.ver + 1
+            return fileName(self.base, ver, self.options)
+        else:
+            raise StopIteration
+    def __iter__(self):
+        return self
+