Commits

Clint Howarth  committed 5fc38c0

improve packaging and documentation, migrate source control

  • Participants

Comments (0)

Files changed (52)

+syntax: glob
+
+genepidgin-docs
+
+dist/
+*egg-info*
+
+.gitignore
+.git
+*.elc
+*.pyc
+*.swp
+**/.git/**
+**/.svn/**
+*~
+.*~
+*\#
+*.DS_Store
+See docs/credits.rst
+v1.1
+Added support for blast -m 8 format. See the revised Input section of the documentation.
+
+v1.01
+Added floor of 0.0 for evalue for hmmer results. Score should come into play more often for good matches.
+
+v1.00
+Initial release.
+See docs/changes.rst
+genepidgin Copyright (c) 2012, Broad Institute. All rights reserved.
+
+(this is a BSD-style license)
+
+Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
+
+Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
+
+Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
+
+Neither the name of the Broad Institute nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE BROAD INSTITUTE ''AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE BROAD INSTITUTE BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+short answer: BSD
+
+see docs/license.rst
+include AUTHORS.txt
+include CHANGES.txt
+include LICENSE.txt
+include README.rst
+recursive-include genepidgin/test/data *
+==========
+GENEPIDGIN
+==========
+
+Genepidgin is a suite of tools that assist in the evaluation and assignment of gene product names. There are three primary components:
+
+``genepidgin cleaner``
+    standardizes gene names per UniProt naming guidelines
+``genepidgin compare``
+    compares two or more sets of gene names
+``genepidgin select``
+    selects the most appropriate product name from a variety of homology evidence
+
+For more information, including current development status, please visit `genepidgin's documentation`_.
+
+.. _genepidgin's documentation: http://genepidgin.readthedocs.org/
+

File docs/Makefile

+# Makefile for Sphinx documentation
+#
+
+# You can set these variables from the command line.
+SPHINXOPTS    =
+SPHINXBUILD   = sphinx-build
+PAPER         =
+BUILDDIR      = ../genepidgin-docs
+
+# Internal variables.
+PAPEROPT_a4     = -D latex_paper_size=a4
+PAPEROPT_letter = -D latex_paper_size=letter
+ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
+# the i18n builder cannot share the environment and doctrees with the others
+I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
+
+.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man texinfo info changes linkcheck doctest gettext
+
+help:
+	@echo "Please use \`make <target>' where <target> is one of"
+	@echo "  html       to make standalone HTML files"
+	@echo "  dirhtml    to make HTML files named index.html in directories"
+	@echo "  singlehtml to make a single large HTML file"
+	@echo "  pickle     to make pickle files"
+	@echo "  json       to make JSON files"
+	@echo "  htmlhelp   to make HTML files and a HTML help project"
+	@echo "  qthelp     to make HTML files and a qthelp project"
+	@echo "  devhelp    to make HTML files and a Devhelp project"
+	@echo "  epub       to make an epub"
+	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
+	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
+	@echo "  text       to make text files"
+	@echo "  man        to make manual pages"
+	@echo "  texinfo    to make Texinfo files"
+	@echo "  info       to make Texinfo files and run them through makeinfo"
+	@echo "  gettext    to make PO message catalogs"
+	@echo "  changes    to make an overview of all changed/added/deprecated items"
+	@echo "  linkcheck  to check all external links for integrity"
+	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
+
+clean:
+	-rm -rf $(BUILDDIR)/*
+
+html:
+	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
+	@echo
+	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."
+
+dirhtml:
+	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
+	@echo
+	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."
+
+singlehtml:
+	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
+	@echo
+	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."
+
+pickle:
+	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
+	@echo
+	@echo "Build finished; now you can process the pickle files."
+
+json:
+	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
+	@echo
+	@echo "Build finished; now you can process the JSON files."
+
+htmlhelp:
+	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
+	@echo
+	@echo "Build finished; now you can run HTML Help Workshop with the" \
+	      ".hhp project file in $(BUILDDIR)/htmlhelp."
+
+qthelp:
+	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
+	@echo
+	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
+	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
+	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/Genepidgin.qhcp"
+	@echo "To view the help file:"
+	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/Genepidgin.qhc"
+
+devhelp:
+	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
+	@echo
+	@echo "Build finished."
+	@echo "To view the help file:"
+	@echo "# mkdir -p $$HOME/.local/share/devhelp/Genepidgin"
+	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/Genepidgin"
+	@echo "# devhelp"
+
+epub:
+	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
+	@echo
+	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."
+
+latex:
+	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
+	@echo
+	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
+	@echo "Run \`make' in that directory to run these through (pdf)latex" \
+	      "(use \`make latexpdf' here to do that automatically)."
+
+latexpdf:
+	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
+	@echo "Running LaTeX files through pdflatex..."
+	$(MAKE) -C $(BUILDDIR)/latex all-pdf
+	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."
+
+text:
+	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
+	@echo
+	@echo "Build finished. The text files are in $(BUILDDIR)/text."
+
+man:
+	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
+	@echo
+	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."
+
+texinfo:
+	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
+	@echo
+	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
+	@echo "Run \`make' in that directory to run these through makeinfo" \
+	      "(use \`make info' here to do that automatically)."
+
+info:
+	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
+	@echo "Running Texinfo files through makeinfo..."
+	make -C $(BUILDDIR)/texinfo info
+	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."
+
+gettext:
+	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
+	@echo
+	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."
+
+changes:
+	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
+	@echo
+	@echo "The overview file is in $(BUILDDIR)/changes."
+
+linkcheck:
+	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
+	@echo
+	@echo "Link check complete; look for any errors in the above output " \
+	      "or in $(BUILDDIR)/linkcheck/output.txt."
+
+doctest:
+	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
+	@echo "Testing of doctests in the sources finished, look at the " \
+	      "results in $(BUILDDIR)/doctest/output.txt."

File docs/changes.rst

+=======
+Changes
+=======
+
+1.1
+  revised initial release, better python structure, packaging, and documentation
+
+1.0
+  initial public release

File docs/cleaner.rst

+Genepidgin *cleaner*
+====================
+
+**Note**: this logic in particular is from a different era of computing, and you'd almost certainly be better off with a homology-derived name from a tightly governed protein library than with a loose alignment against a less controlled one. Use GO.
+
+Genepidgin *cleaner* standardizes the format of gene product names derived from diverse databases, including FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM. It's the product of many years of production genome annotation.
+
+This software package consists of a large collection of heuristics, formatting rules and regular expressions which are designed to take a name from any of Genepidgin's supported databases and present it in a common style. Though our regexp library is large, it is not infinite; thus, Genepidgin *cleaner* cannot detect every possible name error. However, the vast majority of source names end up better and more informative for having gone through Genepidgin *cleaner*.
+
+Goals
+-----
+
+- Names should agree with the prevailing conventions in cases where such conventions can be easily identified and agreed upon.
+- Names should be as clear and concise as possible.
+- Names should not be descriptive phrases that define function (for example, "protein involved in folding" is not useful, but "chaperonin" is).
+- Names should not include programmatic references.
+- Names should be derived from high-confidence alignments to homologous proteins. Names generated by Genepidgin, once deposited in public databases, may themselves be used as a basis to name other genes transitively. To prevent the propagation of incorrect product names, only high-confidence alignments should be used for naming.
+- Prefer no name or an obviously generic name to an uninformative name.
+- Prefer lowercase words in everything but acronyms and proper names.
+- Prefer a standardized expression of common protein names.
+- Prefer American English spelling.
+- Use only 7-bit ASCII characters, so that names render correctly on every computing platform.
+
+Steps in Filtering Process
+--------------------------
+
+The following list is a rough description of the steps involved in processing a name. This list is not a literal description of the layout of the code, but rather a high-level overview of how Genepidgin *cleaner* works.
+
+whole name filtering and deletion
+    Sometimes, names are published into the global protein namespace
+    that are obviously the output of a malformed SQL query or
+    accidentally copied Excel spreadsheet. We process these before doing
+    anything else, extracting useful information when possible.
+typo correction
+    People misspell (for example) *hypothetical* and *transporter* in
+    many, many ways. Correcting these names early prevents later filters
+    from missing human-obvious corrections.
+uninformative clause removal
+    Subclauses that are globally uninformative are removed. For example,
+    documented proteins should not have their functions described within
+    their names, so phrases like "X involved with Y" simply become "X".
+clause replacement
+    The largest transformations happen here, where names are changed to
+    become more consistent. For example, the phrase "transport family
+    protein" becomes "transporter".
+organism names
+    The vast majority of the time, specific organism names are not
+    informative when copied across species by homology or alignment. We
+    remove them.
+id removal
+    Many published genes have obvious database ids. Genepidgin does not
+    transitively assign these to new gene annotations.
+punctuation cleanup
+    Removing ids and other phrases often leaves bad punctuation and/or
+    leftover parentheses, which then must be themselves removed.
+standardize format
+    The grammatical structures of product names are improved late in the
+    cleaner process. This category assumes that by this point, the name
+    is a keeper, and simply reformats it for consistent presentation.
+final sanity check
+    If, after filtering, the entire name is otherwise uninformative such
+    as "CDS" or "small secreted protein", then the name is misleading
+    and will be dropped.
+capitalization
+    Finally, Genepidgin tries to establish consistent capitalization: only
+    proper names and acronyms are capitalized.
+
+How to Use Genepidgin *cleaner*
+-------------------------------
+
+All files used as input and output are in the `Simple Name File Format <#simple>`_.
+
+via the command-line
+~~~~~~~~~~~~~~~~~~~~
+
+- ``cleaner`` takes a name and applies the full list of filters to it. A name can be filtered to an empty string by this function; the output of the command will tell you why. Names that are filtered to nothing are ones Genepidgin considers uninformative.
+
+   ::
+
+       $ genepidgin cleaner <inputfile>
+
+Setting the ``-d`` flag indicates that genepidgin should return a default name (``"hypothetical protein"``) when a name would otherwise be blank.
+
+usage doc
+^^^^^^^^^
+
+::
+
+    $ genepidgin cleaner -h
+
+    usage: cmdline.py cleaner [-h] [--silent] [--default] input output
+
+    positional arguments:
+      input          filename with names to clean
+      output         output file
+
+    optional arguments:
+      -h, --help     show this help message and exit
+      --silent, -s   suppress etymology output to stdout during compute
+      --default, -d  return default name "hypothetical protein" when names filter
+                     to nothing, else return empty string
+
+via Python
+~~~~~~~~~~
+
+From inside your python shell, let's set up your first test case.
+
+::
+
+    >>> import pidgin.cleaner
+    >>> bname = pidgin.cleaner.BioName()
+    >>> name = "BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]"
+
+Instantiating ``BioName`` compiles a couple hundred regular expressions, so creating a new ``BioName`` object for every name to be changed can get expensive. A single ``BioName`` object can reformat any number of names, so callers need only instantiate the class once.
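+
+Since a single instance is reusable, a typical pattern (a minimal sketch, reusing the ``bname`` object created above; the input names are illustrative) is:
+
+::
+
+    >>> more_names = ["putative ABC transporter BT002690",
+    ...               "conserved hypothetical protein"]
+    >>> cleaned = [bname.filter(n) for n in more_names]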
+
+Under the hood, ``cleaner`` calls on either ``filter`` or ``cleanup``. When everyone had different default names, this distinction was more meaningful, but now everyone follows UniProt's *hypothetical protein* standard.
+
+This name contains a great deal of spurious and unreliable information. A quick pass through ``cleaner``...
+
+::
+
+    >>> cleaned = bname.filter(name)
+    >>> print cleaned
+    "glycine/betaine/L-proline ABC transporter"
+
+To see what happened during the filter process, we set ``getOutput`` to true when we call ``filter``. Note the additional returned value.
+
+::
+
+    >>> (cleaned, process_string) = bname.filter(name, getOutput=1)
+    >>> print cleaned
+    "glycine/betaine/L-proline ABC transporter"
+    >>> print process_string
+    filtered name in 5 steps:
+    0) original: BT002689 glycine/betaine/L-proline ABC transport protein, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
+    1)   reason: transport protein -> transporter
+        pattern: \btransport(er)?\s+protein\b
+       filtered: BT002689 glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
+    2)   reason: id
+        pattern: \b[A-Za-z0-9]+\d{4,}(?<!\b(?:DUF|UPF)\d{4})\b(?!\s*(kD(a)?|-like|family|protein\s+family))
+       filtered:  glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
+    3)   reason: delete spaces at beginning of name
+        pattern: ^\s+
+       filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein [Desulfovibrio desulfuricans subsp. desulfuricans str. G20]
+    4)   reason: delete closing brackets at end of name
+        pattern: (?:\[[^]]*)\]\s*$
+       filtered: glycine/betaine/L-proline ABC transporter, periplasmic-binding protein
+    5)   reason: delete notes after commas, dashes, semicolon--except when followed by family or superfamily
+        pattern: [-,;]\s+(?!family)(?!superfamily).*
+       filtered: glycine/betaine/L-proline ABC transporter
+
+(Note that ``process_string`` is a single multiline string, which looks good when ``print``'ed but bad when simply exported.)
+
+See the documentation in the code for more information on parameters; it's fairly well commented, if not always clear.
+
+.. note:: Please see :doc:`credits` for contributor information.
+

File docs/compare.rst

+Genepidgin *compare*
+====================
+
+Genepidgin *compare* uses a combination of edit distance and longest-common-substring calculations to estimate the degree of similarity between two or more protein names.
+
+Algorithm
+---------
+
+To compare two names, we
+
+#. decompose each name into tokens,
+#. remove uninformative tokens,
+#. rearrange the tokens in such a way as to...
+
+   -  minimize the edit distance between them, and
+   -  maximize the length of common token substrings
+
+#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
+
+In more detail:
+
+#. decompose each name into tokens
+
+   First, we split the names up by spaces, remove EC numbers, punctuation, and other extraneous characters, convert everything to lowercase, etc.
+
+   **in:** "Ribosomal protein, S23-type"
+    **out:** "ribosomal" · "protein" · "s23-type"
+
+#. remove uninformative tokens
+
+   In this step we strike out words that are only useful in a grammatical sense, including *an, and, in, is, of, the,* etc. We also remove weasel words, such as *generic, hypothetical, related,* etc.  Finally, we remove glue words, such as *associated, class, component, protein, system,* and *type.* When these words are stripped we are left with a "core" name that identifies the protein; different namers may use different glue words to format the core name and we ignore those.
+
+   **in:** "ribosomal" · "protein" · "s23-type"
+    **out:** "ribosomal" · "s23"
+
+   Because we strip out noninformative tokens, we count all of the following strings as equal.
+
+   -  "predicted protein"
+   -  "putative protein"
+   -  "hypothetical protein"
+   -  "conserved hypothetical protein"
+
+#. rearrange the tokens in such a way as to ...
+
+   Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it's possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact there are cases where the lowest distance is composed entirely of sub-optimal token pairings. So we need to try a lot of combinations. To do this we precompute two scores for each pair of tokens, and build two *n* × *n* matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens.
+
+   In one matrix, we store the pairwise token-token edit distance, using the `Damerau-Levenshtein distance <http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance>`_, leveraging the excellent Python implementation by `Michael Homer <http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/>`_.  We normalize the edit distance by dividing it by the number of characters in the longer token. The other *n* × *n* matrix holds the length of the longest common substring between each pair of tokens.  Our LCS finder is similar to the one published on `Wikipedia <http://en.wikipedia.org/wiki/Longest_common_substring>`_.
+
+   In the case where the protein names have different numbers of tokens, we build square matrices from the larger dimension, padding the shorter dimension with empty tokens. There are also heuristics to handle cases where a token in one name is composed of two or more tokens in the other. The handling of these special cases is too detailed for this document; see the source or `contact <contributing>`_ the authors for details.
+
+   Note that token order has no effect on the distance between two names.
+
+#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
+
+   A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well on genome-scale scoring studies across a variety of prokaryotes.
+
+   ::
+
+       "Genepidgin" distance =
+           SUM(per-token normalized edit distance) *
+           (1 - (SUM(per-token LCS length) / LENGTH(longer name))) *
+           (1 / COUNT(compared tokens))
+
+   The first line of this distance metric weights each pair of tokens equally. Thus a "SecG" · "SecG" match counts just as much as a "phosphoribosylglycinamide" · "phosphoribosylglycinamide" match.
+
+   The second line of the metric weights each character equally, thereby lowering the distances between long tokens that differ only slightly, for example
+
+   ::
+
+       2,3,4,5-tetrahydropyridine-2,6-dicarboxylate
+       2,3,4,5-tetrahydropyridine-2-carboxylate
+
+   The third line of the distance metric above simply normalizes the score from 0 to 1. A distance of 0 indicates the names have identical information content and are essentially equivalent. A distance of 1 indicates the names have nothing in common.
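+
+   As a minimal sketch of this final step (assuming the per-token normalized edit distances and LCS lengths have already been computed; the token-pairing search described above is omitted):
+
+   ::
+
+       def pidgin_distance(edit_dists, lcs_lens, longer_name_len):
+           # edit_dists: normalized Damerau-Levenshtein distance per token pair
+           # lcs_lens: longest-common-substring length per token pair
+           n = len(edit_dists)  # number of compared token pairs
+           return (sum(edit_dists)
+                   * (1 - float(sum(lcs_lens)) / longer_name_len)
+                   * (1.0 / n))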
+
+How to use Genepidgin *compare*
+-------------------------------
+
+Given at least two input files, one reference and one or more queries, score the distance (using ``genepidgin.distance.DistanceTool()``) between the names found in the files.
+
+::
+
+    genepidgin compare (options) <reference_file> <query_file> [<query_file2> ...]
+
+    options:
+      --help: this information
+
+All input files must be in the `Simple Name File Format <#simple>`_.
+
+This tool will create one output file per query file. The per-query output file(s) will have name(s) of the form ``<query_file>.compared``.
+
+If there are multiple query files, a summary file containing the closest query match for each reference name will also be created. The summary file will be named ``<reference_file>.summary``.
+
+Each line in the two-way comparison result will consist of the following tab-separated fields:
+
+::
+
+    0.  ID. This is the string from the first field of the entry from the reference file.
+    1.  Score. The distance between the two names.
+    2.  Reference name. The reference name used for the comparison.
+    3.  Query name. The query name used for the comparison.
+
+If a summary file is generated, each line in that file will consist of the following tab-separated fields:
+
+::
+
+    0.  ID. This is the string from the first field of the entry from the reference file.
+    1.  Score. The distance between the two names.
+    2.  Reference name. The reference name used for the comparison.
+    3.  Best query name. The best matching query name.  In cases where multiple query names scored identically, the first name with that score will appear here. (This will typically only happen for completely dissimilar names)
+    4.  Best query source. The basename of the file which held the best query name. (ex: query_file1) In cases where multiple query names scored identically, multiple basenames will be present in this column, separated by semicolons. (ex: query_file1;query_file2)
+
+Results are presented in the same order as in the input reference file.  Names in query files that correspond to an ID not present in the reference file will be ignored. Names in the reference file with no corresponding query are scored as a complete miss (1.0). Input query and reference files may reside in any directory, but no two files may have the same basename.
+
+Genepidgin *compare* Score Range
+--------------------------------
+
+The distribution of accuracy is not linear between 0.0 and 1.0; that is, past a certain level of dissimilarity it doesn't matter how much more dissimilar two names are.
+
+The following table presents a quick guide to the interpretation of distance scores.
+
++-------------+------------------------------------------------------------+
+| score       | likelihood of functional match                             |
++=============+============================================================+
+| =0.0        | functionally identical                                     |
++-------------+------------------------------------------------------------+
+| 0.0 - 0.1   | excellent match                                            |
++-------------+------------------------------------------------------------+
+| 0.1 - 0.3   | good match                                                 |
++-------------+------------------------------------------------------------+
+| 0.3 - 0.5   | possibly similar, with potentially significant distances   |
++-------------+------------------------------------------------------------+
+| 0.5 - 1.0   | not generally useful                                       |
++-------------+------------------------------------------------------------+
+| =1.0        | completely different                                       |
++-------------+------------------------------------------------------------+
+
+There is support for using the output of Genepidgin *compare* directly within Python; consult ``genepidgin/scorer.py`` for details.
+
+.. note:: Please see :doc:`credits` for contributor information.
+

File docs/conf.py

+# -*- coding: utf-8 -*-
+#
+# Genepidgin documentation build configuration file, created by
+# sphinx-quickstart on Sun Jul  8 18:35:52 2012.
+#
+# This file is execfile()d with the current directory set to its containing dir.
+#
+# Note that not all possible configuration values are present in this
+# autogenerated file.
+#
+# All configuration values have a default; values that are commented out
+# serve to show the default.
+
+import sys, os
+
+# If extensions (or modules to document with autodoc) are in another directory,
+# add these directories to sys.path here. If the directory is relative to the
+# documentation root, use os.path.abspath to make it absolute, like shown here.
+#sys.path.insert(0, os.path.abspath('.'))
+
+# -- General configuration -----------------------------------------------------
+
+# If your documentation needs a minimal Sphinx version, state it here.
+#needs_sphinx = '1.0'
+
+# Add any Sphinx extension module names here, as strings. They can be extensions
+# coming with Sphinx (named 'sphinx.ext.*') or your custom ones.
+extensions = []
+
+# Add any paths that contain templates here, relative to this directory.
+templates_path = ['_templates']
+
+# The suffix of source filenames.
+source_suffix = '.rst'
+
+# The encoding of source files.
+#source_encoding = 'utf-8-sig'
+
+# The master toctree document.
+master_doc = 'index'
+
+# General information about the project.
+project = u'Genepidgin'
+copyright = u'2012, Clint Howarth'
+
+# The version info for the project you're documenting, acts as replacement for
+# |version| and |release|, also used in various other places throughout the
+# built documents.
+#
+# The short X.Y version.
+
+sys.path.insert(0, os.path.abspath('../'))
+from genepidgin import version
+# The full version, including alpha/beta/rc tags.
+release = version
+
+# The language for content autogenerated by Sphinx. Refer to documentation
+# for a list of supported languages.
+#language = None
+
+# There are two options for replacing |today|: either, you set today to some
+# non-false value, then it is used:
+#today = ''
+# Else, today_fmt is used as the format for a strftime call.
+#today_fmt = '%B %d, %Y'
+
+# List of patterns, relative to source directory, that match files and
+# directories to ignore when looking for source files.
+exclude_patterns = ['_build']
+
+# The reST default role (used for this markup: `text`) to use for all documents.
+#default_role = None
+
+# If true, '()' will be appended to :func: etc. cross-reference text.
+#add_function_parentheses = True
+
+# If true, the current module name will be prepended to all description
+# unit titles (such as .. function::).
+#add_module_names = True
+
+# If true, sectionauthor and moduleauthor directives will be shown in the
+# output. They are ignored by default.
+#show_authors = False
+
+# The name of the Pygments (syntax highlighting) style to use.
+pygments_style = 'sphinx'
+
+# A list of ignored prefixes for module index sorting.
+#modindex_common_prefix = []
+
+
+# -- Options for HTML output ---------------------------------------------------
+
+# The theme to use for HTML and HTML Help pages.  See the documentation for
+# a list of builtin themes.
+html_theme = 'default'
+
+# Theme options are theme-specific and customize the look and feel of a theme
+# further.  For a list of options available for each theme, see the
+# documentation.
+#html_theme_options = {}
+
+# Add any paths that contain custom themes here, relative to this directory.
+#html_theme_path = []
+
+# The name for this set of Sphinx documents.  If None, it defaults to
+# "<project> v<release> documentation".
+#html_title = None
+
+# A shorter title for the navigation bar.  Default is the same as html_title.
+#html_short_title = None
+
+# The name of an image file (relative to this directory) to place at the top
+# of the sidebar.
+#html_logo = None
+
+# The name of an image file (within the static path) to use as favicon of the
+# docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
+# pixels large.
+#html_favicon = None
+
+# Add any paths that contain custom static files (such as style sheets) here,
+# relative to this directory. They are copied after the builtin static files,
+# so a file named "default.css" will overwrite the builtin "default.css".
+html_static_path = ['_static']
+
+# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
+# using the given strftime format.
+#html_last_updated_fmt = '%b %d, %Y'
+
+# If true, SmartyPants will be used to convert quotes and dashes to
+# typographically correct entities.
+#html_use_smartypants = True
+
+# Custom sidebar templates, maps document names to template names.
+#html_sidebars = {}
+
+# Additional templates that should be rendered to pages, maps page names to
+# template names.
+#html_additional_pages = {}
+
+# If false, no module index is generated.
+#html_domain_indices = True
+
+# If false, no index is generated.
+#html_use_index = True
+
+# If true, the index is split into individual pages for each letter.
+#html_split_index = False
+
+# If true, links to the reST sources are added to the pages.
+#html_show_sourcelink = True
+
+# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
+#html_show_sphinx = True
+
+# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
+#html_show_copyright = True
+
+# If true, an OpenSearch description file will be output, and all pages will
+# contain a <link> tag referring to it.  The value of this option must be the
+# base URL from which the finished HTML is served.
+#html_use_opensearch = ''
+
+# This is the file name suffix for HTML files (e.g. ".xhtml").
+#html_file_suffix = None
+
+# Output file base name for HTML help builder.
+htmlhelp_basename = 'Genepidgindoc'
+
+
+# -- Options for LaTeX output --------------------------------------------------
+
+latex_elements = {
+# The paper size ('letterpaper' or 'a4paper').
+#'papersize': 'letterpaper',
+
+# The font size ('10pt', '11pt' or '12pt').
+#'pointsize': '10pt',
+
+# Additional stuff for the LaTeX preamble.
+#'preamble': '',
+}
+
+# Grouping the document tree into LaTeX files. List of tuples
+# (source start file, target name, title, author, documentclass [howto/manual]).
+latex_documents = [
+  ('index', 'Genepidgin.tex', u'Genepidgin Documentation',
+   u'Clint Howarth', 'manual'),
+]
+
+# The name of an image file (relative to this directory) to place at the top of
+# the title page.
+#latex_logo = None
+
+# For "manual" documents, if this is true, then toplevel headings are parts,
+# not chapters.
+#latex_use_parts = False
+
+# If true, show page references after internal links.
+#latex_show_pagerefs = False
+
+# If true, show URL addresses after external links.
+#latex_show_urls = False
+
+# Documents to append as an appendix to all manuals.
+#latex_appendices = []
+
+# If false, no module index is generated.
+#latex_domain_indices = True
+
+
+# -- Options for manual page output --------------------------------------------
+
+# One entry per manual page. List of tuples
+# (source start file, name, description, authors, manual section).
+man_pages = [
+    ('index', 'genepidgin', u'Genepidgin Documentation',
+     [u'Clint Howarth'], 1)
+]
+
+# If true, show URL addresses after external links.
+#man_show_urls = False
+
+
+# -- Options for Texinfo output ------------------------------------------------
+
+# Grouping the document tree into Texinfo files. List of tuples
+# (source start file, target name, title, author,
+#  dir menu entry, description, category)
+texinfo_documents = [
+  ('index', 'Genepidgin', u'Genepidgin Documentation',
+   u'Clint Howarth', 'Genepidgin', 'Tools for evaluating and assigning gene product names.',
+   'Miscellaneous'),
+]
+
+# Documents to append as an appendix to all manuals.
+#texinfo_appendices = []
+
+# If false, no module index is generated.
+#texinfo_domain_indices = True
+
+# How to display URL addresses: 'footnote', 'no', or 'inline'.
+#texinfo_show_urls = 'footnote'

File docs/credits.rst

+Credits
+-------
+
+Genepidgin was written by Clint Howarth and Matthew Pearson. Many people have contributed to the project:
+
+cleaner
+=======
+
+The design of ``genepidgin cleaner`` grew out of years of suggestions from many people, including annotators who have worked in Genome Annotation in the Microbial Sequencing Platform at the Broad Institute. It was implemented by Clint Howarth and Matthew Pearson.
+
+Many people have contributed to the name cleaning logic, including: Lucia Alvarado-Balderrama¹, Sinead Chapman¹, Zehua Chen¹, Jonathan Goldberg¹, Sharvari Gujja¹, Clint Howarth¹, Chinnappa Kodira², Teena Mehta¹, Matthew Pearson¹, Narmada Shenoy¹, Tom Walk¹, Chandri Yandava¹, Qiandong Zeng¹, and the Autoannotate development team³.
+
+¹ `Broad Institute <http://www.broadinstitute.org>`_
+² `454 Life Sciences <http://www.454.com/>`_
+³ `J. Craig Venter Institute <http://www.jcvi.org/>`_
+
+
+compare
+=======
+
+``genepidgin compare`` was designed and implemented by Matthew Pearson.
+
+It includes an open-source implementation of the `Damerau-Levenshtein distance`_ written by `Michael Homer`_.
+
+.. _`Damerau-Levenshtein distance`: http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
+.. _`Michael Homer`: http://mwh.geek.nz/2009/04/26/python-damerau-levenshtein-distance/
+
+select
+======
+
+``genepidgin select`` was designed by Sharvari Gujja, Brian Haas, Clint Howarth, Matthew Pearson, and Qiandong Zeng. Clint Howarth implemented it.
+
+Special Thanks
+==============
+
+Finally, thanks to the `Autoannotate`_ development team at `JCVI`_, who were kind enough to share the source code of their naming utility with us.  Seeing how hard their institute worked to reformat names motivated us to release and document our own naming logic.
+
+.. _`JCVI`: http://www.jcvi.org
+.. _`Autoannotate`: http://sourceforge.net/projects/prokfunautoanno/
+
+Project Name History
+====================
+
+This project began life as BioName. It turns out that there is already a project named `Bioname`_. Though that Bioname addresses a completely different problem, our goal is to help reduce name-related confusion, so we decided to change the name of our software toolkit to Pidgin. We retain the term BioName as an internal class name for source compatibility. We are aware that there is an IM chat client called `Pidgin`_, and even though it's completely unrelated to gene naming, some people found this confusing. This project is now ``Genepidgin``, and that's that.
+
+We would like to take this opportunity to point out that naming is a challenging problem, on many levels. We apologize for any confusion.
+
+.. _`Pidgin`: http://www.pidgin.im/
+.. _`Bioname`: http://bioname.org
+

File docs/index.rst

+==========
+Genepidgin
+==========
+
+Genepidgin is a suite of tools that assist in the evaluation and assignment of gene product names. There are three primary components:
+
+:doc:`cleaner`
+    standardizes gene names per UniProt naming guidelines
+:doc:`compare`
+    compares two or more sets of gene names
+:doc:`select`
+    selects the most appropriate product name from a variety of homology evidence
+
+``genepidgin`` is developed and maintained by engineers and biologists at the `Broad Institute <http://www.broadinstitute.org>`_. Suggestions are welcome; we can be reached at ``pidgin-support at broadinstitute dot org``.
+
+Development Status
+------------------
+
+.. warning:: *This code is not under active development, and there are better ways of doing this.* When we started this project, well-defined ontology sets were scarce; there are now enough around that this approach is relatively antiquated. Nowadays, you're almost certainly better off with EC lookups, GO terms, and similar, more direct methods.
+
+Contents
+--------
+
+.. toctree::
+   :maxdepth: 2
+   
+   setup
+   cleaner
+   compare
+   select
+   credits
+   changes
+   simple_name_file_format
+   license

File docs/license.rst

+License Information
+-------------------
+
+Genepidgin is offered under the BSD license.
+
+::
+
+    #
+    # Copyright (c) 2009 The Broad Institute, Inc. All rights reserved.
+    #
+    # Redistribution and use in source and binary forms, with or without
+    # modification, are permitted provided that the following conditions
+    # are met:
+    #
+    # Redistributions of source code must retain the above copyright notice,
+    # this list of conditions and the following disclaimer.
+    #
+    # Redistributions in binary form must reproduce the above copyright
+    # notice, this list of conditions and the following disclaimer in the
+    # documentation and/or other materials provided with the distribution.
+    #
+    # Neither the name of the Broad Institute nor the names of its
+    # contributors may be used to endorse or promote products derived from
+    # this software without specific prior written permission.
+    #
+    # THIS SOFTWARE IS PROVIDED BY THE BROAD INSTITUTE ''AS IS'' AND ANY
+    # EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
+    # IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
+    # PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE BROAD INSTITUTE BE
+    # LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
+    # CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
+    # SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR
+    # BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY,
+    # WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE
+    # OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE,
+    # EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+    #

File docs/select.rst

+Genepidgin *select*
+===================
+
+Goals
+-----
+
+Genepidgin *select* generates gene product names from alignments to proteins in curated libraries (currently FIGfam, KEGG, Pfam, RefSeq, SwissProt and TIGRFAM). Blast and hmmer alignments from those libraries are read into Genepidgin via simple data formats (`.pidginb`_ and `.pidginh`_, respectively) and sifted to find the best name.
+
+Selection Recipe
+----------------
+
+Summary
+~~~~~~~
+
+Sort qualifying sources, preferring: hmmer alignments to blast alignments, a lower e-value in hmmer hits, and a higher percent identity in blast hits. Walk through the sorted list until we find a name that remains informative after running through :doc:`cleaner`.
+
+Details
+~~~~~~~
+
+Group all evidence by ``dest_id`` and consider each ``dest_id`` independently.
+
+Over the course of this search, if a name filters to something uninformative (via :doc:`cleaner`), then examine the next relevant source, until either a valid source and name are found, or no sources remain and the name "hypothetical protein" is assigned.
+
+Start by examining the hmmer hits. Remove hits that are neither TIGRFAM equivalogs nor Pfam hits labeled as equivalog-equivalents by JCVI. Next, remove any hit whose score is less than its ``family_trusted_cutoff`` (see `.pidginh`_). Take the name of the hit with the lowest e-value; if multiple hits have equivalent e-values, select the one with the highest bit score.
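+
+In list terms, that sort looks like the following minimal sketch (``hits`` is assumed to be a list of objects carrying the ``.pidginh`` fields described below; the equivalog screen is omitted):
+
+::
+
+    qualified = [h for h in hits if h.score >= h.family_trusted_cutoff]
+    # lowest e-value first; ties broken by the higher bit score
+    qualified.sort(key=lambda h: (h.e_value, -h.score))
+    best = qualified[0] if qualified else None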
+
+If a ``dest_id`` has no hmmer hits deemed suitable for naming, examine the blast evidence (see `.pidginb`_ or `.blastm8`_), calculating the following terms:
+
+::
+
+    source_coverage = (source_stop - source_start + 1) / source_len
+    dest_coverage = (dest_stop - dest_start + 1) / dest_len
+    min_coverage = min(source_coverage, dest_coverage)
+
+    source_pct_identity = num_identities / source_len
+    dest_pct_identity = num_identities / dest_len
+    min_pct_identity = min(source_pct_identity, dest_pct_identity)
+
+    upper_pct_identity = max(min_pct_identity for all hits whose min_coverage ≥ 0.6)
+    lower_pct_identity = max(0.5, upper_pct_identity - 0.05)
+
+Cluster all hits associated with ``dest_id`` that have ``min_coverage`` ≥ 0.6 and whose ``min_pct_identity`` is between ``upper_pct_identity`` and ``lower_pct_identity`` (inclusive). If ``upper_pct_identity`` < ``lower_pct_identity``, ignore all hits.
+
+If the cluster is not empty, and any of the hits in the cluster has a ``source_auth`` (see `.pidginb`_) of KEGG, then select the name from the one with the highest ``min_pct_identity``. If there are no hits from KEGG, proceed to SwissProt hits, then FIGfam and finally RefSeq, searching in each bin for the hit with the highest ``min_pct_identity`` within that bin.
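+
+A minimal sketch of this window-and-priority step, assuming each hit carries precomputed ``min_coverage`` and ``min_pct_identity`` values (hypothetical attribute names; the underlying ``.pidginb`` fields are defined below):
+
+::
+
+    PRIORITY = ["KEGG", "SwissProt", "FIGfam", "RefSeq"]
+
+    def cluster_hits(hits):
+        covered = [h for h in hits if h.min_coverage >= 0.6]
+        if not covered:
+            return []
+        upper = max(h.min_pct_identity for h in covered)
+        lower = max(0.5, upper - 0.05)
+        if upper < lower:  # upper fell below the 0.5 floor; ignore all hits
+            return []
+        return [h for h in covered if lower <= h.min_pct_identity <= upper]
+
+    def best_hit(cluster):
+        # walk the source bins in priority order, taking the best identity
+        for auth in PRIORITY:
+            bin_ = [h for h in cluster if h.source_auth == auth]
+            if bin_:
+                return max(bin_, key=lambda h: h.min_pct_identity)
+        return None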
+
+Usage
+-----
+
+Given a series of data files, use the selection recipe described above to determine product names for the given genes.
+
+::
+
+    genepidgin select (options) [inputfiles]
+
+    options:
+      -o --output    : where to save files, defaults to ./pidgin_names.txt
+      -e --etymology : where to save etymology (debug), defaults to ./pidgin_etymology.txt
+      -h --help      : this information
+
+The format of Input and Output files are described below.
+
+Input
+-----
+
+Any number of input files in any of the following three formats may be supplied. The ordering of the files, and the ordering of the lines within the files, does not matter. No tabs, newlines, or control characters are permitted in any of these fields.
+
+``.pidginb``
+~~~~~~~~~~~~
+
+All files with the extension ``.pidginb`` are assumed to contain BLAST alignments.
+
+Each line in a ``.pidginb`` file will consist of the following tab-separated fields:
+
+::
+
+    0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
+        should receive a name)
+    1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
+        protein
+    2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
+        protein
+    3.  dest_len INTEGER number of amino acids in destination protein
+    4.  source_id STRING an identifier for a source protein (i.e., a protein whose
+        name should be considered for assignment to the destination protein)
+    5.  source_start INTEGER 1-based index of first aligned amino acid in source
+        protein
+    6.  source_stop INTEGER 1-based index of last aligned amino acid in source
+        protein
+    7.  source_len INTEGER number of amino acids in source protein
+    8.  source_auth STRING the source of the data, used for heuristic processing,
+        must be one of:
+          - "FIGfam"
+          - "KEGG"
+          - "RefSeq"
+          - "SwissProt"
+    9.  num_identities INTEGER number of exact amino acid matches in alignment
+    10. num_similarities INTEGER number of similar amino acid matches in alignment
+    11. raw_name STRING the name of the source protein
+    12. comment STRING can be used for any purpose
+
+A sample line:
+
+::
+
+    7000002454063496        134     581     448     7000000120703332        127     596     470     FIGfam      151     227     FIG029094-5 IncW plasmid conjugative protein TrwB (TraD homolog)
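+
+Split on tabs, such a line maps onto the fields above. A minimal parsing sketch (the trailing comment field may be absent, as in the sample):
+
+::
+
+    def parse_pidginb_line(line):
+        f = line.rstrip("\n").split("\t")
+        return {
+            "dest_id": f[0],
+            "dest_start": int(f[1]), "dest_stop": int(f[2]), "dest_len": int(f[3]),
+            "source_id": f[4],
+            "source_start": int(f[5]), "source_stop": int(f[6]), "source_len": int(f[7]),
+            "source_auth": f[8],
+            "num_identities": int(f[9]), "num_similarities": int(f[10]),
+            "raw_name": f[11],
+            "comment": f[12] if len(f) > 12 else "",
+        }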
+
+``.pidginh``
+~~~~~~~~~~~~
+
+All files with the extension ``.pidginh`` are assumed to contain HMMER alignments.
+
+Note: per-domain scores are ignored; we consider the whole hit only.
+
+Each line in a ``.pidginh`` file will consist of the following tab-separated fields:
+
+::
+
+    0.  dest_id STRING an identifier for a destination protein (i.e., a protein that
+        should receive a name)
+    1.  dest_start INTEGER 1-based index of first aligned amino acid in destination
+        protein
+    2.  dest_stop INTEGER 1-based index of last aligned amino acid in destination
+        protein
+    3.  dest_len INTEGER number of amino acids in destination protein
+    4.  source_id STRING an identifier for a source family (i.e., a profile whose
+        name should be considered for assignment to the destination protein)
+        currently should be a TIGRFAM or Pfam id.
+    5.  source_start INTEGER 1-based index of first aligned position in source family
+    6.  source_stop INTEGER 1-based index of last aligned position in source family
+    7.  source_len INTEGER number of positions in source family
+    8.  score FLOAT score reported by hmmer
+    9.  family_trusted_cutoff FLOAT
+    10. e_value FLOAT+INTEGER in the format X.XXeY where X.XX is a positive float and Y is an integer
+    11. raw_name STRING the name of the source family
+    12. comment STRING can be used for any purpose
+
+A sample line:
+
+::
+
+    7000002454071269        3       140     138     TIGRfam 13      155     143     83.519997       80.000000   -21.585027      ribosomal-protein-alanine acetyltransferase
+
+``.blastm8``
+~~~~~~~~~~~~
+
+Blast ``-m8`` format is also acceptable, but it requires submitting a name key via ``--ref``, as the m8 format contains no names. It is also much slower.
+
+It is assumed that all names derived from ``.blastm8`` have lower priority than other sources.
+
+Output
+------
+
+The names of these files are governed by the command-line options described above.
+
+Names
+~~~~~
+
+Each line of the name file has four columns:
+
+::
+
+    0. dest_id STRING an identifier for a destination protein (i.e., a protein that should receive a name)
+    1. name STRING the best available name for the destination protein
+    2. source_id STRING the id of the blast or hmmer hit used to name this protein
+    3. comment STRING the comment field from the line used to name this protein
+
+A snippet from a names.txt from a development run:
+
+::
+
+    7000002454076078   fructose-1-6-bisphosphatase   FIGfam    run on library updated 2009/10/22
+    7000002454076081   hypothetical protein          (blank)   (blank)
+
+Note that hypothetical proteins don't have the final two fields, as they did not pick up a name from the given sources.
+
+Etymology
+~~~~~~~~~
+
+The etymology file consists of a sequence of entries. Each entry describes the process by which the resulting name was chosen, showing tracking information as data is discarded, then a summary of how the name was cleaned up (this plugs directly into :doc:`cleaner`) before it is presented.
+
+Entries are separated by five equals signs and a newline: ``=====``
+
+Each entry begins with the dest\_id alone on the first line of the block. Convenient for searching!
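+
+For example, a minimal sketch that indexes etymology entries by their ``dest_id`` (assuming only the ``=====`` separator and first-line id described above):
+
+::
+
+    entries = {}
+    text = open("pidgin_etymology.txt").read()
+    for block in text.split("=====\n"):
+        block = block.strip()
+        if block:
+            dest_id = block.splitlines()[0]  # dest_id is alone on the first line
+            entries[dest_id] = block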
+
+A snippet of a local run:
+
+::
+
+    7000002454076078
+    1 hmmer source found.
+    0 hmmer sources were removed due to not meeting the trusted family score.
+    One hmmer source had a good name.
+    Found an acceptable name in the hmmer sources. The one we liked best came from:
+    ./test/Rho_sphaeroides_241_HMMERTRANSCRIPTS_17.pidginh:2013
+    This source's name was cleaned up by genepidgin:
+    filtered name in 1 step:
+    0) original: Fructose-1-6-bisphosphatase
+    1)   reason: protein names should not start with a capital letter
+        pattern: (?:(?<=similar to )|^)([A-Z])(?=[a-z][a-z]+([ /,-]|$))
+       filtered: fructose-1-6-bisphosphatase
+    Final name: fructose-1-6-bisphosphatase
+    =====
+    7000002454076081
+    0 hmmer sources found.
+    No name was derived from hmmer sources.
+    2 blast sources were found.
+    0 blast sources were removed by filtering for low coverage (<0.6).
+    The highest percent identity of any remaining blast source is 0.992. The lowest is 0.945.
+    0 blast sources were removed due to not being within the percent identity window (0.992, 0.942).
+    All 2 blast sources had names that filtered to nothing.
+    No name was ultimately selected from any of the supplied sources.
+    Final name: hypothetical protein
+
+.. note:: Please see :doc:`credits` for contributor information.
+

File docs/setup.rst

+Installation
+------------
+
+Genepidgin requires *Python 2.5+*. Most Python installations include ``easy_install``.
+
+(If you don't have ``pip``, prepend the following instructions with: ``easy_install pip``)
+
+::
+
+    pip install genepidgin
+
+For more instructions on subcommands, visit :doc:`cleaner`, :doc:`compare`, or :doc:`select`.
+
+Testing
+=======
+
+Development and testing require the ``nose`` package. To run the unit tests, ``pip install nose`` and then execute:
+
+::
+
+    nosetests

File docs/simple_name_file_format.rst

+Simple Name File Format
+-----------------------
+
+We try to use the same input/output format for names as much as possible throughout ``genepidgin``.
+
+The simple name file format is a flat text file. It's human-readable and was designed with simple database interactions in mind.
+
+Each line has three columns:
+
+#. A unique identifier, for example a database id, or simply a blank space.
+#. A tab character (``\t``).
+#. The name, which runs until the next tab character (if any).
+
+Ignored:
+
+- Lines beginning with ``#``
+- Any information following the second tab in a line
+
+An example of a simple name file:
+
+::
+
+    id1    the name can be any length
+    id2    and have any character but a newline
+    # this line is ignored
+    id3    this name is not ignored
+    id4    name followed by tab                   this information is ignored
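+
+A minimal reader for this format might look like the following sketch (``read_simple_name_file`` is a hypothetical helper, not part of ``genepidgin``):
+
+::
+
+    def read_simple_name_file(path):
+        names = {}
+        for line in open(path):
+            if line.startswith("#"):
+                continue  # comment lines are ignored
+            fields = line.rstrip("\n").split("\t")
+            if len(fields) >= 2:
+                names[fields[0]] = fields[1]  # anything after a second tab is ignored
+        return names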

File genepidgin/__init__.py

+version_info = (1, 1, 0)
+version = '.'.join(str(n) for n in version_info[:3])
+release = version + ''.join(str(n) for n in version_info[3:])

File genepidgin/cleaner.py

+#!/usr/bin/env python
+
+#
+# If you are a python fan and are wondering why BioName python code
+# seems stuck in 2001 idioms, it is because we are using jython, which
+# -- while a wonderful development experience -- is currently holding
+# tight at python language version 2.1 (released in 2001). We are very
+# much looking forward to the upcoming jython 2.5 release, so that we
+# can jump forward to 2006.
+#
+
+from __future__ import nested_scopes
+
+import codecs
+import re
+
+import filters
+import util
+
+from filters import FilterByFunction
+from filters import FilterGroup
+from filters import FilterRemove
+from filters import FilterReplace
+
+#
+# In the large beginning section of this file, we have some functions and
+# regular expressions. They're outside of the objects/functions because
+# we should only compile them once (on import) and then use them as
+# many times as we like.
+#
+# Later on, there are objects and interfaces to make these interfaces
+# palatable.
+#
+
+NOTAUTOGEN = "not automatically assigned"
+RETAINWEAK = "retains weak names"
+REORDERFAMILY = "reorder family predicates"
+REMOVETRAILING = "remove trailing clauses"
+
+# the following regular expressions are used in the name sorter
+predictedProteinRe = re.compile(
+  r"^\s*predicted\s+protein\s*$", re.I)
+hypotheticalProteinRe = re.compile(
+  r"^\s*hypothetical\s+protein(\s*|,.*)$", re.I)
+weakPredictionRe = re.compile(
+  r"^\s*hypothetical\s+protein\s+similar\s+to\s*(.*)$", re.I)
+
+# the name we give when we can't find a name
+DEFAULTNAME = "hypothetical protein"
+
+ecRe = re.compile(r"\(?EC +([0-9\.]+)\)?")
+def cleanupEC(name):
+    """
+    Eliminate EC names for now. The goal is to be smarter about this in the future.
+    """
+    if ecRe.search(name) is not None:
+        name = ecRe.sub("", name)
+    return name
+
+
+sevenBitRe = re.compile(r"[^\w\-_.,;:'+-/\\()\[\]]")
+def asciiify(name):
+    """
+    Make sure that the name is exportable 7-bit ascii.
+    """
+    try:
+        name = codecs.ascii_encode(name)[0]
+    except UnicodeError:
+        # force it into compliance by replacing weird chars with spaces
+        name = sevenBitRe.sub(" ", name)
+        name = codecs.ascii_encode(name)[0]
+    return name
+
+
+secondaryParenRe = re.compile(r"(.*\).*)\([^)]*\(.*")
+def cleanupNestedSecondaryParentheses(name):
+    """
+    If a name has nested parentheses after regular parentheses, which
+    are separated from the rest of the name by whitespace, strike them.
+
+    23S rRNA (U5-)-transferase rumA (23S rRNA(M-5-)-transferase) ...
+                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+    """
+    name = secondaryParenRe.sub(lambda x: x.group(1), name)
+
+    #
+    # Find the first unmatched close parenthesis, bracket and/or brace,
+    # then trim the name so that it no longer contains it.
+    #
+    for puncts in [ ("(", ")"), ("[", "]"), ("{", "}") ]:
+        numOpen = name.count(puncts[0])
+        numClosed = name.count(puncts[1])
+        if numClosed > numOpen:
+            i = 0
+            pos = 0
+            while i < numOpen + 1:
+                pos = name.find(puncts[1], pos) + 1
+                i += 1
+            name = name[:pos-1]
+
+    return name
+
+
+hspRe = re.compile(r"heat\s+shock|\bhsp(?:\b|[0-9])", re.I)  # NB: \b inside [...] means backspace, so use an alternation
+hspWeightRe = re.compile(r"(?:([\d\.]+) kDa)|(?:\bhsp(\d+))", re.I)
+def cleanupHeatShockNames(name):
+    """
+    If a name indicates a heat-shock protein, isolate and standardize
+    the relevant part of it.
+    """
+    if hspRe.search(name):
+        match = hspWeightRe.search(name)
+        if match:
+            weight = match.group(1) or match.group(2)
+            corename = "hsp" + weight
+            return "%s-like protein" % corename
+
+    return name
+
+
+endsDomainRe = re.compile(r"\bdomain\s*$", re.I)
+hasDoubleContainsRe = re.compile(r"\bcontaining\s+([A-Z0-9]+)\s+domain-containing\s*(protein)?")
+containsProteinRe = re.compile(r"\bprotein\b", re.I)
+def cleanupDomainEnd(name):
+    """
+    If a name ends in "domain" but does not have protein anywhere else
+    in it, switch to "domain-containing protein".
+    If a name has the word "containing" twice, try to clean it up.
+    """
+    if hasDoubleContainsRe.search(name):
+        name = hasDoubleContainsRe.sub(lambda x: "containing %s domain" % x.group(1), name)
+
+    # If ends in domain and doesn't contain protein...
+    elif endsDomainRe.search(name) and not containsProteinRe.search(name):
+        name = endsDomainRe.sub("domain-containing protein", name)
+
+    return name
+
+
+proteinFriendlyWordList = [
+    "phosphatase",
+    "kinase",
+    "transport",
+    "proteinase",
+    "export",
+    "disulfide",
+    "isomerase",
+    "methyltransferase",
+]
+startsWithProteinRe = re.compile(r"^\s*protein\s+")
+hasValidProteinPredicateRe = re.compile(r"|".join(proteinFriendlyWordList))
+def cleanupStartsWithProtein(name):
+    """
+    If a name begins with "protein", the remainder of the name
+    determines whether or not to remove that word.
+    (This is a separate function because Python's re module
+    can't handle variable-width lookbehinds)
+    """
+    if startsWithProteinRe.match(name) and not hasValidProteinPredicateRe.search(name):
+        name = startsWithProteinRe.sub("", name)
+    return name
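+# Illustrative, hypothetical inputs:
+#   cleanupStartsWithProtein("protein TonB") -> "TonB"
+#   "protein kinase" is left alone ("kinase" is on the friendly list)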
+
+
+repeatRe = re.compile(r"(\w{3,})\s+\1")
+def cleanupRepeats(name):
+    """
+    Clean up all sequential repeated words except for those
+    in a given list.
+    """
+    acceptableRepeats = ["kinase"]
+
+    mRe = repeatRe.search(name)
+    if not mRe:
+        return name
+    word = mRe.group(1)
+    if word:
+        for okWord in acceptableRepeats:
+            if word == okWord:
+                return name
+        name = repeatRe.sub(lambda x: x.group(1), name)
+    return name
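+# Illustrative, hypothetical inputs:
+#   cleanupRepeats("binding binding protein") -> "binding protein"
+#   "protein kinase kinase" is kept ("kinase" repeats legitimately)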
+
+
+multiWhitespaceRe = re.compile(r"\s+")
+def cleanupWhitespace(name):
+    "Like strip(), but moreso."
+    name = name.strip()
+    if multiWhitespaceRe.search(name):
+        name = multiWhitespaceRe.sub(" ", name)
+    return name
+
+
+# Underscores typify ids, except in certain baffling cases.
+# Negative lookbehinds can't be variable-width, so we just do the
+# exception check a little more deliberately. Probably reads better, too.
+underscoreFriendlyWordList = [
+    "PE_PGRS",
+    "VRR",
+]
+underscoreIdsRe = re.compile(r"\b([A-Z0-9]{2,5}_[A-Z0-9]{3,5})\b")
+underscoreExceptionsRe = re.compile(r"|".join(underscoreFriendlyWordList))
+def removeUnderscoreIds(name):
+    mRe = underscoreIdsRe.search(name)
+    if not mRe or not mRe.groups():
+        return name
+    potentialIdMatch = mRe.group(1)
+    if underscoreExceptionsRe.match(potentialIdMatch):
+        return name
+    else:
+        return re.sub(potentialIdMatch, "", name)
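+# Illustrative, hypothetical inputs:
+#   removeUnderscoreIds("RV_0001 hypothetical protein") drops the locus-style
+#   id; "PE_PGRS family protein" survives via the exception list.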
+
+
+# This is not used in BioName.
+keggExtractRe = re.compile(r"(.*?)\s;\s(.*)")
+def extractKEGG(name):
+    """
+    If a name is a unified KEGG field in the form:
+
+    definition ; orthology
+
+    ...then return just the orthology.
+    Otherwise, return the whole name.
+    """
+    m1 = keggExtractRe.search(name)
+    if m1:
+        return m1.group(2)
+    else:
+        return name
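+# Illustrative, hypothetical input:
+#   extractKEGG("glucokinase ; K00845") -> "K00845"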
+
+
+
+###
+# BioName is a collection of filters, originally intended to obtain decent
+# information from blast/NR.
+#
+# The overarching philosophy is: the names coming in aren't high
+# quality. Most of the rules are designed to find ways to throw the
+# entire name away. The rest are reformatted in an attempt to create
+# a standard format.
+#
+class BioName:
+    """
+    A utility class that cleans up gene names.
+
+    @param saveWeakNames binary: attempt to recreate hmp-style retention of weak names
+    @param removeTrailingClauses binary: kill predicates hidden behind punctuation
+    @param reorderFamily binary: make "X, family Y" into "Y family X"
+    @param hmp binary: if 1, override the other variables to "the hmp experience"
+    """
+    def __init__(self, minNameLength=3, maxNameLength=100,
+        saveWeakNames=0, removeTrailingClauses=0, reorderFamily=1, hmp=0):
+
+        self.minNameLength = minNameLength
+        self.maxNameLength = maxNameLength
+        self.saveWeakNames = saveWeakNames
+        self.removeTrailingClauses = removeTrailingClauses
+        self.reorderFamily = reorderFamily
+
+        self.hmp = hmp
+        if self.hmp:
+            self.removeTrailingClauses = 0
+            self.reorderFamily = 0
+            self.saveWeakNames = 1
+
+        self._compileFilters()
+        self._createFilterGroup()
+
+    def _createFilterGroup(self):
+
+        fgroup = FilterGroup(showPattern=1, outputType=filters.TEXT)
+
+        fgroup.addAll(self.killWholeNameList)
+
+        fgroup.add(FilterByFunction(cleanupWhitespace, "cleanup whitespace"))
+        fgroup.add(FilterByFunction(asciiify, "cleanup non-ascii characters"))
+        fgroup.add(FilterByFunction(cleanupEC, "cleanup EC numbers"))
+        fgroup.addAll(self.initialDistillationList)
+        fgroup.addAll(self.typoList)
+
+        fgroup.add(FilterByFunction(cleanupHeatShockNames, "cleanup heat shock names"))
+        fgroup.addAll(self.clauseRemovalList)
+        fgroup.addAll(self.allClauseRemovalList)
+        fgroup.addAll(self.weakNameSaveList)
+        fgroup.addAll(self.idDeletionList)
+        fgroup.addAll(self.clauseReplaceList)
+        fgroup.addAll(self.organismNameList)
+        fgroup.addAll(self.punctuationList)
+        fgroup.add(FilterByFunction(cleanupNestedSecondaryParentheses, "remove nested secondary parens"))
+        fgroup.addAll(self.automaticPunctuationList)
+        fgroup.add(FilterByFunction(cleanupWhitespace, "cleanup whitespace"))
+        fgroup.addAll(self.cleanupList)
+        fgroup.add(FilterByFunction(cleanupRepeats, "cleanup repeats within names"))
+        # this rule is too strict when applied to swiss-prot - reactivate for NR
+#		fgroup.add(FilterByFunction(cleanupStartsWithProtein, "cleanup names that begin with protein"))
+        fgroup.add(FilterByFunction(cleanupDomainEnd, "cleanup domain at end of name"))
+        fgroup.add(FilterByFunction(cleanupWhitespace, "cleanup whitespace"))
+        fgroup.addAll(self.wholeNameModificationList)
+        fgroup.addAll(self.capitalizeList)
+        fgroup.addAll(self.punctuationList)		# clean up punctuation a second time
+
+        # not external because the lengths are variable and coding around it
+        # would be globally awkward, rather than locally awkward
+        fgroup.add(FilterByFunction(
+            lambda x: (self.minNameLength <= len(x) <= self.maxNameLength) and x or None,
+            "names that are too short or too long",
+            skipif=[NOTAUTOGEN]
+        ))
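+        # nb: "cond and x or None" is the pre-2.5 conditional idiom; the
+        # lambda returns the name when the length test passes, else None
+        # (which the filter machinery presumably treats as a kill).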
+
+        self.fgroup = fgroup
+
+
+    #
+    # Filtering rules for filter() and cleanup().
+    # The order of processing is determined by _createFilterGroup().
+    #
+    def _compileFilters(self):
+
+        #
+        # These patterns trigger whole name deletions via a simple
+        # substring match, typically from a list of bad substrings.
+        #
+        self.killWholeNameList = []
+        killList = []
+
+        #
+        # Low confidence words indicate that whoever named the protein we're using for
+        # evidence was not sure of the name. Since we're not sure the protein
+        # we're annotating is the same as the evidence, we should be extra
+        # careful. Names with any of the following low confidence words are ignored.
+        #
+        lowConfidenceList = [
+            (r"dubious", None),
+            (r"DUF", None),
+            (r"doubtful", None),
+            (r"fragment", None),
+            (r"homolog[ue]*?", "homolog"),
+            (r"key", None),
+#			(r"like", None),
+            (r"may", None),
+            (r"novel", None),
+            (r"of", None),
+            (r"open\s+reading\s+frame", "open reading frame"),
+            (r"partial", None),
+            (r"po[s]+ibl[ey]", "possibly"),
+            (r"predicted", None),
+            (r"proba\s*ble", "probable"),
+            (r"product", None),
+            (r"putative", None),
+            (r"putavie", None),
+            (r"related", None),
+            (r"similar", None),
+            (r"similarity", None),
+            (r"synthetic", None),
+            (r"UPF", None),
+            (r"un[cs]haracteri[zs]ed", "uncharacterized"),
+            (r"unknow[n]?", "unknown"),
+            (r"unnamed", None),
+        ]
+        # Low confidence words must appear alone: enclosing them in \b matches "may"
+        # but not "mayflower", "DUF protein" but not "DUF9001 protein".
+        for (pat, desc) in lowConfidenceList:
+            killList.append(
+                FilterRemove(
+                    re.compile(r".*\b%s\b.*" % pat, re.I),
+                    desc = "killed by inclusion of [%s]" % (desc or pat),
+                    skipif=[RETAINWEAK]
+                )
+            )
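+        # e.g. the "may" entry compiles to r".*\bmay\b.*", killing
+        # "protein that may bind DNA" but sparing "mayflower protein".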
+        # the following low confidence words are ok if manually assigned
+        # for example, "conserved hypothetical protein" is valid
+        # (this is going to need rewriting if one more tiny layer of complexity is added)
+        lowConfidenceNotAutogen = [
+            (r"conserved", None),
+            (r"hypothetical", None),
+        ]
+        for (pat, desc) in lowConfidenceNotAutogen:
+            killList.append(
+                FilterRemove(
+                    re.compile(r".*\b%s\b.*" % pat, re.I),
+                    desc = "killed by inclusion of [%s]" % (desc or pat),
+                    skipif=[RETAINWEAK, NOTAUTOGEN]
+                )
+            )
+
+        #
+        # use these names a second time for removeLowConfidence()
+        #
+        self.lowConfidenceRemoval = []
+        for (pat, desc) in lowConfidenceList:
+            self.lowConfidenceRemoval.append(
+                FilterRemove(
+                    re.compile(r"\b%s\b" % pat, re.I),
+                    desc = "removed low confidence word [%s]" % (desc or pat),
+                )
+            )
+
+
+        #
+        # The following additions to the drop list indicate specific names that
+        # are not trustworthy.
+        #
+        dropList = [
+            # got 400 of these on CNA1
+            (r"expressed\s+protein", None),
+            # not valid reference
+            (r"ORF\d*?", None),
+            (r"([A-Z])\s+(CHAIN|Chain|chain)\s+\1", "A Chain A names are always wrong"),
+            (r"^(CHAIN|Chain|chain)$", "names that are simply the word Chain are suspect"),
+            (r"ink\ 76", None),
+            (r"^(e_){0,1}gw1|^est_|^fge1_", "spurious aspergilli"),
+            (r"\.$", "names ending in period are bad"),
+            (r"\\", "names containing a backslash are suspect"),
+            #(r"\)\)|\(\(", "double parens are usually in bad names"),
+            (r"RIKEN", "RIKEN is a tag name, toss whole name"),
+        ]
+        for pat, desc in dropList:
+            killList.append(FilterRemove(re.compile(r".*\b%s\b.*" % pat, re.I), desc))
+
+        #
+        # The presence of a software name invalidates the whole name.
+        #
+        softwareNames = [
+            (r"glimmer(\s*3)?", "glimmer"),
+            (r"gene(_|\s+)?id", "geneid"),
+            (r"genemark(hmm|hmmes)?", "genemark"),
+            (r"conrad", "conrad"),
+            (r"blast", "blast"),
+            (r"augustus", "augustus"),
+            (r"fgenesh(\++)?", "fgenesh"),
+            (r"hmmer", "hmmer"),
+            (r"metagene", "metagene"),
+            (r"snap", "snap"),
+            (r"zcurve[_]?[bv]?", "zcurve"),
+        ]
+        for sName, desc in softwareNames:
+            killList.append(FilterRemove(re.compile(r".*\b%s\b.*" % sName, re.I), "software title %s" % desc))
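+        # e.g. the glimmer entry compiles to r".*\bglimmer(\s*3)?\b.*",
+        # so a name like "predicted by Glimmer3" is discarded whole.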
+
+        self.killWholeNameList = killList
+
+        #
+        # Capitalization rules.
+        #
+        self.capitalizeList = []
+        capList = []
+
+        capList.append(FilterReplace(
+            re.compile(r"(?:(?<=similar to )|^)([A-Z])(?=[a-z][a-z]+([ /,-]|$))"),
+            lambda x: x.group(0).lower(),
+            "protein names should not start with a capital letter"
+        ))
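+        # e.g. "Alcohol dehydrogenase" -> "alcohol dehydrogenase", while
+        # "DNA polymerase" is untouched (the lookahead demands two
+        # following lowercase letters).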
+        capList.append(FilterReplace(
+            re.compile(r".*[A-Z]{6,}.*"),
+            lambda x: x.group(0).lower(),
+            "6+ consecutive capital letters: make the whole string lowercase"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bactin\b", re.I),
+            lambda x: x.group(0).lower(),
+            "replace ACTIN with actin, it's short and doesn't get caught"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\brifin\b", re.I),
+            lambda x: x.group(0).lower(),
+            "replace RIFIN with rifin, it's short and doesn't get caught"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\b(rieske|mur|cullin)\b", re.I),
+            lambda x: x.group(0).title(),
+            "recapitalize people / place names"
+        ))
+
+        # decapitalize other names - any surviving low confidence words should be
+        # lower case, including "conserved hypothetical" and "lowConfidence protein"
+        for ww, desc in lowConfidenceList:
+            capList.append(FilterReplace(
+                re.compile(r"\b%s\b" % ww, re.I),
+                lambda x: x.group(0).lower(),
+                "make low confidence words lowercase"
+            ))
+        for ww in ["conserved", "protein"]:
+            capList.append(FilterReplace(
+                    re.compile(r"\b%s\b" % ww, re.I),
+                    lambda x: x.group(0).lower(),
+                    "lowercase %s" % ww
+            ))
+
+        capList.append(FilterReplace(
+            re.compile(r"\bFamily\b\s*$"),
+            "family",
+            "lowercase family/superfamily when last word"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bSuperfamily\b\s*$"),
+            "superfamily",
+            "lowercase family/superfamily when last word"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\b[ivx]+\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase roman numerals"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"[dr]na\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase DNA/RNA"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"([ag]tp)(\b|ase)", re.I),
+            lambda x: x.group(1).upper() + x.group(2).lower(),
+            "uppercase ATP/GTP"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"nad\w*h\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase NADPH/NADH"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bmce-family\b", re.I),
+            "MCE-family",
+            "repair MCE-family"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bo-(\w+)\b", re.I),
+            lambda x: "O-" + x.group(1).lower(),
+            "lowercase O-* (example: o-methyltransferase)"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\b(p{1,2}e) family protein", re.I),
+            lambda x: x.group(1).upper() + " family protein",
+            "repair PE family protein or PPE family protein"
+        ))
+        # the following names are erroneously capitalized in some hmmer entries
+        hmmerToLower = (
+            "Methylase",
+            "Corrin",
+            "Porphyrin",
+            "Active",
+        )
+        capList.extend(
+            [FilterReplace(
+                re.compile(r"\b%s\b" % htl, re.I),
+                lambda x: x.group(0).lower(),
+                "%s erroneously capitalized in some hmmer entries" % htl)
+            for htl in hmmerToLower]
+        )
+        capList.append(FilterReplace(
+            re.compile(r"\bHolliday\b", re.I),
+            lambda x: x.group(0).title(),
+            "fix some errors from earlier filter"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bpts\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase PTS"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bclp|ppx|ras\b", re.I),
+            lambda x: x.group(0).title(),
+            "title-case Clp, Ppx and Ras"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bei*cba|TFI*B\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase EI*CBA and TFI*B"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bmpt\d{2}\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase MPT##"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\bcrispr\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase CRISPR"
+        ))
+        capList.append(FilterReplace(
+            re.compile(r"\babc\b", re.I),
+            lambda x: x.group(0).upper(),
+            "uppercase ABC"
+        ))
+        # title-cap all amino acid names
+        proteinNames = ["ala", "arg", "asn", "asp", "cys", "gln", "glu", "gly", "his",
+            "ile", "leu", "lys", "met", "phe", "pro", "ser", "thr", "trp", "tyr", "val"]
+        for pname in proteinNames:
+            capList.append(FilterReplace(
+                re.compile(r"\b%s\b" % pname, re.I),
+                lambda x: x.group(0).lower().capitalize(),
+                "title-capitalize amino acid names (%s)" % pname.capitalize()
+            ))
+
+        self.capitalizeList = capList
+
+        #
+        # Initial distillation/extraction rules.
+        # (Sometimes, people upload their excel spreadsheet to NCBI. Before major
+        # improvement can occur, we have to extract the relevant part of the name.)
+        # This list has to be before clause/punctuation cleanup, because such cleanup
+        # would likely leave the wrong part of the string as the main name
+        #
+        self.initialDistillationList = []
+        self.initialDistillationList.append(FilterReplace(
+            re.compile(r".*?[Ff]ull\=(.*?)(?:;|$).*"),
+            lambda x: x.group(1),
+            "extract useful info from database dumps including Full="
+        ))
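+        # e.g. a swiss-prot style dump line (hypothetical input) such as
+        # "RecName: Full=Aspartokinase; EC=2.7.2.4;" distills to "Aspartokinase".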
+
+        #
+        # These rules fix typographical errors and are applied to all names.
+        #
+        self.typoList = []
+        # simple replacements
+        typos = [
+            # (why is transporter so frequently misspelled?)
+            (r"trans[p]?o[_r]?te[r]?", "transporter"),
+            (r"chmomosome", "chromosome"),
+            (r"put[aitv]+e", "putative"),
+            (r"protei\b", "protein"),
+            (r"prot[ei]+n\b", "protein"),
+            (r"ised\b", "ized"), # true especially for jazzercise
+            (r"bindingprotein", "binding protein"),
+            (r"[h]?hy[p]?ot[h]?etical", "hypothetical"),
+            (r"hypotehtical", "hypothetical"),
+            (r"signalling", "signaling"),
+            (r"oligoeptide", "oligopeptide"),
+            (r"dephosph\b", "dephospho"),
+            (r"glycosy\b", "glycosyl"),
+            (r"symport\b", "symporter"),
+            (r"asparate", "aspartate"),
+            (r"\bKd[a]?\b", "kDa"),
+            (r"\batpas\b", "ATPase"),
+            # prefer American English spelling
+            (r"\bhaemolysin\b", "hemolysin"),
+            (r"\bhaemagglutinin\b", "hemagglutinin"),
+            (r"aluminium", "aluminum"),
+            (r"utilis", "utiliz"),
+            (r"phosphopantethiene", "phosphopantetheine"),
+            (r"resistence", "resistence"),
+        ]
+        self.typoList.extend(
+            [FilterReplace(
+                re.compile(pat, re.I),
+                repl,
+                "typo: %s" % repl)
+             for pat, repl in typos]
+        )
+
+        #
+        # These clauses are not informative when automatically assigned.
+        # However, manual annotators can assign them from a file or annotation.
+        #
+        self.clauseRemovalList = []
+        removals = []
+        removals.append(FilterRemove(
+            re.compile(r"\b[Bb]ifunctional\s+protein\b"),
+            "trim: bi functional protein",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"low\s+molecular\s+weight"),
+            "trim: low molecular weight",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"low[-|\s]affinity"),
+            "trim: low affinity",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bDNA\s+gyrase\b(?!.*subunit.*)", re.I),
+            "trim: DNA gyrase when not followed by subunit",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\b[\-]?truncated\b"),
+            "trim: truncated",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bsubunits\b", re.I),
+            "trim: subunits",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"involved\s+in\s+.*"),
+            "trim everything following: involved in",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"[\d\-\.]+\s+kDa\s+(?!subunit)"),
+            "delete 70 kDa but allow 70 kDa subunit",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\b(mitochondrial\s*)?precursor\b"),
+            'trim: "precursor" and "mitochondrial precursor"',
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\-associated\s+region\b"),
+            "trim: -associated region",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\:subunit=.*"),
+            "trim everything following: :subunit=",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\band\s+inactivated\s+derivatives\b"),
+            "trim: and inactivated derivatives",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\band other.*"),
+            "trim everything following: and other",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bassociated with.*"),
+            "trim everything following: associated with",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\band\s+\d\s+protein"),
+            "trim: and # protein",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bfrom\b"),
+            "trim: from",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\(?photosystem\s+q\(a\)\s+protein\)?"),
+            "trim: photosystem q(a) protein",
+            skipif=[NOTAUTOGEN]
+        ))
+        # this is here because the low confidence filters are not active
+        # all the time
+        removals.append(FilterRemove(
+            re.compile(r"(5\'|3\'|5|3)-partial"),
+            "trim all partials referring to start and stop codons",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\s*[nNcC][ -]terminus"),
+            "trim all occurances of terminus when preceded by n or c",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"(\s+N)?\s+repeat$"),
+            "trim all repeat or N repeat",
+            skipif=[NOTAUTOGEN]
+        ))
+        #removals.append(FilterRemove(
+        #	re.compile(r"(?:\,)?\s+?paralogous\s+family"),
+        #	"trim paralogous families",
+        #	skipif=[NOTAUTOGEN]
+        #))
+        removals.append(FilterRemove(
+            re.compile(r"\bvery\b", re.I),
+            "trim: very",
+            skipif=[NOTAUTOGEN]
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bvalidate[d]?\b", re.I),
+            "trim: validate",
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\bgene\b", re.I),
+            "trim: gene",
+        ))
+        removals.append(FilterRemove(
+            re.compile(r"\btruncat[ed]*\b", re.I),
+            "trim: truncate",
+        ))
+        self.clauseRemovalList.extend(removals)
+
+
+        #
+        # These clauses are never informative, no matter how assigned.
+        #
+        self.allClauseRemovalList = []
+        self.allClauseRemovalList.append(FilterRemove(
+            re.compile(r"[([]predicted[\])]", re.I),
+            "trim predicted and immediate surrounding parens"
+        ))
+        self.allClauseRemovalList.append(FilterRemove(
+            re.compile(r"[([]imported[\])]", re.I),
+            "trim imported and immediate surrounding parens"
+        ))
+
+
+        #
+        # Clause replacement - try to enforce consistent terminology.
+        # try to keep these clauses universally true, not just at beginning
+        # or end of string.
+        #
+        self.clauseReplaceList = []
+        replacers = []
+        replacers.append(FilterReplace(
+            re.compile(r"SAM\s+domain\s+[(]Sterile\s+alpha\smotif[)]"),
+            "SAM (sterile alpha motif) domain",
+            "special case from hmmer"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\bsubunit\s+family\b"),
+            "subunit",
+            "subunit family -> subunit"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"^(.*?):\1", re.I),
+            lambda x: x.group(1),
+            "keep only half of duplicated strings"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"protein\s+product"),
+            "protein",
+            "protein product -> product"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\bdomain\s+protein\b"),
+            "domain-containing protein",
+            "standardize to domain-containing protein"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\bdomain\s+containing\s+protein\b"),
+            "domain-containing protein",
+            "standardize to domain-containing protein"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\bmotif\b(?!\))"),
+            "domain-containing protein",
+            "motif -> domain-containing protein (unless motif in parens)"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\btransposase\s+mutator\s+type\b"),
+            "transposase",
+            "transposase mutator type to transposase"
+        ))
+        replacers.append(FilterReplace(
+            re.compile(r"\bdiacylglycerol\s+kinase\s+catalytic\s+region\b"),
+            "diacylglycerol kinase",
+            "diacylglycerol kinase catalytic region to diacylglycerol kinase"
+        ))
+        # remove "family protein" or "protein" following any "-ase" word
+        # shortest valid one I can think of is "kinase"
+        replacers.append(FilterReplace(
+            re.compile(r"\b(\w{3,}ase)(\s+family)?\s+protein"),
+            lambda x: x.group(1),
+            "remove protein or family protein following any -ase"
+        ))
+        # the following rules were too picky for swiss-prot
+#		replacers.append(FilterReplace(
+#			re.compile(r"\b(.*?ase)\s+(.)\s+chain\b"),
+#			lambda x: "%s subunit %s" % (x.group(1), x.group(2)),
+#			"Xase Y chain -> Xase subunit Y"
+#		))