Clint Howarth  committed 06db768

documentation corrections and improvements, because that's a lot of text

  • Parent commits 5fc38c0


Files changed (2)

File docs/cleaner.rst

 - **``cleaner``** takes a name and applies the full list of filters to it. A name can be filtered to an empty string by this function; the output of the command will tell you why. Names that filter to nothing are those Genepidgin considers uninformative.
-   ::
-       $ genepidgin cleaner <inputfile>
+    $ genepidgin cleaner <inputfile>
 Setting the -d flag indicates that genepidgin should return a default name (``"hypothetical protein"``) when a name would otherwise be blank.
     $ genepidgin cleaner -h
     usage: cleaner [-h] [--silent] [--default] input output
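
 For example, a run that falls back to the default name might look like this (the file names here are illustrative)::

     $ genepidgin cleaner -d names.txt cleaned.txt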

File docs/compare.rst

 In more detail:
-#. decompose each name into tokens
+#. **decompose each name into tokens**
   First, we split each name on spaces, remove EC numbers, punctuation, and other extraneous characters, convert everything to lowercase, and so on. (Steps 1 and 2 are sketched in code after this list.)
    **in:** "Ribosomal protein, S23-type"
-    **out:** "ribosomal" · "protein" · "s23-type"
+   **out:** "ribosomal" · "protein" · "s23-type"
-#. remove uninformative tokens
+#. **remove uninformative tokens**
   In this step we strike out words that are only useful in a grammatical sense, including *an, and, in, is, of, the,* etc. We also remove weasel words, such as *generic, hypothetical, related,* etc. Finally, we remove glue words, such as *associated, class, component, protein, system,* and *type.* When these words are stripped, we are left with a "core" name that identifies the protein; different namers may use different glue words to format the core name, and we ignore those.
    **in:** "ribosomal" · "protein" · "s23-type"
-    **out:** "ribosomal" · "s23"
+   **out:** "ribosomal" · "s23"
   Because we strip out uninformative tokens, we count all of the following strings as equal.
    -  "hypothetical protein"
    -  "conserved hypothetical protein"
-#. rearrange the tokens in such a way as to ...
+#. **rearrange the tokens in such a way as to ...**
   Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it's possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact, there are cases where the lowest distance is composed entirely of sub-optimal token pairings, so we need to try a lot of combinations. To do this, we precompute two scores for each pair of tokens and build two *n* × *n* matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens. (A sketch of this search appears after this list.)
    Note that token order has no effect on the distance between two names.
-#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
+#. **report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.**
   A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well in genome-scale scoring studies across a variety of prokaryotes.
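
 To make steps 1 and 2 concrete, here is a minimal Python sketch of the tokenize-and-filter pass. The word lists, regular expressions, and function names are illustrative placeholders, not genepidgin's actual filter tables or API::

     import re

     # Placeholder word lists -- genepidgin's real tables are longer.
     GRAMMAR_WORDS = {"an", "and", "in", "is", "of", "the"}
     WEASEL_WORDS = {"conserved", "generic", "hypothetical", "related"}
     GLUE_WORDS = {"associated", "class", "component", "protein", "system", "type"}
     STOPWORDS = GRAMMAR_WORDS | WEASEL_WORDS | GLUE_WORDS

     EC_NUMBER = re.compile(r"\(?\bEC[\s:]*[\d.\-]+\)?", re.IGNORECASE)

     def tokenize(name):
         """Step 1: strip EC numbers and punctuation, lowercase, split on spaces."""
         name = EC_NUMBER.sub(" ", name)
         name = re.sub(r"[^\w\s-]", " ", name)  # drop punctuation, keep hyphens
         return name.lower().split()

     def core_tokens(name):
         """Step 2: drop grammatical, weasel, and glue words."""
         kept = []
         for token in tokenize(name):
             if token in STOPWORDS:
                 continue
             # Glue words also disappear when they ride along as a hyphenated
             # suffix, which is how "s23-type" becomes "s23".
             head, _, tail = token.rpartition("-")
             kept.append(head if head and tail in GLUE_WORDS else token)
         return kept

     print(tokenize("Ribosomal protein, S23-type"))       # ['ribosomal', 'protein', 's23-type']
     print(core_tokens("Ribosomal protein, S23-type"))    # ['ribosomal', 's23']
     print(core_tokens("conserved hypothetical protein")) # [] -- equal to "hypothetical protein"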
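
 And here is a minimal sketch of the pairing search in step 3, assuming two precomputed *n* × *n* matrices of normalized Damerau-Levenshtein distances and longest-common-substring lengths. All names are again illustrative, and the weighting at the end is a deliberate placeholder, not the equation genepidgin actually uses::

     from itertools import permutations

     def dl_distance(a, b):
         """Damerau-Levenshtein distance (optimal string alignment variant)."""
         d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
         for i in range(len(a) + 1):
             d[i][0] = i
         for j in range(len(b) + 1):
             d[0][j] = j
         for i in range(1, len(a) + 1):
             for j in range(1, len(b) + 1):
                 cost = 0 if a[i - 1] == b[j - 1] else 1
                 d[i][j] = min(d[i - 1][j] + 1,         # deletion
                               d[i][j - 1] + 1,         # insertion
                               d[i - 1][j - 1] + cost)  # substitution
                 if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                     d[i][j] = min(d[i][j], d[i - 2][j - 2] + cost)  # transposition
         return d[len(a)][len(b)]

     def lcs_length(a, b):
         """Length of the longest common substring of a and b."""
         best, prev = 0, [0] * (len(b) + 1)
         for i in range(1, len(a) + 1):
             cur = [0] * (len(b) + 1)
             for j in range(1, len(b) + 1):
                 if a[i - 1] == b[j - 1]:
                     cur[j] = prev[j - 1] + 1
                     best = max(best, cur[j])
             prev = cur
         return best

     def best_pairing(tokens1, tokens2):
         # Pad the shorter list, then precompute the two n x n matrices.
         n = max(len(tokens1), len(tokens2))
         t1 = tokens1 + [""] * (n - len(tokens1))
         t2 = tokens2 + [""] * (n - len(tokens2))
         dist = [[dl_distance(a, b) / max(len(a), len(b), 1) for b in t2] for a in t1]
         lcs = [[lcs_length(a, b) for b in t2] for a in t1]
         # Every permutation is one "path" of distinct pairwise token pairings,
         # which is also why token order cannot affect the result.
         best = None
         for perm in permutations(range(n)):
             d = sum(dist[i][j] for i, j in enumerate(perm)) / n  # minimize
             s = sum(lcs[i][j] for i, j in enumerate(perm))       # maximize
             score = max(0.0, d - 0.01 * s)  # PLACEHOLDER weighting only
             if best is None or score < best[0]:
                 best = (score, perm)
         return best

     print(best_pairing(["ribosomal", "s23"], ["s23", "ribosomal"]))  # (0.0, (1, 0))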