Files changed (2)
- **``cleaner``** takes a name and applies the full list of filters to it. A name can be filtered to an empty string by this function; the output of the command will tell you why. Names that are filtered to nothing are ones Genepidgin considers to be uninformative.
Setting the -d flag indicates that genepidgin should return a default name (``"hypothetical protein"``) when a name would otherwise be blank.
First, we split the names up by spaces, remove EC numbers and punctuation and other sorts of extra characters, convert everything to lowercase, etc.
In this step we strike out words that are only useful in a grammatical sense, including *an, and, in, is, of, the,* etc. We also remove weasel words, such as *generic, hypothetical, related,* etc. Finally, we remove glue words, such as *associated, class, component, protein, system,* and *type.* When these words are stripped we are left with a "core" name that identifies the protein; different namers may use different glue words to format the core name and we ignore those.
Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it's possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact there are cases where the lowest distance is composed entirely of sub-optimal token pairings. So we need to try a lot of combinations. To do this we precompute two scores for each pair of tokens, and build two *n* × *n* matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens.
-#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
+#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well on genome-scale scoring studies across a variety of prokaryotes.