#. decompose each name into tokens
First, we split each name on spaces, strip EC numbers, punctuation, and other extraneous characters, and convert everything to lowercase.
**in:** "Ribosomal protein, S23-type"
**out:** "ribosomal" · "protein" · "s23-type"
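The tokenization step can be sketched as follows. This is an illustrative implementation only; the exact rules for recognizing EC numbers and punctuation in the real tool may differ.

```python
import re

def tokenize(name):
    """Split a protein name into lowercase tokens (a sketch; the real
    rules for EC numbers and punctuation may be more involved)."""
    # Strip EC numbers such as "(EC 1.2.3.4)".
    name = re.sub(r"\(?\bEC[ :]?[0-9.\-]+\)?", " ", name, flags=re.IGNORECASE)
    # Treat commas and other separators as spaces, but keep hyphens
    # inside tokens like "s23-type".
    name = re.sub(r"[,;:()\[\]]", " ", name)
    return name.lower().split()

tokenize("Ribosomal protein, S23-type")
# → ['ribosomal', 'protein', 's23-type']
```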
#. remove uninformative tokens
In this step we strike out words that serve only a grammatical purpose, such as *an, and, in, is, of,* and *the.* We also remove weasel words, such as *generic, hypothetical,* and *related.* Finally, we remove glue words, such as *associated, class, component, protein, system,* and *type.* Stripping these words leaves a "core" name that identifies the protein; different namers may use different glue words to format the core name, and we ignore those differences.
**in:** "ribosomal" · "protein" · "s23-type"
**out:** "ribosomal" · "s23"
Because we strip out uninformative tokens, we count all of the following strings as equal.
- "conserved hypothetical protein"
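A minimal sketch of the filtering step, assuming small illustrative word lists (the lists used in practice are longer, and the exact membership of each category is an assumption here):

```python
# Illustrative word lists only; the real lists are longer, and treating
# "conserved" and "putative" as weasel words is an assumption.
GRAMMAR_WORDS = {"a", "an", "and", "in", "is", "of", "the"}
WEASEL_WORDS = {"generic", "hypothetical", "related", "conserved", "putative"}
GLUE_WORDS = {"associated", "class", "component", "protein", "system", "type"}
DROP = GRAMMAR_WORDS | WEASEL_WORDS | GLUE_WORDS

def core_tokens(tokens):
    """Remove uninformative tokens, including glue words attached with a
    hyphen (so "s23-type" reduces to "s23")."""
    kept = []
    for tok in tokens:
        # Drop hyphenated glue suffixes such as "-type" before filtering.
        parts = [p for p in tok.split("-") if p not in DROP]
        tok = "-".join(parts)
        if tok and tok not in DROP:
            kept.append(tok)
    return kept

core_tokens(["ribosomal", "protein", "s23-type"])     # → ['ribosomal', 's23']
core_tokens(["conserved", "hypothetical", "protein"]) # → []
```

Note that "conserved hypothetical protein" reduces to an empty core, which is why such strings all compare as equal.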
#. rearrange the tokens in such a way as to ...
Finding the best edit distance between two names of, say, 4 tokens each is a bit tricky, because it's possible that the lowest cumulative edit distance will involve one or more sub-optimal individual token matches. In fact there are cases where the lowest distance is composed entirely of sub-optimal token pairings. So we need to try a lot of combinations. To do this we precompute two scores for each pair of tokens, and build two *n* × *n* matrices to hold them. We then score all possible paths with distinct pairwise token pairings via these matrices. For each path we combine two scores: we try to minimize the normalized edit distance between token pairs, and we try to maximize the length of the longest pairwise common substrings between pairs of tokens.
Note that token order has no effect on the distance between two names.
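The pairing search above can be sketched as follows. This is a simplified version that assumes both names have the same number of tokens and uses plain Levenshtein distance and illustrative tie-breaking; the real tool uses Damerau-Levenshtein and its own weighting of the two scores.

```python
import itertools
from difflib import SequenceMatcher

def levenshtein(a, b):
    """Plain Levenshtein distance (a stand-in for Damerau-Levenshtein)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def best_pairing(tokens_a, tokens_b):
    """Score every distinct pairwise assignment of tokens, minimizing
    summed normalized edit distance and, as a tie-breaker, maximizing
    the summed longest-common-substring lengths."""
    n = len(tokens_a)
    # Precompute the two n x n matrices described above.
    dist = [[levenshtein(a, b) / max(len(a), len(b))
             for b in tokens_b] for a in tokens_a]
    lcs = [[SequenceMatcher(None, a, b).find_longest_match(
                0, len(a), 0, len(b)).size
            for b in tokens_b] for a in tokens_a]
    best = None
    for perm in itertools.permutations(range(n)):
        d = sum(dist[i][perm[i]] for i in range(n))
        s = sum(lcs[i][perm[i]] for i in range(n))
        score = (d, -s)  # minimize distance, then maximize substring length
        if best is None or score < best[0]:
            best = (score, perm)
    return best
```

Because the search considers every permutation, the order of tokens in either name cannot change the result, which is the point made above.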
#. report a single number between 0 and 1 (inclusive) summarizing the distance between the two names.
A perfect token-token match is really good. A lot of perfect matches are really, really good. Long common substrings are fairly good. The Damerau-Levenshtein distance can return higher distances than we might like for these three types of token matches. On the other hand, maximizing the length of the longest common substring(s) has its own set of problems. After a great deal of trial and error, we have settled on the following equation, which has worked well on genome-scale scoring studies across a variety of prokaryotes.
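As background for the trade-offs discussed above, here is a sketch of the restricted Damerau-Levenshtein distance (the "optimal string alignment" variant) with a simple max-length normalization into [0, 1]. The normalization shown is a common textbook choice, not necessarily the settled equation referred to above.

```python
def damerau_levenshtein(a, b):
    """Restricted Damerau-Levenshtein (optimal string alignment) distance:
    edits are insertion, deletion, substitution, and transposition of
    adjacent characters."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if (i > 1 and j > 1 and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]

def normalized_distance(a, b):
    """Map the raw distance into [0, 1]; 0 means identical tokens."""
    if not a and not b:
        return 0.0
    return damerau_levenshtein(a, b) / max(len(a), len(b))

damerau_levenshtein("s23", "s32")  # → 1 (one adjacent transposition)
```

Note how a transposed pair like "s23"/"s32" costs a single edit here, whereas plain Levenshtein would charge two; this is one reason the distance alone can still over-penalize near-identical tokens, motivating the common-substring term.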