How does the function seqidentity() calculate sequence identity and similarity?

Issue #255 resolved
Former user created an issue

From the description, I gather that the sequence identity is "a single numeric score determined for each pair of aligned sequences. It measures the number of identical residues (“matches”) in relation to the length of the alignment." But what about similarity? How does the function calculate it?

Comments (5)

  1. Lars Skjærven

    Hi, for sequence similarity, the amino acid residues from the 22-letter alphabet are classified into one of 10 types, loosely following the convention of Mirny and Shakhnovich (1999): Hydrophobic/Aliphatic [V,I,L,M], Aromatic [F,W,Y], Ser/Thr [S,T], Polar [N,Q], Positive [H,K,R], Negative [D,E], Tiny [A,G], Proline [P], Cysteine [C], and Gaps [-,X]. After this classification, the calculation is done in the same way as you describe. See also the function entropy() for more on this. Lars
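
The classification described above can be sketched roughly as follows. This is a minimal Python illustration, not bio3d's R code: the function names (`fraction_matching`, `identity`, `class_similarity`) are hypothetical, the denominator is the full alignment length as in the quoted description, and gap positions are counted like any other column, which is an assumption about gap handling rather than documented behavior.

```python
# Residue classes as listed in the comment above (10 types).
CLASSES = {
    "hydrophobic": "VILM",
    "aromatic": "FWY",
    "ser_thr": "ST",
    "polar": "NQ",
    "positive": "HKR",
    "negative": "DE",
    "tiny": "AG",
    "proline": "P",
    "cysteine": "C",
    "gap": "-X",
}

# Reverse lookup: one-letter residue code -> class name.
RESIDUE_TO_CLASS = {
    residue: name for name, members in CLASSES.items() for residue in members
}


def fraction_matching(seq_a, seq_b, key=lambda residue: residue):
    """Fraction of aligned positions whose keys match.

    ASSUMPTION: every column counts, including gap columns, and the
    denominator is the alignment length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    if not seq_a:
        raise ValueError("alignment is empty")
    matches = sum(1 for x, y in zip(seq_a, seq_b) if key(x) == key(y))
    return matches / len(seq_a)


def identity(seq_a, seq_b):
    """Identity: residues must be the same letter."""
    return fraction_matching(seq_a, seq_b)


def class_similarity(seq_a, seq_b):
    """Similarity: residues only need to fall in the same class."""
    to_class = lambda residue: RESIDUE_TO_CLASS.get(residue, residue)
    return fraction_matching(seq_a, seq_b, key=to_class)


a = "VLSTAK"
b = "ILTSGD"
print(identity(a, b))          # only L/L is an exact match
print(class_similarity(a, b))  # every pair except K/D shares a class
```

For these two toy sequences only one of six positions is identical, while five of six fall in the same class, so the similarity value is higher than the identity value.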

  2. Barry Grant

    What Lars describes above is for the H.10 ("entropy10") conservation score.

    The “similarity” score is defined as the average of the similarity scores of all pairwise residue comparisons for that position in the alignment, where the similarity score between any two residues is the score value between those residues in the chosen substitution matrix “sub.matrix” (which can be set to "bio3d", "blosum62" or "pam30").
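
The per-position averaging described above can be sketched as below. This is a Python illustration under stated assumptions, not the library's implementation: `TOY_MATRIX` holds made-up scores that are not taken from "bio3d", "blosum62" or "pam30", the helper names are hypothetical, and any normalization the real function may apply to the scores is not shown.

```python
from itertools import combinations

# Made-up symmetric scores for illustration only (NOT a real matrix).
TOY_MATRIX = {
    ("A", "A"): 2, ("G", "G"): 2, ("V", "V"): 2,
    ("A", "G"): 1, ("A", "V"): 0, ("G", "V"): -1,
}


def pair_score(x, y, matrix):
    """Look up a score regardless of the order of the two residues."""
    return matrix[(x, y)] if (x, y) in matrix else matrix[(y, x)]


def column_similarity(column, matrix):
    """Average substitution score over all residue pairs in one column."""
    pairs = list(combinations(column, 2))
    if not pairs:
        raise ValueError("need at least two sequences")
    return sum(pair_score(x, y, matrix) for x, y in pairs) / len(pairs)


def alignment_similarity(sequences, matrix):
    """One averaged score per alignment position."""
    return [column_similarity(column, matrix) for column in zip(*sequences)]


alignment = ["AV", "AV", "GV"]
print(alignment_similarity(alignment, TOY_MATRIX))
```

In the first column the three pairs are A/A, A/G and A/G, so the average is (2 + 1 + 1) / 3; the second column is fully conserved and scores the V/V value.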

  3. Swarnendu Tripathi

    After looking at the code, I think that what Lars described is correct for the function seqidentity(). What Barry described about “sub.matrix” applies to the conserv() function. Is there any possibility of computing the sequence similarity with a substitution matrix in the future?
