How does the function seqidentity() calculate sequence identity and similarity?
From the description, I gather that the sequence identity (or identity value) is "a single numeric score determined for each pair of aligned sequences. It measures the number of identical residues (“matches”) in relation to the length of the alignment." But what about similarity? How does the function calculate it?
Comments (5)
What Lars describes is for the H.10 (“entropy10”) conservation score.
The “similarity” score is defined as the average of the similarity scores of all pairwise residue comparisons for that position in the alignment, where the similarity score between any two residues is the score value between those residues in the chosen substitution matrix “sub.matrix” (which can be set to "bio3d", "blosum62" or "pam30").
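To make the averaging concrete, here is a minimal Python sketch of a per-position score computed as the mean substitution score over all residue pairs in an alignment column. It is only an illustration of the description above, not the bio3d code: the names `TOY_MATRIX`, `pair_score` and `column_similarity` are hypothetical, the matrix values are made up (they are not the "bio3d", "blosum62" or "pam30" values), and any gap handling or normalization that the real function may apply is not reproduced.

```python
from itertools import combinations

# Toy symmetric substitution scores. These values are invented for the
# example and are NOT taken from blosum62, pam30 or the bio3d matrix.
TOY_MATRIX = {
    ("A", "A"): 4, ("A", "G"): 0, ("A", "V"): 0,
    ("G", "G"): 6, ("G", "V"): -3,
    ("V", "V"): 4,
}


def pair_score(a, b, matrix=TOY_MATRIX):
    """Look up the score for residues a and b, treating the matrix as symmetric."""
    if (a, b) in matrix:
        return matrix[(a, b)]
    return matrix[(b, a)]


def column_similarity(column, matrix=TOY_MATRIX):
    """Average the substitution score over all residue pairs in one column."""
    pairs = list(combinations(column, 2))
    return sum(pair_score(a, b, matrix) for a, b in pairs) / len(pairs)


# Three aligned sequences of length 3 (rows are sequences).
alignment = ["AGV", "AGV", "AVV"]

# Transpose into columns, then score each alignment position.
columns = ["".join(seq[i] for seq in alignment) for i in range(len(alignment[0]))]
scores = [column_similarity(col) for col in columns]
print(scores)
```

A fully conserved column scores the diagonal value of the matrix, while a mixed column is pulled down by the unfavourable pairs.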
-
- changed status to resolved
Has been explained clearly.
-
After looking at the code, I think that for the function seqidentity() what Lars has mentioned is correct. What Barry has mentioned about “sub.matrix” applies to the conserv() function. Is there any possibility of computing the sequence similarity using a substitution matrix in the future?
Hi, for sequence similarity, the amino acid residues from the 22-letter alphabet are classified into one of 10 types, loosely following the convention of Mirny and Shakhnovich (1999): Hydrophobic/Aliphatic [V,I,L,M], Aromatic [F,W,Y], Ser/Thr [S,T], Polar [N,Q], Positive [H,K,R], Negative [D,E], Tiny [A,G], Proline [P], Cysteine [C], and Gaps [-,X]. After this classification, the calculation is done in the same way as you describe. See also the function entropy() for more on this. Lars
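The classification can be sketched in a few lines of Python. This is an illustration of the idea only, not the bio3d implementation: `GROUPS`, `RESIDUE_TO_GROUP` and `fraction_matching` are hypothetical names, and the sketch assumes the score is the number of matching positions divided by the full alignment length. How seqidentity() actually treats gaps and normalizes the score is not reproduced here.

```python
# The ten residue classes listed above, keyed by a descriptive name.
GROUPS = {
    "hydrophobic": "VILM",
    "aromatic": "FWY",
    "ser_thr": "ST",
    "polar": "NQ",
    "positive": "HKR",
    "negative": "DE",
    "tiny": "AG",
    "proline": "P",
    "cysteine": "C",
    "gap": "-X",
}

# Reverse lookup: residue letter -> class name.
RESIDUE_TO_GROUP = {
    res: name for name, members in GROUPS.items() for res in members
}


def fraction_matching(seq_a, seq_b, reduce=False):
    """Fraction of aligned positions that match.

    With reduce=False residues are compared directly (identity).
    With reduce=True each residue is first mapped to its class, so
    residues of the same type count as a match (similarity).
    Assumption: the denominator is the full alignment length.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    if reduce:
        seq_a = [RESIDUE_TO_GROUP[r] for r in seq_a]
        seq_b = [RESIDUE_TO_GROUP[r] for r in seq_b]
    matches = sum(a == b for a, b in zip(seq_a, seq_b))
    return matches / len(seq_a)


seq_a = "VLSTK"
seq_b = "ILTSD"
identity = fraction_matching(seq_a, seq_b)                  # only L/L matches
similarity = fraction_matching(seq_a, seq_b, reduce=True)   # K/D still differ
print(identity, similarity)
```

In this pair only one position is identical, but four of the five positions fall in the same residue class, so the similarity is much higher than the identity.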