sequence has more residues than PDB has Calpha's
when read.fasta.pdb() issues the warning sequence has more residues than PDB has Calpha's, the returned resno element differs from the ali element (ali is returned unchanged from the input alignment argument). would it make sense to correct the returned ali so that it corresponds to what is in the PDB?
this occurs when I use mustang for the sequence alignment with pdb input files containing residues without resolved C-alpha atoms (but e.g. only an N atom). it's not a big issue since gap.inspect(pdbs$resno) can be used instead of gap.inspect(pdbs$ali), and it might be justified to keep the ali unchanged. hmm...
Comments (24)
-
-
Probably should have this option, along with an amended warning message to indicate edit.
-
I temporarily fixed it with an option, fix.ali: if TRUE, check the consistence between $ali and $resno, and if they don't match correct $ali gap positions. However, this only works when missing CA atoms are on the C terminus, and will give an error otherwise, e.g.
> npdbs1 <- read.fasta.pdb(aln1) pdb/seq: 1 name: CA_missing_Cterm.pdb pdb/seq: 2 name: other.pdb Warning message: In FUN(1:2[[2L]], ...) : CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha's > npdbs1$ali [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349] CA_missing_Cterm.pdb "I" "I" "I" "K" "E" "N" "L" "K" other.pdb "I" "I" "I" "K" "E" "-" "-" "-" [,350] [,351] [,352] [,353] [,354] CA_missing_Cterm.pdb "D" "C" "G" "L" "-" other.pdb "-" "-" "-" "-" "-" [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349] CA_missing_Cterm.pdb "338" "339" "340" "341" "342" "343" "344" "345" other.pdb "338" "339" "340" "341" "342" NA NA NA [,350] [,351] [,352] [,353] [,354] CA_missing_Cterm.pdb "346" "347" NA NA NA other.pdb NA NA NA NA NA > npdbs1 <- read.fasta.pdb(aln1, fix.ali=TRUE) pdb/seq: 1 name: CA_missing_Cterm.pdb pdb/seq: 2 name: other.pdb Warning messages: 1: In FUN(1:2[[2L]], ...) : CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha\'s 2: In read.fasta.pdb(aln1, fix.ali = TRUE) : $ali component is modified to match $resno > npdbs1$ali [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349] CA_missing_Cterm.pdb "I" "I" "I" "K" "E" "N" "L" "K" other.pdb "I" "I" "I" "K" "E" "-" "-" "-" [,350] [,351] [,352] [,353] [,354] CA_missing_Cterm.pdb "D" "C" "-" "-" "-" other.pdb "-" "-" "-" "-" "-" > npdbs2 <- read.fasta.pdb(aln2, fix.ali=TRUE) pdb/seq: 1 name: CA_missing_middle.pdb ERROR: CA_missing_middle.pdb alignment and pdb sequences do not match ...
-
Going back to the original question here. It is clear that MUSTANG needs 'cleaned' PDB input files so I think this should be taken care of in the actual mustang() bio3d function itself.
Obviously, such a mismatch could still occur in which case the current error message is spot on.
However, I have been wanting to provide a better format of Error message output showing the sequence-structure miss-match along the lines of the print.fasta() but with conservation shown (as we discussed elsewhere).
I can put this on the ToDo list along with any suggestions for what might be helpful here...
-
reporter right.. perhaps I can change mustang to write CA coordinates to disk, and use these as input instead.
-
reporter -
assigned issue to
-
assigned issue to
-
Is this still an open issue guys?
-
reporter I think so... I will check
-
reporter ids = c("1cdk_A", "3dnd_A") rawfiles <- get.pdb(ids) files <- pdbsplit(rawfiles, ids) aln1 <- mustang(files, cleanpdb=TRUE) pdbs1 <- read.fasta.pdb(aln1) aln2 <- mustang(files, cleanpdb=FALSE) pdbs2 <- read.fasta.pdb(aln2)
Proposed workaround is to map any non-standard residues to standard residues (e.g. SEP->SER, etc). Not ideal, since atom.select(pdb, "calpha") omits non-standard residues which might in fact be a residue.
-
reporter issue linked to #43
-
reporter Error reporting has been updated. See commit 216bb88.
I think this should be ok now, but feel free to test it.
> ids = c("1cdk_A", "3dnd_A") > rawfiles <- get.pdb(ids) > files <- pdbsplit(rawfiles, ids) > aln2 <- mustang(files, cleanpdb=FALSE) > pdbs2 <- read.fasta.pdb(aln2) pdb/seq: 1 name: split_chain/1cdk_A.pdb ERROR Alignment mismatch. See alignment below for further details (row 1 of aln and sequence of 'split_chain/1cdk_A.pdb'). First mismatch residue in PDB is: TPO-197 (A) occurring at alignment position: 190 1 . . . . . . 70 aliseq KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ pdbseq KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ ********************************************************************** 1 . . . . . . 70 71 . . . . . . 140 aliseq KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA pdbseq KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA ********************************************************************** 71 . . . . . . 140 141 . . . . . . 210 aliseq AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWLCGTPEYLAPEIILSKGYNKA pdbseq AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEYLAPEIILSKGYNK ************************************************* *^ 141 . . . . . . 210 211 . . . . . . 280 aliseq VDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGV pdbseq AVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG * ^^ *^ * * ^ * * * ^ 211 . . . . . . 280 281 . . . . . . 342 aliseq NDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSEF pdbseq VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSE ^ ^ * ^ ^ * *** 281 . . . . . . 342 Error: 1cdk_A alignment and pdb sequences do not match
-
Looks great!
-
Indeed, this is much better. In the last line could also add the position information as sometimes the volumes output can lead to these being missed. E.g. The last line could say something like:
Error: 1cdk_A alignment and PDB sequence miss-match beginning position 190 (PDB RESNO 197)
-
reporter Ok. I'll add that. I also just realized that if the mismatch occurs in residues 1:15, the error message is just:
starting residues of sequence not found in PDB
> ids <- c("1rx2_A", "1rx4_A", "1rg7_A") > raw.files <- get.pdb(ids, path="raw_pdbs") > files <- pdbsplit( raw.files, ids ) > pdbs <- pdbaln(files) > pdbs$ali[1,5] <- "-" > tmp <- read.fasta.pdb(pdbs) pdb/seq: 1 name: split_chain/1rx2_A.pdb PDB has ALT records, taking A only, rm.alt=TRUE Error: split_chain/1rx2_A.pdb : starting residues of sequence not found in PDB
-
reporter Updated. Please check my revision in commit
> tmp <- read.fasta.pdb(pdbs) pdb/seq: 1 name: split_chain/1rx2_A.pdb PDB has ALT records, taking A only, rm.alt=TRUE ERROR Alignment mismatch. See alignment below for further details (row 1 of aln and sequence of 'split_chain/1rx2_A.pdb'). First mismatch residue in PDB is: ILE-5 (A) occurring at alignment position: 6 1 . . . . . . 70 aliseq MISL-AALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDD pdbseq MISL-IAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTD ****^ * ^ ^ ^ ^^ ^ ^ *^ * * 1 . . . . . . 70 71 . . . . . . 140 aliseq RVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEF pdbseq DRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSE ^ * ^^^ ** * 71 . . . . . . 140 141 . 159 aliseq HDADAQNSHSYCFEILERR pdbseq FHDADAQNSHSYCFEILER ^ ^ * 141 . 159 Error: 1rx2_A alignment and PDB sequence miss-match beginning at position 6 (PDB RESNO ILE-5 (A))
-
This looks like a good improvement from my quick tests (see below), and the detailed sequence print out is obviously more helpful than before. At the moment we fail if we have missing residues anywhere BUT the C-termini (again see below, Error with N-termini or middle missing but Warning with C-termini missing) - is this what we want?
library(bio3d) ids <- c("1rx2_A", "1rx4_A", "1rg7_A") raw.files <- get.pdb(ids, path="raw_pdbs") files <- pdbsplit( raw.files, ids ) pdbs <- pdbaln(files) npdbs <- pdbs id <- npdbs$id[1] pdb <- read.pdb(id) ##- Remove N-termini residues test npdb <- trim.pdb(pdb, resno=1:5, inverse=TRUE) write.pdb(npdb, file="tmp.pdb") tmp <- read.fasta.pdb(npdbs) # Error: tmp alignment and PDB sequence miss-match # beginning at position 1 (PDB RESNO ALA-6 (A)) # In addition: Warning message: # tmp.pdb : sequence has more residues than PDB has Calpha's ##- Remove C-termini residues test npdb <- trim.pdb(pdb, resno=155:159, inverse=TRUE) write.pdb(npdb, file="tmp.pdb") tmp <- read.fasta.pdb(npdbs) # Warning message: # tmp.pdb : sequence has more residues than PDB has Calpha's ##- Remove middle residues test npdb <- trim.pdb(pdb, resno=30:35, inverse=TRUE) write.pdb(npdb, file="tmp.pdb") tmp <- read.fasta.pdb(npdbs) # Error: tmp alignment and PDB sequence miss-match # beginning at position 30 (PDB RESNO LEU-36 (A)) # In addition: Warning message: # tmp.pdb : sequence has more residues than PDB has Calpha's
I guess having the C-termini pass with just a warning could come back to haunt people later on if they are checking for missing residues in the alignment and the the coordinates/resno etc....
-
I guess it might be better to be consistent, i.e. stopping for mismatching at any place, isn't it?
-
reporter - changed component to ToDo
- marked as enhancement
- changed version to v2.2 [devel]
-
- changed version to v2.2
-
Should we proceed with the suggestion, i.e. stopping and giving an error when a mismatch occurs between sequence alignment and pdb structure, wherever the mismatch is located?
-
can you summarize where we stand currently - looks like original issue is solved but other things (inconsistencies) were noted in the thread. Thanks!
-
Yes. The main issue discussed here is how the function read.fasta.pdb() deals with the mismatch between sequence alignment and pdb structures. The current strategy is:
- If all the amino acids in the sequence can find matched residues in PDB, no message.
- If some amino acids at the C-terminal of the sequence cannot find matched residues in PDB, it just shoots a warning.
- For all other cases, it stops and shoots an error message with detailed information about the mismatch.
The suggestion is to make it more consistent, i.e. wherever the mismatch occurs the function stops and shoots error. What do you think?
-
Thanks for summarizing!
I actually like it the way it is (with an exit and useful miss-match message for all but the C-term). Allowing some ambiguity at the ends is useful. We could have a
strict=TRUE
option in the future if run into some problems with this down the road.Happy to close this issue
-
- changed status to resolved
We leave the way it is, i.e. allowing some ambiguity at the ends. We could have a
strict=TRUE
option in the future if run into some problems with this down the road. - Log in to comment
Never think about this... Maybe an option to toggle the correction of $ali? By default, it should be unchanged but print warning message to indicate the inconsistency.