sequence has more residues than PDB has Calpha's

Issue #82 resolved

Lars Skjærven created an issue 2014-03-27

when read.fasta.pdb() issues the warning sequence has more residues than PDB has Calpha's, the returned resno element differs from the ali element (ali is returned unchanged from the input alignment argument). would it make sense to correct the returned ali so that it corresponds to what is in the PDB?

this occurs when I use mustang for the sequence alignment with pdb input files containing residues without resolved C-alpha atoms (but e.g. only an N atom). it's not a big issue since gap.inspect(pdbs$resno) can be used instead of gap.inspect(pdbs$ali), and it might be justified to keep the ali unchanged. hmm...

Comments (24)

Xinqiu Yao
Never think about this... Maybe an option to toggle the correction of $ali? By default, it should be unchanged but print warning message to indicate the inconsistency.
- 2014-03-27T16:45:17+00:00
Barry Grant
Probably should have this option, along with an amended warning message to indicate edit.
- 2014-05-14T21:14:54+00:00

Xinqiu Yao

I temporarily fixed it with an option, fix.ali: if TRUE, check the consistence between $ali and $resno, and if they don't match correct $ali gap positions. However, this only works when missing CA atoms are on the C terminus, and will give an error otherwise, e.g.

> npdbs1 <- read.fasta.pdb(aln1)
pdb/seq: 1   name: CA_missing_Cterm.pdb 
pdb/seq: 2   name: other.pdb 
Warning message:
In FUN(1:2[[2L]], ...) :
  CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha's
> npdbs1$ali
                     [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
CA_missing_Cterm.pdb "I"    "I"    "I"    "K"    "E"    "N"    "L"    "K"   
other.pdb            "I"    "I"    "I"    "K"    "E"    "-"    "-"    "-"   
                     [,350] [,351] [,352] [,353] [,354]
CA_missing_Cterm.pdb "D"    "C"    "G"    "L"    "-"   
other.pdb            "-"    "-"    "-"    "-"    "-"   
                     [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
CA_missing_Cterm.pdb "338"  "339"  "340"  "341"  "342"  "343"  "344"  "345" 
other.pdb            "338"  "339"  "340"  "341"  "342"  NA     NA     NA    
                     [,350] [,351] [,352] [,353] [,354]
CA_missing_Cterm.pdb "346"  "347"  NA     NA     NA    
other.pdb            NA     NA     NA     NA     NA    

> npdbs1 <- read.fasta.pdb(aln1, fix.ali=TRUE)
pdb/seq: 1   name: CA_missing_Cterm.pdb 
pdb/seq: 2   name: other.pdb 
Warning messages:
1: In FUN(1:2[[2L]], ...) :
  CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha\'s
2: In read.fasta.pdb(aln1, fix.ali = TRUE) :
  $ali component is modified to match $resno
> npdbs1$ali

                     [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
CA_missing_Cterm.pdb "I"    "I"    "I"    "K"    "E"    "N"    "L"    "K"   
other.pdb            "I"    "I"    "I"    "K"    "E"    "-"    "-"    "-"   
                     [,350] [,351] [,352] [,353] [,354]
CA_missing_Cterm.pdb "D"    "C"    "-"    "-"    "-"   
other.pdb            "-"    "-"    "-"    "-"    "-"   

> npdbs2 <- read.fasta.pdb(aln2, fix.ali=TRUE)
pdb/seq: 1   name: CA_missing_middle.pdb 
ERROR: CA_missing_middle.pdb alignment and pdb sequences do not match 
...

2014-05-15T00:02:47+00:00

Barry Grant
Going back to the original question here. It is clear that MUSTANG needs 'cleaned' PDB input files so I think this should be taken care of in the actual mustang() bio3d function itself.

Obviously, such a mismatch could still occur in which case the current error message is spot on.

However, I have been wanting to provide a better format of Error message output showing the sequence-structure miss-match along the lines of the print.fasta() but with conservation shown (as we discussed elsewhere).

I can put this on the ToDo list along with any suggestions for what might be helpful here...
- 2014-05-28T02:34:04+00:00
Lars Skjærven reporter
right.. perhaps I can change mustang to write CA coordinates to disk, and use these as input instead.
- 2014-05-29T10:26:55+00:00
Lars Skjærven reporter
- assigned issue to
  
  Lars Skjærven
- 2014-05-29T10:27:02+00:00
Barry Grant
Is this still an open issue guys?
- 2014-10-31T15:52:53+00:00
Lars Skjærven reporter
I think so... I will check
- 2014-10-31T19:54:58+00:00

Lars Skjærven reporter

ids = c("1cdk_A", "3dnd_A")
rawfiles <- get.pdb(ids)
files <- pdbsplit(rawfiles, ids)

aln1 <- mustang(files, cleanpdb=TRUE)
pdbs1 <- read.fasta.pdb(aln1)

aln2 <- mustang(files, cleanpdb=FALSE)
pdbs2 <- read.fasta.pdb(aln2)

Proposed workaround is to map any non-standard residues to standard residues (e.g. SEP->SER, etc). Not ideal, since atom.select(pdb, "calpha") omits non-standard residues which might in fact be a residue.

2014-11-18T19:55:29+00:00

Lars Skjærven reporter
issue linked to #43
- 2014-11-24T19:48:29+00:00

Lars Skjærven reporter

Error reporting has been updated. See commit 216bb88.

I think this should be ok now, but feel free to test it.

> ids = c("1cdk_A", "3dnd_A")
> rawfiles <- get.pdb(ids)
> files <- pdbsplit(rawfiles, ids)

> aln2 <- mustang(files, cleanpdb=FALSE)
> pdbs2 <- read.fasta.pdb(aln2)
pdb/seq: 1   name: split_chain/1cdk_A.pdb 

 ERROR   Alignment mismatch. See alignment below for further details
         (row 1 of aln and sequence of 'split_chain/1cdk_A.pdb').
         First mismatch residue in PDB is: TPO-197 (A) 
         occurring at alignment position: 190 

          1        .         .         .         .         .         .         70 
aliseq   KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ
pdbseq   KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ
         ********************************************************************** 
         1        .         .         .         .         .         .         70 

        71        .         .         .         .         .         .         140 
aliseq   KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA
pdbseq   KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA
         ********************************************************************** 
        71        .         .         .         .         .         .         140 

       141        .         .         .         .         .         .         210 
aliseq   AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWLCGTPEYLAPEIILSKGYNKA
pdbseq   AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEYLAPEIILSKGYNK
         *************************************************            *^        
       141        .         .         .         .         .         .         210 

       211        .         .         .         .         .         .         280 
aliseq   VDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGV
pdbseq   AVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG
            *    ^^    *^  * *           ^           *     *   *      ^         
       211        .         .         .         .         .         .         280 

       281        .         .         .         .         .         . 342 
aliseq   NDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSEF
pdbseq   VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSE
               ^ ^  *        ^               ^   *  ***                 
       281        .         .         .         .         .         . 342 

Error: 1cdk_A alignment and pdb sequences do not match

2015-02-15T17:47:18+00:00

Xinqiu Yao
Looks great!
- 2015-02-16T15:13:41+00:00
Barry Grant
Indeed, this is much better. In the last line could also add the position information as sometimes the volumes output can lead to these being missed. E.g. The last line could say something like:
```
Error: 1cdk_A alignment and PDB sequence miss-match beginning position 190 (PDB RESNO 197)
```
- 2015-02-16T17:10:36+00:00

Lars Skjærven reporter

Ok. I'll add that. I also just realized that if the mismatch occurs in residues 1:15, the error message is just:

starting residues of sequence not found in PDB

> ids <- c("1rx2_A", "1rx4_A", "1rg7_A") 
> raw.files <- get.pdb(ids, path="raw_pdbs")
> files     <- pdbsplit( raw.files, ids )
> pdbs <- pdbaln(files)
> pdbs$ali[1,5] <- "-"

> tmp <- read.fasta.pdb(pdbs)
pdb/seq: 1   name: split_chain/1rx2_A.pdb 
   PDB has ALT records, taking A only, rm.alt=TRUE
Error: split_chain/1rx2_A.pdb : starting residues of sequence not found in PDB

2015-02-17T12:52:16+00:00

Lars Skjærven reporter

Updated. Please check my revision in commit

> tmp <- read.fasta.pdb(pdbs)
pdb/seq: 1   name: split_chain/1rx2_A.pdb 
   PDB has ALT records, taking A only, rm.alt=TRUE

 ERROR   Alignment mismatch. See alignment below for further details
         (row 1 of aln and sequence of 'split_chain/1rx2_A.pdb').
         First mismatch residue in PDB is: ILE-5 (A) 
         occurring at alignment position: 6 

         1        .         .         .         .         .         .         70 
aliseq   MISL-AALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDD
pdbseq   MISL-IAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTD
         ****^ *      ^                ^ ^       ^^  ^            ^  *^ *     * 
         1        .         .         .         .         .         .         70 

        71        .         .         .         .         .         .         140 
aliseq   RVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEF
pdbseq   DRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSE
                  ^   *       ^^^ **                                  *         
        71        .         .         .         .         .         .         140 

       141        .        159 
aliseq   HDADAQNSHSYCFEILERR
pdbseq   FHDADAQNSHSYCFEILER
               ^        ^  * 
       141        .        159 

Error: 1rx2_A alignment and PDB sequence miss-match
       beginning at position 6 (PDB RESNO ILE-5 (A))

2015-02-17T13:14:50+00:00

Barry Grant

This looks like a good improvement from my quick tests (see below), and the detailed sequence print out is obviously more helpful than before. At the moment we fail if we have missing residues anywhere BUT the C-termini (again see below, Error with N-termini or middle missing but Warning with C-termini missing) - is this what we want?

library(bio3d)
ids <- c("1rx2_A", "1rx4_A", "1rg7_A")
raw.files <- get.pdb(ids, path="raw_pdbs")
files     <- pdbsplit( raw.files, ids )
pdbs <- pdbaln(files)
npdbs <- pdbs
id <- npdbs$id[1]
pdb <- read.pdb(id)

##- Remove N-termini residues test
npdb <- trim.pdb(pdb, resno=1:5, inverse=TRUE)
write.pdb(npdb, file="tmp.pdb")
tmp <- read.fasta.pdb(npdbs)

# Error: tmp alignment and PDB sequence miss-match
#       beginning at position 1 (PDB RESNO ALA-6 (A))
# In addition: Warning message:
# tmp.pdb : sequence has more residues than PDB has Calpha's


##- Remove C-termini residues test
npdb <- trim.pdb(pdb, resno=155:159, inverse=TRUE)
write.pdb(npdb, file="tmp.pdb")
tmp <- read.fasta.pdb(npdbs)

# Warning message:
# tmp.pdb : sequence has more residues than PDB has Calpha's


##- Remove middle residues test
npdb <- trim.pdb(pdb, resno=30:35, inverse=TRUE)
write.pdb(npdb, file="tmp.pdb")
tmp <- read.fasta.pdb(npdbs)

# Error: tmp alignment and PDB sequence miss-match
#       beginning at position 30 (PDB RESNO LEU-36 (A))
# In addition: Warning message:
# tmp.pdb : sequence has more residues than PDB has Calpha's

I guess having the C-termini pass with just a warning could come back to haunt people later on if they are checking for missing residues in the alignment and the the coordinates/resno etc....

2015-02-17T17:12:23+00:00

Xinqiu Yao
I guess it might be better to be consistent, i.e. stopping for mismatching at any place, isn't it?
- 2015-02-17T17:30:24+00:00
Lars Skjærven reporter
- changed component to ToDo
- marked as enhancement
- changed version to v2.2 [devel]
- 2015-02-26T13:26:30+00:00
Xinqiu Yao
- changed version to v2.2
- 2016-09-21T15:00:20+00:00
Xinqiu Yao
Should we proceed with the suggestion, i.e. stopping and giving an error when a mismatch occurs between sequence alignment and pdb structure, wherever the mismatch is located?
- 2016-09-27T16:40:12+00:00
Barry Grant
can you summarize where we stand currently - looks like original issue is solved but other things (inconsistencies) were noted in the thread. Thanks!
- 2016-09-27T16:56:37+00:00
Xinqiu Yao
Yes. The main issue discussed here is how the function read.fasta.pdb() deals with the mismatch between sequence alignment and pdb structures. The current strategy is:
- If all the amino acids in the sequence can find matched residues in PDB, no message.
- If some amino acids at the C-terminal of the sequence cannot find matched residues in PDB, it just shoots a warning.
- For all other cases, it stops and shoots an error message with detailed information about the mismatch.
The suggestion is to make it more consistent, i.e. wherever the mismatch occurs the function stops and shoots error. What do you think?
- 2016-09-27T18:01:07+00:00
Barry Grant
Thanks for summarizing!

I actually like it the way it is (with an exit and useful miss-match message for all but the C-term). Allowing some ambiguity at the ends is useful. We could have a strict=TRUE option in the future if run into some problems with this down the road.

Happy to close this issue
- 2016-09-27T19:15:51+00:00
Xinqiu Yao
- changed status to resolved
We leave the way it is, i.e. allowing some ambiguity at the ends. We could have a strict=TRUE option in the future if run into some problems with this down the road.
- 2016-09-27T20:02:30+00:00
Log in to comment

Assignee: Lars Skjærven

Type: enhancement

Priority: minor

Status: resolved

Component: ToDo

Version: v2.2

Votes: 0

Watchers: 3