sequence has more residues than PDB has Calpha's

Issue #82 resolved
Lars Skjærven created an issue

when read.fasta.pdb() issues the warning sequence has more residues than PDB has Calpha's, the returned resno element differs from the ali element (ali is returned unchanged from the input alignment argument). would it make sense to correct the returned ali so that it corresponds to what is in the PDB?

this occurs when I use mustang for the sequence alignment with pdb input files containing residues without resolved C-alpha atoms (but e.g. only an N atom). it's not a big issue since gap.inspect(pdbs$resno) can be used instead of gap.inspect(pdbs$ali), and it might be justified to keep the ali unchanged. hmm...

Comments (24)

  1. Xinqiu Yao

    Never think about this... Maybe an option to toggle the correction of $ali? By default, it should be unchanged but print warning message to indicate the inconsistency.

  2. Xinqiu Yao

    I temporarily fixed it with an option, fix.ali: if TRUE, check the consistence between $ali and $resno, and if they don't match correct $ali gap positions. However, this only works when missing CA atoms are on the C terminus, and will give an error otherwise, e.g.

    > npdbs1 <- read.fasta.pdb(aln1)
    pdb/seq: 1   name: CA_missing_Cterm.pdb 
    pdb/seq: 2   name: other.pdb 
    Warning message:
    In FUN(1:2[[2L]], ...) :
      CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha's
    > npdbs1$ali
                         [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
    CA_missing_Cterm.pdb "I"    "I"    "I"    "K"    "E"    "N"    "L"    "K"   
    other.pdb            "I"    "I"    "I"    "K"    "E"    "-"    "-"    "-"   
                         [,350] [,351] [,352] [,353] [,354]
    CA_missing_Cterm.pdb "D"    "C"    "G"    "L"    "-"   
    other.pdb            "-"    "-"    "-"    "-"    "-"   
                         [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
    CA_missing_Cterm.pdb "338"  "339"  "340"  "341"  "342"  "343"  "344"  "345" 
    other.pdb            "338"  "339"  "340"  "341"  "342"  NA     NA     NA    
                         [,350] [,351] [,352] [,353] [,354]
    CA_missing_Cterm.pdb "346"  "347"  NA     NA     NA    
    other.pdb            NA     NA     NA     NA     NA    
    
    > npdbs1 <- read.fasta.pdb(aln1, fix.ali=TRUE)
    pdb/seq: 1   name: CA_missing_Cterm.pdb 
    pdb/seq: 2   name: other.pdb 
    Warning messages:
    1: In FUN(1:2[[2L]], ...) :
      CA_missing_Cterm.pdb : sequence has more residues than PDB has Calpha\'s
    2: In read.fasta.pdb(aln1, fix.ali = TRUE) :
      $ali component is modified to match $resno
    > npdbs1$ali
    
                         [,342] [,343] [,344] [,345] [,346] [,347] [,348] [,349]
    CA_missing_Cterm.pdb "I"    "I"    "I"    "K"    "E"    "N"    "L"    "K"   
    other.pdb            "I"    "I"    "I"    "K"    "E"    "-"    "-"    "-"   
                         [,350] [,351] [,352] [,353] [,354]
    CA_missing_Cterm.pdb "D"    "C"    "-"    "-"    "-"   
    other.pdb            "-"    "-"    "-"    "-"    "-"   
    
    > npdbs2 <- read.fasta.pdb(aln2, fix.ali=TRUE)
    pdb/seq: 1   name: CA_missing_middle.pdb 
    ERROR: CA_missing_middle.pdb alignment and pdb sequences do not match 
    ...
    
  3. Barry Grant

    Going back to the original question here. It is clear that MUSTANG needs 'cleaned' PDB input files so I think this should be taken care of in the actual mustang() bio3d function itself.

    Obviously, such a mismatch could still occur in which case the current error message is spot on.

    However, I have been wanting to provide a better format of Error message output showing the sequence-structure miss-match along the lines of the print.fasta() but with conservation shown (as we discussed elsewhere).

    I can put this on the ToDo list along with any suggestions for what might be helpful here...

  4. Lars Skjærven reporter

    right.. perhaps I can change mustang to write CA coordinates to disk, and use these as input instead.

  5. Lars Skjærven reporter
    ids = c("1cdk_A", "3dnd_A")
    rawfiles <- get.pdb(ids)
    files <- pdbsplit(rawfiles, ids)
    
    aln1 <- mustang(files, cleanpdb=TRUE)
    pdbs1 <- read.fasta.pdb(aln1)
    
    aln2 <- mustang(files, cleanpdb=FALSE)
    pdbs2 <- read.fasta.pdb(aln2)
    

    Proposed workaround is to map any non-standard residues to standard residues (e.g. SEP->SER, etc). Not ideal, since atom.select(pdb, "calpha") omits non-standard residues which might in fact be a residue.

  6. Lars Skjærven reporter

    Error reporting has been updated. See commit 216bb88.

    I think this should be ok now, but feel free to test it.

    > ids = c("1cdk_A", "3dnd_A")
    > rawfiles <- get.pdb(ids)
    > files <- pdbsplit(rawfiles, ids)
    
    > aln2 <- mustang(files, cleanpdb=FALSE)
    > pdbs2 <- read.fasta.pdb(aln2)
    pdb/seq: 1   name: split_chain/1cdk_A.pdb 
    
     ERROR   Alignment mismatch. See alignment below for further details
             (row 1 of aln and sequence of 'split_chain/1cdk_A.pdb').
             First mismatch residue in PDB is: TPO-197 (A) 
             occurring at alignment position: 190 
    
              1        .         .         .         .         .         .         70 
    aliseq   KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ
    pdbseq   KGSEQESVKEFLAKAKEDFLKKWENPAQNTAHLDQFERIKTLGTGSFGRVMLVKHKETGNHFAMKILDKQ
             ********************************************************************** 
             1        .         .         .         .         .         .         70 
    
            71        .         .         .         .         .         .         140 
    aliseq   KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA
    pdbseq   KVVKLKQIEHTLNEKRILQAVNFPFLVKLEYSFKDNSNLYMVMEYVPGGEMFSHLRRIGRFSEPHARFYA
             ********************************************************************** 
            71        .         .         .         .         .         .         140 
    
           141        .         .         .         .         .         .         210 
    aliseq   AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWLCGTPEYLAPEIILSKGYNKA
    pdbseq   AQIVLTFEYLHSLDLIYRDLKPENLLIDQQGYIQVTDFGFAKRVKGRTWTLCGTPEYLAPEIILSKGYNK
             *************************************************            *^        
           141        .         .         .         .         .         .         210 
    
           211        .         .         .         .         .         .         280 
    aliseq   VDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDGV
    pdbseq   AVDWWALGVLIYEMAAGYPPFFADQPIQIYEKIVSGKVRFPSHFSSDLKDLLRNLLQVDLTKRFGNLKDG
                *    ^^    *^  * *           ^           *     *   *      ^         
           211        .         .         .         .         .         .         280 
    
           281        .         .         .         .         .         . 342 
    aliseq   NDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSEF
    pdbseq   VNDIKNHKWFATTDWIAIYQRKVEAPFIPKFKGPGDTSNFDDYEEEEIRVSINEKCGKEFSE
                   ^ ^  *        ^               ^   *  ***                 
           281        .         .         .         .         .         . 342 
    
    Error: 1cdk_A alignment and pdb sequences do not match
    
  7. Barry Grant

    Indeed, this is much better. In the last line could also add the position information as sometimes the volumes output can lead to these being missed. E.g. The last line could say something like:

    Error: 1cdk_A alignment and PDB sequence miss-match beginning position 190 (PDB RESNO 197)
    
  8. Lars Skjærven reporter

    Ok. I'll add that. I also just realized that if the mismatch occurs in residues 1:15, the error message is just:

    starting residues of sequence not found in PDB
    
    > ids <- c("1rx2_A", "1rx4_A", "1rg7_A") 
    > raw.files <- get.pdb(ids, path="raw_pdbs")
    > files     <- pdbsplit( raw.files, ids )
    > pdbs <- pdbaln(files)
    > pdbs$ali[1,5] <- "-"
    
    > tmp <- read.fasta.pdb(pdbs)
    pdb/seq: 1   name: split_chain/1rx2_A.pdb 
       PDB has ALT records, taking A only, rm.alt=TRUE
    Error: split_chain/1rx2_A.pdb : starting residues of sequence not found in PDB
    
  9. Lars Skjærven reporter

    Updated. Please check my revision in commit

    > tmp <- read.fasta.pdb(pdbs)
    pdb/seq: 1   name: split_chain/1rx2_A.pdb 
       PDB has ALT records, taking A only, rm.alt=TRUE
    
     ERROR   Alignment mismatch. See alignment below for further details
             (row 1 of aln and sequence of 'split_chain/1rx2_A.pdb').
             First mismatch residue in PDB is: ILE-5 (A) 
             occurring at alignment position: 6 
    
             1        .         .         .         .         .         .         70 
    aliseq   MISL-AALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDD
    pdbseq   MISL-IAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTD
             ****^ *      ^                ^ ^       ^^  ^            ^  *^ *     * 
             1        .         .         .         .         .         .         70 
    
            71        .         .         .         .         .         .         140 
    aliseq   RVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEF
    pdbseq   DRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSE
                      ^   *       ^^^ **                                  *         
            71        .         .         .         .         .         .         140 
    
           141        .        159 
    aliseq   HDADAQNSHSYCFEILERR
    pdbseq   FHDADAQNSHSYCFEILER
                   ^        ^  * 
           141        .        159 
    
    Error: 1rx2_A alignment and PDB sequence miss-match
           beginning at position 6 (PDB RESNO ILE-5 (A))
    
  10. Barry Grant

    This looks like a good improvement from my quick tests (see below), and the detailed sequence print out is obviously more helpful than before. At the moment we fail if we have missing residues anywhere BUT the C-termini (again see below, Error with N-termini or middle missing but Warning with C-termini missing) - is this what we want?

    library(bio3d)
    ids <- c("1rx2_A", "1rx4_A", "1rg7_A")
    raw.files <- get.pdb(ids, path="raw_pdbs")
    files     <- pdbsplit( raw.files, ids )
    pdbs <- pdbaln(files)
    npdbs <- pdbs
    id <- npdbs$id[1]
    pdb <- read.pdb(id)
    
    ##- Remove N-termini residues test
    npdb <- trim.pdb(pdb, resno=1:5, inverse=TRUE)
    write.pdb(npdb, file="tmp.pdb")
    tmp <- read.fasta.pdb(npdbs)
    
    # Error: tmp alignment and PDB sequence miss-match
    #       beginning at position 1 (PDB RESNO ALA-6 (A))
    # In addition: Warning message:
    # tmp.pdb : sequence has more residues than PDB has Calpha's
    
    
    ##- Remove C-termini residues test
    npdb <- trim.pdb(pdb, resno=155:159, inverse=TRUE)
    write.pdb(npdb, file="tmp.pdb")
    tmp <- read.fasta.pdb(npdbs)
    
    # Warning message:
    # tmp.pdb : sequence has more residues than PDB has Calpha's
    
    
    ##- Remove middle residues test
    npdb <- trim.pdb(pdb, resno=30:35, inverse=TRUE)
    write.pdb(npdb, file="tmp.pdb")
    tmp <- read.fasta.pdb(npdbs)
    
    # Error: tmp alignment and PDB sequence miss-match
    #       beginning at position 30 (PDB RESNO LEU-36 (A))
    # In addition: Warning message:
    # tmp.pdb : sequence has more residues than PDB has Calpha's
    

    I guess having the C-termini pass with just a warning could come back to haunt people later on if they are checking for missing residues in the alignment and the the coordinates/resno etc....

  11. Xinqiu Yao

    I guess it might be better to be consistent, i.e. stopping for mismatching at any place, isn't it?

  12. Xinqiu Yao

    Should we proceed with the suggestion, i.e. stopping and giving an error when a mismatch occurs between sequence alignment and pdb structure, wherever the mismatch is located?

  13. Barry Grant

    can you summarize where we stand currently - looks like original issue is solved but other things (inconsistencies) were noted in the thread. Thanks!

  14. Xinqiu Yao

    Yes. The main issue discussed here is how the function read.fasta.pdb() deals with the mismatch between sequence alignment and pdb structures. The current strategy is:

    • If all the amino acids in the sequence can find matched residues in PDB, no message.
    • If some amino acids at the C-terminal of the sequence cannot find matched residues in PDB, it just shoots a warning.
    • For all other cases, it stops and shoots an error message with detailed information about the mismatch.

    The suggestion is to make it more consistent, i.e. wherever the mismatch occurs the function stops and shoots error. What do you think?

  15. Barry Grant

    Thanks for summarizing!

    I actually like it the way it is (with an exit and useful miss-match message for all but the C-term). Allowing some ambiguity at the ends is useful. We could have a strict=TRUE option in the future if run into some problems with this down the road.

    Happy to close this issue

  16. Xinqiu Yao

    We leave the way it is, i.e. allowing some ambiguity at the ends. We could have a strict=TRUE option in the future if run into some problems with this down the road.

  17. Log in to comment