Should we include a $sse in aligned 'pdbs' class objects?

Issue #140 resolved
Barry Grant created an issue

Looking at the recent dssp.pdbs() function it strikes me that we have all the elements to extract a similar 'aligned secondary structure definition' from the input PDB files themselves, rather than requiring dssp() or stride() to do this in dssp.pdbs().

For example, in the new plot.bio3d() there is a function pdb2sse() that will output a sse vector similar to that output by dssp.pdbs(). Therefore we can just call this and save an alignment matrix of sse values per structure within read.fasta.pdb().

Here is pdb2sse() as it is within plot.bio3d() currently. Note that it only takes H and E as that is all we typically care to annotate in plots...

So my question is: Would this aligned $sse be useful to have in pdbs class objects? Also more generally would a $sse vector be useful to have within a pdb class object itself?

Currently I think the answer to the last one is no - as only plot.bio3d() uses this currently and it will slow the already slow read.pdb() down even further. What do others think about the aligned 'pdbs' objects?

  pdb2sse <- function(pdb) {
    ##- Function to obtain an SSE sequence vector from a PDB object
    ##   Result similar to that returned by stride(pdb)$sse and dssp(pdb)$sse
    ##   This could be incorporated into read.pdb() if found to be more generally useful

    if(is.null(pdb$helix) & is.null(pdb$sheet)) {
      warning("No helix and sheet defined in input 'sse' PDB object: try using dssp()")
      ##ss <- try(dssp(pdb)$sse)
      ## Probably best to get user to do this separately due to possible 'exefile' problems etc..
      return(NULL)
    }
    rn <- pdb$atom[pdb$calpha, c("resno", "chain")]
    ss <- rep(" ", nrow(rn))
    names(ss) = paste(rn$resno,rn$chain,sep="_")

    for(i in 1:length(pdb$helix$start)) {
      ss[ (rn$chain==pdb$helix$chain[i] &
           rn$resno >= pdb$helix$start[i] &
           rn$resno <= pdb$helix$end[i])] = "H"
    }
    for(i in 1:length(pdb$sheet$start)) {
      ss[ (rn$chain==pdb$sheet$chain[i] &
           rn$resno >= pdb$sheet$start[i] &
           rn$resno <= pdb$sheet$end[i])] = "E"
    }
    return(ss)
  }

Comments (10)

  1. Lars Skjærven

    good point. can we add it to read.fasta.pdb without affecting the running time much? we can still keep the dssp.pdbs() since it has a slightly different use (otherwise you wouldn't need dssp.pdb() either). another alternative is to adjust dssp.pdbs so that the user can choose to read the SSE data directly from PDB or call DSSP?

    I think I agree that we dont need it in the PDB.

  2. Barry Grant reporter

    I have put a function with this sse reading capability in the 'new_funs/' dir. I wonder what is the best format for the sse alignment matrix? Should gap positions be NA and coil positions be " " (space) or should they all be "-". If the later how do we distinguish missing positions? The dssp.pdbs() puts everything to gaps "-". See examples below:

    > library(bio3d)
    > source("read.fasta.pdb2.R")
    >
    > # Some online examples of aligned PDBs with SSE annotation
    > aln  <- read.fasta( system.file("examples/kif1a.fa",package="bio3d") )
    > pdbs2 <- read.fasta.pdb2(aln)
    
    > ## Note that gaps are NA and coil positions " " currently
    > pdbs2$sse[,335:344]
                                           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
    http://www.rcsb.org/pdb/files/1bg2.pdb "H"  "H"  "H"  "H"  "H"  "H"  " "  " "
    http://www.rcsb.org/pdb/files/1i6i.pdb "H"  "H"  "H"  "H"  "H"  "H"  "H"  "H"
    http://www.rcsb.org/pdb/files/1i5s.pdb "H"  "H"  "H"  "H"  "H"  "H"  "H"  "H"
    http://www.rcsb.org/pdb/files/2ncd.pdb "H"  "H"  "H"  "H"  "H"  "H"  "H"  " "
                                           [,9] [,10]
    http://www.rcsb.org/pdb/files/1bg2.pdb " "  " "
    http://www.rcsb.org/pdb/files/1i6i.pdb " "  " "
    http://www.rcsb.org/pdb/files/1i5s.pdb NA   NA
    http://www.rcsb.org/pdb/files/2ncd.pdb " "  " "
    
    
    
    > pdbs2$ali[,335:344]
                                           [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
    http://www.rcsb.org/pdb/files/1bg2.pdb "L"  "L"  "F"  "G"  "Q"  "R"  "A"  "K"
    http://www.rcsb.org/pdb/files/1i6i.pdb "L"  "R"  "Y"  "A"  "D"  "R"  "A"  "K"
    http://www.rcsb.org/pdb/files/1i5s.pdb "L"  "R"  "Y"  "A"  "D"  "R"  "A"  "K"
    http://www.rcsb.org/pdb/files/2ncd.pdb "L"  "R"  "F"  "A"  "A"  "S"  "V"  "N"
                                           [,9] [,10]
    http://www.rcsb.org/pdb/files/1bg2.pdb "T"  "I"
    http://www.rcsb.org/pdb/files/1i6i.pdb "Q"  "I"
    http://www.rcsb.org/pdb/files/1i5s.pdb "-"  "-"
    http://www.rcsb.org/pdb/files/2ncd.pdb "S"  "C"
    >
    
    
    > ##-- Q: Should we fill gaps and or coil positions with "-", " " or NA?
    
    
    >
    > ## Another example to compare to dssp.pdbs()
    > ## note that dssp.pdbs() will not work with online files like those above
    > sse <- dssp.pdbs(pdbs2)
    Error in dssp.pdbs(pdbs2) :
      http://www.rcsb.org/pdb/files/1bg2.pdb does not exist
    >
    > ## Fetch PDB files and split to chain A only PDB files
    > ids <- c("1a70_A", "1czp_A", "1frd_A", "1fxi_A", "1iue_A", "1pfd_A")
    > raw.files <- get.pdb(ids, path = "raw_pdbs")
    > files <- pdbsplit(raw.files, ids, path = "raw_pdbs/split_chain")
    
    > ## Sequence Alignment and SSE assignment
    > pdbs <- pdbaln(files)
    Reading PDB files:
    > ## from PDB files, note no SSE defined
    > pdbs3 <- read.fasta.pdb2(pdbs)
    > pdbs3$ali[,1:10]
                                    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    raw_pdbs/split_chain/1a70_A.pdb "A"  "A"  "Y"  "K"  "V"  "T"  "L"  "-"  "-"
    raw_pdbs/split_chain/1czp_A.pdb "A"  "T"  "F"  "K"  "V"  "T"  "L"  "I"  "N"
    raw_pdbs/split_chain/1frd_A.pdb "A"  "S"  "Y"  "Q"  "V"  "R"  "L"  "I"  "N"
    raw_pdbs/split_chain/1fxi_A.pdb "A"  "S"  "Y"  "K"  "V"  "T"  "L"  "-"  "-"
    raw_pdbs/split_chain/1iue_A.pdb "A"  "F"  "Y"  "N"  "I"  "T"  "L"  "-"  "-"
    raw_pdbs/split_chain/1pfd_A.pdb "A"  "T"  "Y"  "N"  "V"  "K"  "L"  "-"  "-"
                                    [,10]
    raw_pdbs/split_chain/1a70_A.pdb "V"
    raw_pdbs/split_chain/1czp_A.pdb "E"
    raw_pdbs/split_chain/1frd_A.pdb "K"
    raw_pdbs/split_chain/1fxi_A.pdb "K"
    raw_pdbs/split_chain/1iue_A.pdb "R"
    raw_pdbs/split_chain/1pfd_A.pdb "I"
    > pdbs3$sse[,1:10]
                                    [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
    raw_pdbs/split_chain/1a70_A.pdb " "  " "  " "  " "  " "  " "  " "  NA   NA
    raw_pdbs/split_chain/1czp_A.pdb " "  " "  " "  " "  " "  " "  " "  " "  " "
    raw_pdbs/split_chain/1frd_A.pdb " "  " "  " "  " "  " "  " "  " "  " "  " "
    raw_pdbs/split_chain/1fxi_A.pdb " "  " "  " "  " "  " "  " "  " "  NA   NA
    raw_pdbs/split_chain/1iue_A.pdb " "  " "  " "  " "  " "  " "  " "  NA   NA
    raw_pdbs/split_chain/1pfd_A.pdb " "  " "  " "  " "  " "  " "  " "  NA   NA
                                    [,10]
    raw_pdbs/split_chain/1a70_A.pdb " "
    raw_pdbs/split_chain/1czp_A.pdb " "
    raw_pdbs/split_chain/1frd_A.pdb " "
    raw_pdbs/split_chain/1fxi_A.pdb " "
    raw_pdbs/split_chain/1iue_A.pdb " "
    raw_pdbs/split_chain/1pfd_A.pdb " "
    >
    
    
    > sse<-dssp.pdbs(pdbs3)
       PDB has ALT records, taking A only, rm.alt=TRUE
    > sse[,1:10]
         [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    [1,] "-"  "E"  "E"  "E"  "E"  "E"  "E"  "-"  "-"  "E"
    [2,] "-"  "E"  "E"  "E"  "E"  "E"  "E"  "E"  "E"  "T"
    [3,] "-"  "E"  "E"  "E"  "E"  "E"  "E"  "E"  "E"  "T"
    [4,] "-"  "-"  "E"  "E"  "E"  "E"  "E"  "-"  "-"  "E"
    [5,] "-"  "E"  "E"  "E"  "E"  "E"  "E"  "-"  "-"  "E"
    [6,] "-"  "-"  "E"  "E"  "E"  "E"  "E"  "-"  "-"  "E"
    
  3. Lars Skjærven

    looks good! we should probably avoid "-" since we use that for gaps ? suggesting to keep it " ".

    implemented a quick dssp.xyz:

    > pdb <- read.pdb("2mda", multi=TRUE)
    > sse <- dssp.xyz(pdb$xyz, pdb)
    
    > sse[1:5, 1:10]
        [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
    sse " "  " "  " "  "T"  "T"  "S"  "S"  " "  " "  " "  
    sse " "  " "  " "  "S"  "S"  " "  " "  " "  " "  " "  
    sse " "  " "  " "  "S"  "S"  " "  " "  " "  " "  " "  
    sse " "  " "  " "  "S"  "S"  " "  "S"  "S"  " "  " "  
    sse " "  " "  " "  "S"  "S"  " "  " "  "S"  " "  " "  
    

    something like this you had in mind?

  4. Barry Grant reporter

    What about gaps vs coil with pdbs objects? At the moment the new read.fast.pdb() returns " " for coil (as above) and NA for gaps. Should gaps be "-" as in the sequence alignment. Or NA as in the residue number and chain alignment matrices.

    My current preference is for NA. So no more changes would be required.

  5. Lars Skjærven

    NA for gaps makes sense. I think it's important that we dont use "-" for something else than gaps.

  6. Barry Grant reporter

    Following on from Lars's comment above, which I agree with, then we need to change the output of dssp.pdbs().

    Currently it uses gap characters "-" throughout. It should be changed to use blank " " for residue present but with no assigned SSE type and use NA for missing/indel positions. This would then match the new read.fasta.pdb2() and dssp.xyz() functions. See examples below. I don't expect this change to cause problems elsewhere - just documenting here to keep folks posted. Let me know if you disagree with this change otherwise I will ahead and change/update read.fast.pdb() and dssp.pdbs().

    > aln  <- read.fasta( system.file("examples/kif1a.fa",package="bio3d") )
    > raw <- get.pdb( basename.pdb(aln$id), path="raw")
    > pdbs <- pdbaln( raw )
    > sse <- dssp(pdbs)
    
    > sse[,10:15]
         [,1] [,2] [,3] [,4] [,5] [,6]
    [1,] "-"  "-"  "-"  "-"  "-"  "-"
    [2,] "-"  "-"  "-"  "-"  "-"  "-"
    [3,] "-"  "-"  "-"  "-"  "-"  "-"
    [4,] "H"  "H"  "H"  "H"  "H"  "H"
    

    This should be more like the output of the new read.fast.pdb2() function, i.e

    > source("../read.fasta.pdb2.R") # in new_funs/ dir
    > pdbs2 <- read.fasta.pdb2(pdbs)
    > pdbs2$sse[,10:15]
    
                 [,1] [,2] [,3] [,4] [,5] [,6]
    raw/1bg2.pdb NA   NA   NA   NA   NA   NA
    raw/1i6i.pdb NA   NA   NA   NA   NA   NA
    raw/1i5s.pdb NA   NA   NA   NA   NA   NA
    raw/2ncd.pdb "H"  "H"  "H"  "H"  "H"  "H"
    

    Also we might want to add the ability for dssp.xyz() and dssp.pdb() to be able to read and process online files.

    > aln  <- read.fasta( system.file("examples/kif1a.fa",package="bio3d") )
    > pdbs3 <- read.fasta.pdb2(aln)
    > ss<-dssp(pdbs2)
    Error in dssp.pdbs(pdbs3) :
      http://www.rcsb.org/pdb/files/1bg2.pdb does not exist
    
  7. Barry Grant reporter

    ENHANCEMENT: Updates for dssp.pdbs()

    Output format changed from all "-" to " " and "NA" see issue #140 Can now also work with online PDB files and alignment/pdbs objects that contain only a portion of the full corresponding PDB structure.

    → <<cset 2591afa7ac5d>>

  8. Barry Grant reporter

    ENHANCEMENT: Updates for dssp.pdbs()

    Output format changed from all "-" to " " and "NA" see issue #140 Can now also work with online PDB files and alignment/pdbs objects that contain only a portion of the full corresponding PDB structure.

    → <<cset fd3f0bc902f0>>

  9. Log in to comment