Suggestions for cat.pdb()

Issue #171 resolved
Xinqiu Yao created an issue

Hi,

I have some suggestions for updating the function cat.pdb() and would like to discuss with you guys.

  • Currently it requires at least two pdb objects as input. Can we modify to allow one pdb input also (simply return the pdb)? This is a trivial change but could be helpful to tidy up codes in scripts in which we usually don't know how many inputs there are (e.g. in my new function pdb.biounit.R). Otherwise we need to check the number of pdb objects and so additional lines of codes.
pdb = read.pdb("1tag")
cat.pdb(pdb)
#Error in cat.pdb(pdb) : provide multiple (>1) PDB objects
  • It removes SSE annotations completely. Can we work on SSE concatenation also something like a reverse operation of trim.pdb()? This will be useful if we later use the result for SSE annotation with e.g. plot.bio3d(). For example, with the new function pdb.biounit() we can build the normal hemoglobin tetramer based on a dimer pdb coordinates. The identical SSE annotation should repeat crossing new copies.
pdb <- read.pdb("1tag")
cat.pdb(pdb, pdb)$helix
#NULL
  • It seems doesn't work for concatenating purely non-protein pdb objects. For example,
pdb <- read.pdb("1tag")
pdb1 <- trim.pdb(pdb, "protein")
pdb2 <- trim.pdb(pdb, "ligand")
cat.pdb(pdb1, pdb2)
#Error in xyz[i, 1] : subscript out of bounds

Is it a bug or a design? Can we change to allow concatenate any atom records?

Let me know what you think. I can also help to update it if all/some of these suggestions are got approved. Thanks a lot!

Comments (18)

  1. Barry Grant

    My reply to all those questions is 'yes' and for the last one it looks like a bug.

    Q1. should return the same PDB object

    Q2. we should combine SSE also but be careful about chain ids and residue numbers (perhaps by default forcing different chain ids - but hone again this might be a bad idea)

    Q3. This is a bug to be squished.

  2. Xinqiu Yao reporter

    Thanks! Probably a call of clean.pdb() to check e.g. chain breaks, residue numbers, etc. in Q2.

  3. Lars Skjærven

    Good points Xinqiu.

    Q1: should return the same PDB object, IF arguments renumber=FALSE, rechain=FALSE. no?

    Q3: Bug was related to inspect.connectivity() and is now fixed (94b09cf).

    Thanks !

  4. Xinqiu Yao reporter

    Thanks for the fixing! Right, for Q1 we want it return the same PDB object if both renumber and rechain are FALSE (Should they be default value?).

  5. Xinqiu Yao reporter

    Updated with this commit. The function returns the same PDB if provided one pdb object and can deal with SSE concatenation for multiple pdb input. The new function clean.pdb() is called to do the renumbering task. Let me know if you find any problem!

    # One pdb, just return it
    pdb1 <- read.pdb("1etl")
    cat.pdb(pdb1)
    
     Call:  cat.pdb(pdb1)
    
       Total Models#: 1
         Total Atoms#: 160,  XYZs#: 480  Chains#: 1  (values: A)
    
         Protein Atoms#: 138  (residues/Calpha atoms#: 12)
         Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
    
         Non-protein/nucleic Atoms#: 22  (residues: 14)
         Non-protein/nucleic resid values: [HOH (13), MPR (1) ]
    
       Protein sequence:
          CELCCNPGCAGC
    
    + attr: atom, helix, sheet, seqres, xyz,
            calpha, call, clean.log
    
    # Three, concatenate all components
    pdb2 <- read.pdb("1hel")
    new.pdb <- cat.pdb(pdb1, pdb2, pdb1, rechain=TRUE, renumber=TRUE)
    new.pdb
    
    
     Call:  cat.pdb(pdb1, pdb2, pdb1, renumber = TRUE, rechain = TRUE)
    
       Total Models#: 1
         Total Atoms#: 1506,  XYZs#: 4518  Chains#: 3  (values: A B C)
    
         Protein Atoms#: 1277  (residues/Calpha atoms#: 153)
         Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
    
         Non-protein/nucleic Atoms#: 229  (residues: 213)
         Non-protein/nucleic resid values: [HOH (211), MPR (2) ]
    
       Protein sequence:
          CELCCNPGCAGCKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTD
          GSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAW
          VAWRNRCKGTDVQAWIRGCRLCELCCNPGCAGC
    
    + attr: atom, helix, sheet, seqres, xyz,
            calpha, call, clean.log
    
    new.pdb$helix
    $start
    
      4  25  79  88 104 109 120 
    
    $end
    
     16  34  84 100 108 114 124 
    
    $chain
    [1] "B" "B" "B" "B" "B" "B" "B"
    
    $type
    [1] "5" "1" "1" "5" "1" "5" "1" "5" "5"
    
    new.pdb$clean.log
    [1] "pdb$calpha                               -> UPDATED     "
    [2] "force.renumber = TRUE                    -> RENUMBERED  "
    [3] "SSE annotation                           -> UPDATED     "
    [4] "PDB is clean!                           "                
    
  6. Lars Skjærven

    hmm.. I get errors when concatenating your example pdb files:

    > pdb1 <- read.pdb("1etl")
    > pdb2 <- read.pdb("1hel")
    > new.pdb <- cat.pdb(pdb1, pdb2)
    Error in .check.residue.ambiguity(pdb) : 
      .check.residue.ambiguity(): Found ambiguous residue labeling
    Error in cat.pdb(pdb1, pdb2) : 
      cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE
    
    > new.pdb <- cat.pdb(pdb1, pdb2, renumber=T, rechain=T)
    Error in rep(pdb$helix$type, (pdb$helix$end - pdb$helix$start + 1)) : 
      invalid 'times' argument
    

    I like the clean.log attribute, but could we call it just log? An perhaps we can use a data frame with two columns with names data and action (or something in that direction)?

  7. Xinqiu Yao reporter

    Did you update with the latest commit? I tried again and it works fine with options renumber=T and rechain=T.

    Thanks for the comments on clean.log! I like your suggestions. Let's put it to To-do list. I will also be very happy if you guys can help test the clean.pdb() function with your favorite PDBs, since we could extensively use it in many other functions if it works properly.

  8. Lars Skjærven

    I have done a 'git pull' yes. I tried on my mac as well just now and I get the same error for the code

    pdb1 <- read.pdb("1etl")
    pdb2 <- read.pdb("1hel")
    new.pdb <- cat.pdb(pdb1, pdb2)
    

    hmmm

  9. Lars Skjærven

    ah.. this error is actually intended. perhaps we could issue a warning instead ? using rechain=T and/or renumber=T works. thanks !

    I'll check out clean.pdb more thoroughly.

  10. Barry Grant

    I get the same errors with this PDB (but not with rechain=T)

    > pdb1 <- read.pdb("1etl")
    
    > cat.pdb(pdb1,pdb1)
    Error in .check.residue.ambiguity(pdb) :
      .check.residue.ambiguity(): Found ambiguous residue labeling
    Error in cat.pdb(pdb1, pdb1) :
      cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE
    
    >
    >
    > cat.pdb(pdb1,pdb1, rechain=T)
    
     Call:  cat.pdb(pdb1, pdb1, rechain = T)
    
       Total Models#: 1
         Total Atoms#: 320,  XYZs#: 960  Chains#: 2  (values: A B)
    
         Protein Atoms#: 276  (residues/Calpha atoms#: 24)
         Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
    
         Non-protein/nucleic Atoms#: 44  (residues: 28)
         Non-protein/nucleic resid values: [HOH (26), MPR (2) ]
    
       Protein sequence:
          CELCCNPGCAGCCELCCNPGCAGC
    
    + attr: atom, helix, seqres, xyz, calpha,
            call, sheet, clean.log
    
    >
    >
    > cat.pdb(pdb1,pdb1, renumber=T)
    Error in .check.residue.ambiguity(pdb) :
      .check.residue.ambiguity(): Found ambiguous residue labeling
    Error in cat.pdb(pdb1, pdb1, renumber = T) :
      cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE
    
  11. Xinqiu Yao reporter

    Yes. These errors are intended because otherwise a bad formatted PDB will be generated (i.e. two different chains share the same chain IDs and so confusing on residue labeling and SSE annotation possibly). Should we still proceed but shoot warnings? If so, we could keep the original version that remove SSE but just concatenate atom and xyz records, because clean.pdb() will reject on confused residue labeling :-P

  12. Xinqiu Yao reporter

    Okay, updated with the latest commit https://bitbucket.org/Grantlab/bio3d/commits/269c62f26f7e643c3678cca323370fb7f5081712

    cat.pdb(pdb1, pdb2)
     Call:  cat.pdb(pdb1, pdb2)
    
       Total Models#: 1
         Total Atoms#: 1346,  XYZs#: 4038  Chains#: 1  (values: A)
    
         Protein Atoms#: 1139  (residues/Calpha atoms#: 12)
         Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
    
         Non-protein/nucleic Atoms#: 207  (residues: 199)
         Non-protein/nucleic resid values: [HOH (198), MPR (1) ]
    
       Protein sequence:
          CELCCNPGCAGCKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTD
          GSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAW
          VAWRNRCKGTDVQAWIRGCRL
    
    + attr: atom, seqres, xyz, calpha, call
    Warning messages:
    1: In cat.pdb(pdb1, pdb2) :
      cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE
    2: In cat.pdb(pdb1, pdb2) : possible chain break in molecule: chain A
    
  13. Lars Skjærven

    Looks good.

    I would perhaps consider rechain=TRUE as default to avoid the warning. In most practical cases you would need to re-assign the chain identifier to avoid duplicate residues.

  14. Log in to comment