Suggestions for cat.pdb()

Issue #171 resolved

Xinqiu Yao created an issue 2015-01-19

Hi,

I have some suggestions for updating the function cat.pdb() and would like to discuss with you guys.

Currently it requires at least two pdb objects as input. Can we modify to allow one pdb input also (simply return the pdb)? This is a trivial change but could be helpful to tidy up codes in scripts in which we usually don't know how many inputs there are (e.g. in my new function pdb.biounit.R). Otherwise we need to check the number of pdb objects and so additional lines of codes.

pdb = read.pdb("1tag")
cat.pdb(pdb)
#Error in cat.pdb(pdb) : provide multiple (>1) PDB objects

It removes SSE annotations completely. Can we work on SSE concatenation also something like a reverse operation of trim.pdb()? This will be useful if we later use the result for SSE annotation with e.g. plot.bio3d(). For example, with the new function pdb.biounit() we can build the normal hemoglobin tetramer based on a dimer pdb coordinates. The identical SSE annotation should repeat crossing new copies.

pdb <- read.pdb("1tag")
cat.pdb(pdb, pdb)$helix
#NULL

It seems doesn't work for concatenating purely non-protein pdb objects. For example,

pdb <- read.pdb("1tag")
pdb1 <- trim.pdb(pdb, "protein")
pdb2 <- trim.pdb(pdb, "ligand")
cat.pdb(pdb1, pdb2)
#Error in xyz[i, 1] : subscript out of bounds

Is it a bug or a design? Can we change to allow concatenate any atom records?

Let me know what you think. I can also help to update it if all/some of these suggestions are got approved. Thanks a lot!

Comments (18)

Barry Grant
My reply to all those questions is 'yes' and for the last one it looks like a bug.

Q1. should return the same PDB object

Q2. we should combine SSE also but be careful about chain ids and residue numbers (perhaps by default forcing different chain ids - but hone again this might be a bad idea)

Q3. This is a bug to be squished.
- 2015-01-19T19:17:39+00:00
Xinqiu Yao reporter
Thanks! Probably a call of clean.pdb() to check e.g. chain breaks, residue numbers, etc. in Q2.
- 2015-01-19T20:02:48+00:00
Lars Skjærven
Good points Xinqiu.

Q1: should return the same PDB object, IF arguments renumber=FALSE, rechain=FALSE. no?

Q3: Bug was related to inspect.connectivity() and is now fixed (94b09cf).

Thanks !
- 2015-01-20T09:02:30+00:00
Xinqiu Yao reporter
Thanks for the fixing! Right, for Q1 we want it return the same PDB object if both renumber and rechain are FALSE (Should they be default value?).
- 2015-01-20T16:44:35+00:00
Barry Grant
Yes, I believe so.
- 2015-01-20T16:57:19+00:00
Lars Skjærven
FALSE as default sounds good
- 2015-01-20T16:58:02+00:00

Xinqiu Yao reporter

Updated with this commit. The function returns the same PDB if provided one pdb object and can deal with SSE concatenation for multiple pdb input. The new function clean.pdb() is called to do the renumbering task. Let me know if you find any problem!

# One pdb, just return it
pdb1 <- read.pdb("1etl")
cat.pdb(pdb1)

 Call:  cat.pdb(pdb1)

   Total Models#: 1
     Total Atoms#: 160,  XYZs#: 480  Chains#: 1  (values: A)

     Protein Atoms#: 138  (residues/Calpha atoms#: 12)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 22  (residues: 14)
     Non-protein/nucleic resid values: [HOH (13), MPR (1) ]

   Protein sequence:
      CELCCNPGCAGC

+ attr: atom, helix, sheet, seqres, xyz,
        calpha, call, clean.log

# Three, concatenate all components
pdb2 <- read.pdb("1hel")
new.pdb <- cat.pdb(pdb1, pdb2, pdb1, rechain=TRUE, renumber=TRUE)
new.pdb


 Call:  cat.pdb(pdb1, pdb2, pdb1, renumber = TRUE, rechain = TRUE)

   Total Models#: 1
     Total Atoms#: 1506,  XYZs#: 4518  Chains#: 3  (values: A B C)

     Protein Atoms#: 1277  (residues/Calpha atoms#: 153)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 229  (residues: 213)
     Non-protein/nucleic resid values: [HOH (211), MPR (2) ]

   Protein sequence:
      CELCCNPGCAGCKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTD
      GSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAW
      VAWRNRCKGTDVQAWIRGCRLCELCCNPGCAGC

+ attr: atom, helix, sheet, seqres, xyz,
        calpha, call, clean.log

new.pdb$helix
$start

  4  25  79  88 104 109 120 

$end

 16  34  84 100 108 114 124 

$chain
[1] "B" "B" "B" "B" "B" "B" "B"

$type
[1] "5" "1" "1" "5" "1" "5" "1" "5" "5"

new.pdb$clean.log
[1] "pdb$calpha                               -> UPDATED     "
[2] "force.renumber = TRUE                    -> RENUMBERED  "
[3] "SSE annotation                           -> UPDATED     "
[4] "PDB is clean!                           "

2015-01-24T23:55:39+00:00

Lars Skjærven

hmm.. I get errors when concatenating your example pdb files:

> pdb1 <- read.pdb("1etl")
> pdb2 <- read.pdb("1hel")
> new.pdb <- cat.pdb(pdb1, pdb2)
Error in .check.residue.ambiguity(pdb) : 
  .check.residue.ambiguity(): Found ambiguous residue labeling
Error in cat.pdb(pdb1, pdb2) : 
  cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE

> new.pdb <- cat.pdb(pdb1, pdb2, renumber=T, rechain=T)
Error in rep(pdb$helix$type, (pdb$helix$end - pdb$helix$start + 1)) : 
  invalid 'times' argument

I like the clean.log attribute, but could we call it just log? An perhaps we can use a data frame with two columns with names data and action (or something in that direction)?

2015-01-26T09:07:19+00:00

Xinqiu Yao reporter
Did you update with the latest commit? I tried again and it works fine with options renumber=T and rechain=T.

Thanks for the comments on clean.log! I like your suggestions. Let's put it to To-do list. I will also be very happy if you guys can help test the clean.pdb() function with your favorite PDBs, since we could extensively use it in many other functions if it works properly.
- 2015-01-26T14:57:47+00:00
Lars Skjærven
I have done a 'git pull' yes. I tried on my mac as well just now and I get the same error for the code
```
pdb1 <- read.pdb("1etl")
pdb2 <- read.pdb("1hel")
new.pdb <- cat.pdb(pdb1, pdb2)
```
hmmm
- 2015-01-26T15:05:00+00:00
Lars Skjærven
ah.. this error is actually intended. perhaps we could issue a warning instead ? using rechain=T and/or renumber=T works. thanks !

I'll check out clean.pdb more thoroughly.
- 2015-01-26T15:11:13+00:00

Barry Grant

I get the same errors with this PDB (but not with rechain=T)

> pdb1 <- read.pdb("1etl")

> cat.pdb(pdb1,pdb1)
Error in .check.residue.ambiguity(pdb) :
  .check.residue.ambiguity(): Found ambiguous residue labeling
Error in cat.pdb(pdb1, pdb1) :
  cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE

>
>
> cat.pdb(pdb1,pdb1, rechain=T)

 Call:  cat.pdb(pdb1, pdb1, rechain = T)

   Total Models#: 1
     Total Atoms#: 320,  XYZs#: 960  Chains#: 2  (values: A B)

     Protein Atoms#: 276  (residues/Calpha atoms#: 24)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 44  (residues: 28)
     Non-protein/nucleic resid values: [HOH (26), MPR (2) ]

   Protein sequence:
      CELCCNPGCAGCCELCCNPGCAGC

+ attr: atom, helix, seqres, xyz, calpha,
        call, sheet, clean.log

>
>
> cat.pdb(pdb1,pdb1, renumber=T)
Error in .check.residue.ambiguity(pdb) :
  .check.residue.ambiguity(): Found ambiguous residue labeling
Error in cat.pdb(pdb1, pdb1, renumber = T) :
  cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE

2015-01-26T15:11:30+00:00

Xinqiu Yao reporter
Yes. These errors are intended because otherwise a bad formatted PDB will be generated (i.e. two different chains share the same chain IDs and so confusing on residue labeling and SSE annotation possibly). Should we still proceed but shoot warnings? If so, we could keep the original version that remove SSE but just concatenate atom and xyz records, because clean.pdb() will reject on confused residue labeling :-P
- 2015-01-26T15:16:34+00:00
Barry Grant
I think a warning and a stripping of SSE records is appropriate.
- 2015-01-26T15:32:36+00:00

Xinqiu Yao reporter

Okay, updated with the latest commit https://bitbucket.org/Grantlab/bio3d/commits/269c62f26f7e643c3678cca323370fb7f5081712

cat.pdb(pdb1, pdb2)
 Call:  cat.pdb(pdb1, pdb2)

   Total Models#: 1
     Total Atoms#: 1346,  XYZs#: 4038  Chains#: 1  (values: A)

     Protein Atoms#: 1139  (residues/Calpha atoms#: 12)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 207  (residues: 199)
     Non-protein/nucleic resid values: [HOH (198), MPR (1) ]

   Protein sequence:
      CELCCNPGCAGCKVFGRCELAAAMKRHGLDNYRGYSLGNWVCAAKFESNFNTQATNRNTD
      GSTDYGILQINSRWWCNDGRTPGSRNLCNIPCSALLSSDITASVNCAKKIVSDGNGMNAW
      VAWRNRCKGTDVQAWIRGCRL

+ attr: atom, seqres, xyz, calpha, call
Warning messages:
1: In cat.pdb(pdb1, pdb2) :
  cat.pdb(): Bad format pdb generated. Try rechain=TRUE and/or renumber=TRUE
2: In cat.pdb(pdb1, pdb2) : possible chain break in molecule: chain A

2015-01-26T16:10:38+00:00

Lars Skjærven
Looks good.

I would perhaps consider rechain=TRUE as default to avoid the warning. In most practical cases you would need to re-assign the chain identifier to avoid duplicate residues.
- 2015-01-27T18:51:30+00:00
Xinqiu Yao reporter
Yes. rechain=TRUE by default makes sense. Thanks for suggesting!
- 2015-01-27T20:39:09+00:00
Lars Skjærven
- changed status to resolved
- 2015-01-28T08:35:33+00:00
Log in to comment

Assignee: –

Type: proposal

Priority: minor

Status: resolved

Component: –

Version: –

Votes: 0

Watchers: 3