Grantlab / bio3d / issues / #35 - Bug with Calcium being miss-recognized as a protein residue - aa2mass() seqpdb() read.pdb() atom.select()

Issue #35 resolved

Barry Grant created an issue 2013-10-09

The element type "CA" may also appears in non-protein residues - particular calcium ions (where resid is also "CA" offering a possible catch).

This will have the effect of causing "unknown residue" warnings and the mistake of including "X" in sequences. This will potentially cause confusion with seqaln() etc.

This bug will also be passed to aa2mass(), which will cause nma() to bum out. See below example.

Suggest implementing a fix for this in the atom.select() and read.pdb() functions, which are called by pdbseq() and aa2mass() respectively.

(Note that aa2mass() uses the following line "sequ <- pdb$atom[pdb$calpha, "resid"]" rather than going the atom.select() route. Regardless, we need a fix the $calpha in read.pdb().

Here is an example case:

pdb <- read.pdb("1lf0", het2atom=TRUE)
pdbseq(pdb)
aa2mass(pdb)

# Does not happen if het2atom=FALSE

Comments (10)

Xinqiu Yao
Right, it is a problem. Let me take care of it
- 2013-10-09T20:15:11+00:00
Xinqiu Yao
- assigned issue to
  
  Xinqiu Yao
- 2013-10-09T20:15:29+00:00

Barry Grant reporter

Added a fix to read.pdb() but decided that atom.select() is working as it should when selection is elety="CA". However, if selection is "calpha" then a check for "protein" should probably be made.

One option is to edit some of these lines:

# Old lines in switch function (line 120):
                 calpha = "//////CA/",
                 cbeta = "//////N,CA,C,O,CB/",
                 backbone = "//////N,CA,C,O/",
                 back = "//////N,CA,C,O/",

# To 
                 calpha = "////",paste(prot.aa, collapse=","), "//CA/",
                 cbeta = "////",paste(prot.aa, collapse=","), "//N,CA,C,O,CB/",
                 backbone = "////",paste(prot.aa, collapse=","), "//N,CA,C,O/",
                 back = "////",paste(prot.aa, collapse=","), "//N,CA,C,O/",

However, then we would miss counting non-standard amino acids that are not in "prot.aa" vector. This would suck so I decided not to change this for now.

Other suggestions are welcome...

2013-10-09T20:20:03+00:00

Lars Skjærven
hmm.. as long as the HETATM records are converted to ATOM, in the way they are now there is only a few solutions I guess: (1) specifically check the residue name for CA as you do in read.pdb, or (2) check the residue name length == 3. None of them are satisfactory because the underlying problem is the data structure of the modified PDB file.

A potential fix to this could be the inclusion of a flag for HETATM somehow. In fact, adding a new column could be considered in my opinion. i.e. pdb$atom[,"record"] would hold ATOM, HETATM etc...

A temporary solution for the pdbsplit, pdbaln, and, nma.pdbs flow of functions would be to add het2atom to the argument of pdbsplit (or alternatively ...) to avoid this horror:
```
> all.modes <- nma.pdbs(pdbs)
 Building Hessian...            Done in 0.175 seconds.
 Diagonalizing Hessian...       Done in 1.046 seconds.
Error in write.pdb(pdb = NULL, xyz = tmp.xyz, resno = resno, chain = chain,  : 
  write.pdb: 'xyz' coordinates must have no NA's.
```
- 2013-10-15T18:50:10+00:00
Barry Grant reporter
Quick question: Why do we need to explicitly hardwire the het2atom=T in our various read.pdb() calls prior to nma.pdbs() - its quite likely that we will not have the mass for these residues anyway.

I guess if there is a specific ligand that a user wants to include in an NMA calculation (e.g. a nucleotide) then it will be up to them to add the mass pick the course grained atom definition and potentially clean the input files. If the ligand contains a "CA" "elety" of course we will be back to this problem. However, perhaps its very rare and not worth overly stressing about until we see it occur in widespread use.
- 2013-10-15T19:41:42+00:00

Lars Skjærven

exactly, that's why I suggest changing:

`pdbsplit` <-
function(pdb.files, ids=NULL, path="split_chain", multi=FALSE) {

`pdbsplit` <-
function(pdb.files, ids=NULL, path="split_chain", multi=FALSE, ... ) {

where ... goes to read.pdb

So that line 18 changes from

pdb <- read.pdb(pdb.files[i], multi=multi, het2atom = FALSE, maxlines = -1)

pdb <- read.pdb(pdb.files[i], multi=multi, ... )

hmm?

2013-10-15T19:49:50+00:00

Barry Grant reporter
I see, makes sense now Lars - I didn't get that your "..." text was actually referencing /dots and not just blah blah.

We can certainly make this change to pdbsplit() rather easily but pdbaln() already has /dots that get passed to seqaln(). Do we have a snippet of code somewhere that checks on the expected input arguments for a particular function and matches those to /dots and warns if we have any left over after all internal functions have been run?

If we pass all /dots to a function that does not accept them will it not bum out?
- 2013-10-15T20:00:02+00:00
Lars Skjærven
Ok. I'll update pdbsplit (tomorrow). This can be done independently of pdbaln (right?)

As for multi \dots, I think we need to process them internally in pdbaln before sending them to read.pdb and seqaln. If not, it will collapse. I did this in nma in which \dots goes to two or three functions. something like this:
```
bh.names <- names(formals( build.hessian ))
am.names <- names(formals( aa2mass ))

dots <- list(...)
bh.args <- dots[names(dots) %in% bh.names]
am.args <- dots[names(dots) %in% am.names]   

masses <- do.call('aa2mass', c(list(pdb=sequ, inds=NULL), am.args))
```
- 2013-10-15T20:15:43+00:00
Barry Grant reporter
Great thanks. That code snippet is what I was looking for and remember seeing it in some of your code.
- 2013-10-15T20:42:16+00:00
Xinqiu Yao
- changed status to resolved
Solved!
- 2013-11-06T17:20:30+00:00
Log in to comment

Assignee: Xinqiu Yao

Type: bug

Priority: major

Status: resolved

Component: –

Version: –

Votes: 0

Watchers: 3