Issues with the new select approach

Issue #179 resolved
Lars Skjærven created an issue

hmm... a couple of bumps we need to address:

1. All CA-pdbs:

> file <- system.file("examples/hivp.pdb", package="bio3d")
> pdb = read.pdb(file)

> sum(pdb$calpha)
[1] 0

This is because the "calpha" string selection now selects all "protein" residues (i.e. all residues with N,CA,C,O) and intersects selection with elety=CA. The question is whether or not we should keep the pdb$calpha attribute. In my opinion calls to atom.select() whenever CA indices are needed is a safer choice.

2. ligand with N, CA, C, O atoms:

pdb = read.pdb("3DRC")
sele = atom.select(pdb, "protein")
tail(pdb$atom$resid[sele$atom])
[1] "MTX" "MTX" "MTX" "MTX" "MTX" "MTX"

MTX = METHOTREXATE, appears as free ligand in the PDB. I don't see any solution for this one, and also VMD assigns this ligand as "protein".

3. Missing backbone atoms

file <- system.file("examples/1dpx.pdb",package="bio3d")
pdb = read.pdb(file)
pdb

 Call:  read.pdb(file = file)

   Total Models#: 1
     Total Atoms#: 1177,  XYZs#: 3531  Chains#: 1  (values: A)

     Protein Atoms#: 992  (residues/Calpha atoms#: 128)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)

     Non-protein/nucleic Atoms#: 185  (residues: 180)
     Non-protein/nucleic resid values: [CL (2), HOH (177), LEU (1) ]


pdb$atom[atom.select(pdb, resno=129)$atom,1:7]

     type eleno elety  alt resid chain resno
1008 ATOM  1008     N <NA>   LEU     A   129
1009 ATOM  1009    CA <NA>   LEU     A   129
1010 ATOM  1010    CB <NA>   LEU     A   129
1011 ATOM  1011    CG <NA>   LEU     A   129
1012 ATOM  1012   CD1 <NA>   LEU     A   129
1013 ATOM  1013   CD2 <NA>   LEU     A   129

LEU-129 appears as non-protein residue since it's missing the C and O atoms. As above, there is no good solution except from resorting to the old-way...

Comments (29)

  1. Lars Skjærven reporter

    perhaps I should re-open the feature_select branch and re-set to the master branch, because this got more messy than I thought it would be.

  2. Barry Grant

    For your point 1, the pdb$calpha component returned from read.pdb() is actually taken from atom.select.pdb() so the problem is deeper than just getting rid of the, in my opinion, still useful pdb$calpha component.

    The fix for all three points is to change the is.protein() back to the old resid lookup approach I am afraid. We could have an option in your is.protein() related function to use.resid=TRUE (i.e. the old approach) and when FASLE to use the new backbone based approach. This would be useful only for us as we check on adding new resid entries to our list and go through bug reports etc. We may not want to allow this as a "..." option to atom.select.pdb()

    What do you think?

  3. Lars Skjærven reporter

    Ok. Merge is reverted and master should be the same as after Xinqiu's fix of clean.pdb/cat.pdb. I'm re-opening a new branch for the select stuff. Sorry for the mess!

    Resorting to the old style doesn't feel satisfying, but perhaps it's less issues with it. A combination of the two could also be tractable.

    For the calpha attribute, since we cannot determine the CA atoms unambiguously in either way, we might be better off just getting rid of it and leave it up to the user to identify the C-alpha atoms correctly. After all its just a call away..

    I'll prepare a new version and then we'll see

  4. Barry Grant

    Yeah that is a big shame, sorry!

    I think we fixed the CA with calcium problem with the old atom.select(). What is still ambiguous? I would prefer not to get rid of $calpha as I find it useful still.

    btw. Have you got a good example of the previous atom.select() messing up with alpha and perhaps other things that we can implement as a test?

  5. Lars Skjærven reporter

    It's mainly the non-standard residues names which have not been added to our aa.mass list (which also makes calpha ambiguous). Adding more and more is certainly possible though. Ok.. things doesn't look promising for the new approach :)

  6. Barry Grant

    I think the improvements and refinements that you have made throughout these function are great! Both approaches have their weaknesses and we now see them for the first time in practice, which again is tremendously useful for moving forward more robustly ;-)

  7. Xinqiu Yao

    Hi guys,

    I am trying to catch up with the issue. I think for either way we need a test function that can check what we already discussed for any future change. I am trying to list some of the things we have but please add more:

    • Non-protein 'CA' atom recognized as C-alpha (e.g. calcium ion, a PDB id?)
    • Missing 'CA' atoms of unknown amino acids (old way only. PDB id?)
    • Missing 'CA' atoms of amino acids with missing backbone atoms (new way only. 1dpx, example/hivp)
    • Non-protein residue recognized as 'protein' (e.g. MTX in '3DRC')
  8. Lars Skjærven reporter

    Hey, there are some examples already in the test for read.pdb(). For 'unknown' amino acids it will be hard to have a test since we should continously add it to the amino acid residue name table. (see data/aa.table.rda). is that an ok name by the way? since we use for more than just residue masses, I thought it was ok to rename it from aa.mass to aa.table.

  9. Xinqiu Yao

    I saw the lines in test-read.pdb but it is still very limited for atom.select() (there is no test on strings, "ligand", "nucleic", etc., for example). We probably need independent test function for each "essential" function like atom.select(). What do others think?

    For 'unknown' amino acids, I agree it is hard for testing and we leave it for now. The rename of 'aa.mass' to 'aa.table' is good to me, too!

  10. Barry Grant

    Yes we need a new and comprehensive test for this. For unknown residue checking we can change the residue name of one or more positions in an example pdb file that we already have in the package. That is we change to "TES" or something a certain number of positions within the test function and then check if atom.select() responds with indices that fit the number of changes made. Does that make sense?

  11. Xinqiu Yao

    That makes sense! Instead of using currently shipped PDB, I've created a "funny" pdb that contains various special conditions we've discussed here. For example, calcium, unknown AA, missing backbone heavy atom... (See here.) The PDB could be used for multiple purpose testing, e.g. atom.select(), read.pdb(), clean.pdb(), etc.

    I also added a test function for atom.select(). But, unfortunately test failed even on current version. It seems like the problem of "rm.insert" or "rm.alt" options in both read.pdb() and atom.select(). Will check it tomorrow.

    The test function is probably a start, and we could add more in future.

    Let me know what you think!

  12. Lars Skjærven reporter

    Nice. Thanks Xinqiu. Perhaps we can just add it directly to the feature_vmdselect branch? There the rm.alt and rm.insert shouldn't be a problem.

  13. Xinqiu Yao

    Yes, that's right! Probably we should merge master into feature_vmdselect? Do you think it will cause a problem?

  14. Lars Skjærven reporter

    I think only the test file should be merged. Note the conflicts which appeared after I did the revert.

  15. Xinqiu Yao

    I see... Probably I just copy the files into the branch. Meanwhile, I do a small change in master atom.select.pdb() to remove 'rm.insert' option, and then we can test on both branches. Let's see.

  16. Xinqiu Yao

    Okay, I did the test:

    master branch pass all tests.

    > test_file("bio3d/tests/testthat/test-atom.select.R")
    Testing atom.select function : ................................
    

    feature_vmdselect "failed" on selecting 'cbeta'

    > test_file("bio3d/tests/testthat/test-atom.select.R")
    Testing atom.select function : ...................1............
    
    
    1. Failure(@test-atom.select.R#67): atom.select() gets correct selections with various options 
    length(cb2.inds) not equal to 36
    Mean relative difference: 0.8611111
    

    I guess it is because the meaning of 'cbeta' is different between the two versions! On master it is 'N,CA,C,O,CB' but on feature_vmdselect it is 'CB'. Which version should we keep?

    Also, we probably should update Rd file to explain a little bit the meaning of 'strings', unless it is so apparent.

    What do you think?

  17. Lars Skjærven reporter

    ok. updated with

    cbeta       =  .is.protein(pdb)  & .match.elety(pdb, c("CA", "N", "C", "O", "CB")),
    

    Also added a note in atom.select.Rd specifying what "cbeta" selects. Thanks !

  18. Xinqiu Yao

    Thanks! It helps a lot, and it passes all tests now.

    Should we keep the branch a longer time for more testing?

  19. Xinqiu Yao

    Yes, I agree. I actually like the new version of 'atom.select' a lot, which is many folds faster!

  20. Lars Skjærven reporter

    sorry for the messy merge guys. lots of files didn't make it from the feature_vmdselect branch probably due to my previous revert on the master branch. Now they are merged though.

    R CMD check passes. bleh...

  21. Barry Grant

    Can we also pass as CRAN (maybe using Xinqiu's scripts)? If so it would be good to update soon with all these nice improvements and bug fixes.

  22. Barry Grant

    I am marking this as resolved as we recent issues reported on master with the new atom.select() approach and the 'R CMD test' results look good.

  23. Log in to comment