Problems with "rm.insert" option

Issue #166 resolved
Xinqiu Yao created an issue

Hi,

I've run into a problem when I tried to call pdbfit() on a 'pdbs' containing PDB files with "insert" residues:

pdbfit(pdbs, outpath = "fitlsq")
#Error in rot.lsq(xx = pdb.xyz, yy = fixed, xfit = inds, yfit = fixed.inds) : 
#  dimension mismatch in x and y

I found that the error actually occurred in fit.xyz(), the lines of codes:

              string <- paste("//",chains,"/",
                              paste(res.resno, collapse = ","),
                              "///CA/", sep="")
              inds <- atom.select(pdb, string,
                                  verbose=verbose ,rm.insert=TRUE)$xyz

By default, fit.xyz() removes "insert" residues (rm.insert=TRUE), but pdbs contains those residues, and so mismatch here.

The simplest solution for me is to renumber the PDB residues with e.g. convert.pdb(). However, I am wondering if there should be a general solution to deal with similar situations. For example, can we locate residues by a combination of 'chain id', 'resno', and 'insert' code? (In pdbs, we have an extra component of "insert"). Alternatively, we define resno as character instead of number, and so we can have e.g. "100A" and "100" as different residues. The second will change the format of 'pdb' data.frame affecting many related functions, and probably not a good choice.

Any idea? Do you think it is worth doing the change?

Comments (12)

  1. Barry Grant

    If your pdbs object was created with 'rm.insert=FALSE' then fit.xyz() should respect this and give consistent output. Treating resno as a character is a bad idea.

    I suspect that we don't need to include alternate atoms in the selection indices used within fit.xyz() but do we want their moved coordinated returned in xyz? Why would you actually want this?

    Regardless perhaps we should add an option to atom.select() to drop alternates if they exist in the intersection of the selection string. (i.e. add a rm.insert=TRUE option to atom.select())?

    We can then use this within fit.xyz() although this may be just 'kicking the can down the road' and not actually helping to solve your issue as I don't follow what you want to do eventually with these alternate coords.

  2. Xinqiu Yao reporter

    They are actually not alternate coordinates, which are labeled by another field alt. These residues are consecutive on the chain and remove them will break the structure. These "misleading" residue numbering is mostly introduced by the original PDB authors. For example, the 1F1J and 1KMC annotate differently for the same fragment "...NFE..."

    #1F1J
    ATOM     50  N   ASN A  63      72.020  35.607   5.158  1.00 47.56           N
    ATOM     51  CA  ASN A  63      72.973  36.522   4.543  1.00 51.28           C
    ATOM     52  C   ASN A  63      73.281  36.191   3.083  1.00 49.20           C
    ATOM     53  O   ASN A  63      74.249  35.490   2.772  1.00 50.08           O
    ATOM     54  CB  ASN A  63      74.262  36.592   5.367  1.00 58.31           C
    ATOM     55  CG  ASN A  63      75.293  37.544   4.767  1.00 65.94           C
    ATOM     56  OD1 ASN A  63      76.437  37.153   4.504  1.00 73.08           O
    ATOM     57  ND2 ASN A  63      74.895  38.800   4.549  1.00 65.38           N
    ATOM     58  N   PHE A  64      72.417  36.678   2.202  1.00 45.79           N
    ATOM     59  CA  PHE A  64      72.536  36.484   0.766  1.00 43.00           C
    ATOM     60  C   PHE A  64      71.951  37.729   0.160  1.00 43.20           C
    ATOM     61  O   PHE A  64      71.117  38.375   0.791  1.00 46.27           O
    ATOM     62  CB  PHE A  64      71.741  35.265   0.347  1.00 38.99           C
    ATOM     63  CG  PHE A  64      72.256  34.015   0.945  1.00 38.01           C
    ATOM     64  CD1 PHE A  64      73.303  33.338   0.350  1.00 38.50           C
    ATOM     65  CD2 PHE A  64      71.761  33.554   2.153  1.00 39.43           C
    ATOM     66  CE1 PHE A  64      73.858  32.217   0.955  1.00 44.12           C
    ATOM     67  CE2 PHE A  64      72.311  32.437   2.764  1.00 39.41           C
    ATOM     68  CZ  PHE A  64      73.360  31.767   2.167  1.00 39.76           C
    ATOM     69  N   GLU A  65      72.392  38.098  -1.037  1.00 42.81           N
    ATOM     70  CA  GLU A  65      71.868  39.314  -1.649  1.00 41.81           C
    ATOM     71  C   GLU A  65      70.398  39.183  -2.023  1.00 39.86           C
    ATOM     72  O   GLU A  65      69.705  40.183  -2.224  1.00 38.62           O
    ATOM     73  CB  GLU A  65      72.709  39.738  -2.855  1.00 46.32           C
    ATOM     74  CG  GLU A  65      72.826  38.699  -3.942  1.00 52.04           C
    ATOM     75  CD  GLU A  65      73.293  39.292  -5.252  1.00 53.57           C
    ATOM     76  OE1 GLU A  65      74.464  39.722  -5.317  1.00 56.94           O
    ATOM     77  OE2 GLU A  65      72.485  39.325  -6.210  1.00 57.38           O
    
    #1KMC
    ATOM     57  N   ASN A 156       4.104  51.410  29.177  1.00 56.80           N
    ATOM     58  CA  ASN A 156       3.715  51.408  27.777  1.00 56.56           C
    ATOM     59  C   ASN A 156       2.680  52.484  27.465  1.00 55.22           C
    ATOM     60  O   ASN A 156       1.476  52.220  27.437  1.00 57.02           O
    ATOM     61  CB  ASN A 156       3.183  50.028  27.389  1.00 61.20           C
    ATOM     62  CG  ASN A 156       3.231  49.791  25.902  1.00 65.84           C
    ATOM     63  OD1 ASN A 156       2.434  50.346  25.148  1.00 73.14           O
    ATOM     64  ND2 ASN A 156       4.184  48.975  25.464  1.00 69.03           N
    ATOM     65  N   PHE A 156A      3.167  53.703  27.248  1.00 50.78           N
    ATOM     66  CA  PHE A 156A      2.318  54.848  26.925  1.00 49.95           C
    ATOM     67  C   PHE A 156A      2.974  55.716  25.851  1.00 53.17           C
    ATOM     68  O   PHE A 156A      4.199  55.682  25.661  1.00 48.43           O
    ATOM     69  CB  PHE A 156A      2.058  55.720  28.157  1.00 42.84           C
    ATOM     70  CG  PHE A 156A      1.248  55.051  29.223  1.00 37.73           C
    ATOM     71  CD1 PHE A 156A      1.818  54.068  30.042  1.00 39.79           C
    ATOM     72  CD2 PHE A 156A     -0.078  55.426  29.439  1.00 29.45           C
    ATOM     73  CE1 PHE A 156A      1.079  53.463  31.063  1.00 28.96           C
    ATOM     74  CE2 PHE A 156A     -0.833  54.833  30.454  1.00 28.17           C
    ATOM     75  CZ  PHE A 156A     -0.252  53.849  31.274  1.00 30.01           C
    ATOM     76  N   GLU A 161       2.150  56.503  25.165  1.00 51.49           N
    ATOM     77  CA  GLU A 161       2.626  57.380  24.110  1.00 53.58           C
    ATOM     78  C   GLU A 161       3.809  58.225  24.599  1.00 59.31           C
    ATOM     79  O   GLU A 161       4.795  58.409  23.879  1.00 60.90           O
    ATOM     80  CB  GLU A 161       1.472  58.262  23.649  1.00 56.40           C
    ATOM     81  CG  GLU A 161       1.737  59.064  22.407  0.00 63.60           C
    ATOM     82  CD  GLU A 161       0.459  59.628  21.831  0.00 67.17           C
    ATOM     83  OE1 GLU A 161      -0.241  60.371  22.550  0.00 69.29           O
    ATOM     84  OE2 GLU A 161       0.152  59.319  20.661  0.00 69.88           O
    
  3. Barry Grant

    I see, that is different and I guess what we really want is a 'clean.pdb()' option in the future that will somehow sensibly catch and deal with/flag these types of inconsistencies?

  4. Xinqiu Yao reporter

    That is a good idea! What about a clean.pdb option in read.pdb(). If TRUE, we remove 'alt' coords and renumber residue for 'insert'. Meanwhile, we add a flag (e.g. 'clean') in the 'pdb' object to indicate if the pdb is modified from the raw file. We can have the same flag in 'pdbs' object. In related functions, we check the flag and do appropriate operations.

  5. Barry Grant

    Maybe a separate function would be best to start with. This function could then be called within read.pdbs() and pdbfit() etc. with a new flag. I am just cautious about adding to much bloat to read.pdb() before testing things somewhat extensively. What do you think?

  6. Xinqiu Yao reporter

    Sure. I can start with the new function and see how it works. It basically a mini version of 'convert.pdb()' I guess.

  7. Barry Grant

    Sounds good, perhaps we could refine and speed-up convert.pdb() somewhat and then just call it in your new function.

  8. Xinqiu Yao reporter

    Okay. I'll take care of the two. For "cleaning", here are what I am thinking about:

    1. remove 'alt'
    2. renumber residues and atoms if any 'insert' is not NA. Renumbering can be done consecutively cross all chains or restart within each chain, controlled by the argument consecutive (as in the convert.pdb()).
    3. a force.renumber option to do renumbering even if all 'insert' are NA.
    4. map non-standard AA to equivalent standard AA name (this is optional)
    5. add a 'clean=TRUE' flag to the 'pdb' object

    Anything else?

  9. Barry Grant

    Maybe also add a chain record if non present. Plus a detection and print out of possible chain breaks and the option to re-label chains accordingly? Remove water and/or ligands option. Others we can add later.

  10. Xinqiu Yao reporter

    Thanks! The chain record is definitely should be taken into account also. I just wondering if the current 'chain.pdb()' is sufficient to distinguish the breaks of distinct chains from those caused by 'missing residues'. Or, maybe we can tell missing residues from other information like the REMARK in the new version of 'pdb' object format we discussed elsewhere. What do you think?

  11. Barry Grant

    I suggest using the chain.pdb() approach. The remark records are not always reliable in this respect.

  12. Log in to comment