Problems with "rm.insert" option
Hi,
I've run into a problem when I tried to call pdbfit() on a 'pdbs' containing PDB files with "insert" residues:
pdbfit(pdbs, outpath = "fitlsq")
#Error in rot.lsq(xx = pdb.xyz, yy = fixed, xfit = inds, yfit = fixed.inds) :
# dimension mismatch in x and y
I found that the error actually occurred in fit.xyz(), the lines of codes:
string <- paste("//",chains,"/",
paste(res.resno, collapse = ","),
"///CA/", sep="")
inds <- atom.select(pdb, string,
verbose=verbose ,rm.insert=TRUE)$xyz
By default, fit.xyz() removes "insert" residues (rm.insert=TRUE), but pdbs contains those residues, and so mismatch here.
The simplest solution for me is to renumber the PDB residues with e.g. convert.pdb(). However, I am wondering if there should be a general solution to deal with similar situations. For example, can we locate residues by a combination of 'chain id', 'resno', and 'insert' code? (In pdbs, we have an extra component of "insert"). Alternatively, we define resno as character instead of number, and so we can have e.g. "100A" and "100" as different residues. The second will change the format of 'pdb' data.frame affecting many related functions, and probably not a good choice.
Any idea? Do you think it is worth doing the change?
Comments (12)
-
-
reporter They are actually not alternate coordinates, which are labeled by another field alt. These residues are consecutive on the chain and remove them will break the structure. These "misleading" residue numbering is mostly introduced by the original PDB authors. For example, the 1F1J and 1KMC annotate differently for the same fragment "...NFE..."
#1F1J ATOM 50 N ASN A 63 72.020 35.607 5.158 1.00 47.56 N ATOM 51 CA ASN A 63 72.973 36.522 4.543 1.00 51.28 C ATOM 52 C ASN A 63 73.281 36.191 3.083 1.00 49.20 C ATOM 53 O ASN A 63 74.249 35.490 2.772 1.00 50.08 O ATOM 54 CB ASN A 63 74.262 36.592 5.367 1.00 58.31 C ATOM 55 CG ASN A 63 75.293 37.544 4.767 1.00 65.94 C ATOM 56 OD1 ASN A 63 76.437 37.153 4.504 1.00 73.08 O ATOM 57 ND2 ASN A 63 74.895 38.800 4.549 1.00 65.38 N ATOM 58 N PHE A 64 72.417 36.678 2.202 1.00 45.79 N ATOM 59 CA PHE A 64 72.536 36.484 0.766 1.00 43.00 C ATOM 60 C PHE A 64 71.951 37.729 0.160 1.00 43.20 C ATOM 61 O PHE A 64 71.117 38.375 0.791 1.00 46.27 O ATOM 62 CB PHE A 64 71.741 35.265 0.347 1.00 38.99 C ATOM 63 CG PHE A 64 72.256 34.015 0.945 1.00 38.01 C ATOM 64 CD1 PHE A 64 73.303 33.338 0.350 1.00 38.50 C ATOM 65 CD2 PHE A 64 71.761 33.554 2.153 1.00 39.43 C ATOM 66 CE1 PHE A 64 73.858 32.217 0.955 1.00 44.12 C ATOM 67 CE2 PHE A 64 72.311 32.437 2.764 1.00 39.41 C ATOM 68 CZ PHE A 64 73.360 31.767 2.167 1.00 39.76 C ATOM 69 N GLU A 65 72.392 38.098 -1.037 1.00 42.81 N ATOM 70 CA GLU A 65 71.868 39.314 -1.649 1.00 41.81 C ATOM 71 C GLU A 65 70.398 39.183 -2.023 1.00 39.86 C ATOM 72 O GLU A 65 69.705 40.183 -2.224 1.00 38.62 O ATOM 73 CB GLU A 65 72.709 39.738 -2.855 1.00 46.32 C ATOM 74 CG GLU A 65 72.826 38.699 -3.942 1.00 52.04 C ATOM 75 CD GLU A 65 73.293 39.292 -5.252 1.00 53.57 C ATOM 76 OE1 GLU A 65 74.464 39.722 -5.317 1.00 56.94 O ATOM 77 OE2 GLU A 65 72.485 39.325 -6.210 1.00 57.38 O #1KMC ATOM 57 N ASN A 156 4.104 51.410 29.177 1.00 56.80 N ATOM 58 CA ASN A 156 3.715 51.408 27.777 1.00 56.56 C ATOM 59 C ASN A 156 2.680 52.484 27.465 1.00 55.22 C ATOM 60 O ASN A 156 1.476 52.220 27.437 1.00 57.02 O ATOM 61 CB ASN A 156 3.183 50.028 27.389 1.00 61.20 C ATOM 62 CG ASN A 156 3.231 49.791 25.902 1.00 65.84 C ATOM 63 OD1 ASN A 156 2.434 50.346 25.148 1.00 73.14 O ATOM 64 ND2 ASN A 156 4.184 48.975 25.464 1.00 69.03 N ATOM 65 N PHE A 156A 3.167 53.703 27.248 1.00 50.78 N ATOM 66 CA PHE A 156A 2.318 54.848 26.925 1.00 49.95 C ATOM 67 C PHE A 156A 2.974 55.716 25.851 1.00 53.17 C ATOM 68 O PHE A 156A 4.199 55.682 25.661 1.00 48.43 O ATOM 69 CB PHE A 156A 2.058 55.720 28.157 1.00 42.84 C ATOM 70 CG PHE A 156A 1.248 55.051 29.223 1.00 37.73 C ATOM 71 CD1 PHE A 156A 1.818 54.068 30.042 1.00 39.79 C ATOM 72 CD2 PHE A 156A -0.078 55.426 29.439 1.00 29.45 C ATOM 73 CE1 PHE A 156A 1.079 53.463 31.063 1.00 28.96 C ATOM 74 CE2 PHE A 156A -0.833 54.833 30.454 1.00 28.17 C ATOM 75 CZ PHE A 156A -0.252 53.849 31.274 1.00 30.01 C ATOM 76 N GLU A 161 2.150 56.503 25.165 1.00 51.49 N ATOM 77 CA GLU A 161 2.626 57.380 24.110 1.00 53.58 C ATOM 78 C GLU A 161 3.809 58.225 24.599 1.00 59.31 C ATOM 79 O GLU A 161 4.795 58.409 23.879 1.00 60.90 O ATOM 80 CB GLU A 161 1.472 58.262 23.649 1.00 56.40 C ATOM 81 CG GLU A 161 1.737 59.064 22.407 0.00 63.60 C ATOM 82 CD GLU A 161 0.459 59.628 21.831 0.00 67.17 C ATOM 83 OE1 GLU A 161 -0.241 60.371 22.550 0.00 69.29 O ATOM 84 OE2 GLU A 161 0.152 59.319 20.661 0.00 69.88 O
-
I see, that is different and I guess what we really want is a 'clean.pdb()' option in the future that will somehow sensibly catch and deal with/flag these types of inconsistencies?
-
reporter That is a good idea! What about a clean.pdb option in read.pdb(). If TRUE, we remove 'alt' coords and renumber residue for 'insert'. Meanwhile, we add a flag (e.g. 'clean') in the 'pdb' object to indicate if the pdb is modified from the raw file. We can have the same flag in 'pdbs' object. In related functions, we check the flag and do appropriate operations.
-
Maybe a separate function would be best to start with. This function could then be called within read.pdbs() and pdbfit() etc. with a new flag. I am just cautious about adding to much bloat to read.pdb() before testing things somewhat extensively. What do you think?
-
reporter Sure. I can start with the new function and see how it works. It basically a mini version of 'convert.pdb()' I guess.
-
Sounds good, perhaps we could refine and speed-up convert.pdb() somewhat and then just call it in your new function.
-
reporter Okay. I'll take care of the two. For "cleaning", here are what I am thinking about:
- remove 'alt'
- renumber residues and atoms if any 'insert' is not NA. Renumbering can be done consecutively cross all chains or restart within each chain, controlled by the argument consecutive (as in the convert.pdb()).
- a force.renumber option to do renumbering even if all 'insert' are NA.
- map non-standard AA to equivalent standard AA name (this is optional)
- add a 'clean=TRUE' flag to the 'pdb' object
Anything else?
-
Maybe also add a chain record if non present. Plus a detection and print out of possible chain breaks and the option to re-label chains accordingly? Remove water and/or ligands option. Others we can add later.
-
reporter Thanks! The chain record is definitely should be taken into account also. I just wondering if the current 'chain.pdb()' is sufficient to distinguish the breaks of distinct chains from those caused by 'missing residues'. Or, maybe we can tell missing residues from other information like the REMARK in the new version of 'pdb' object format we discussed elsewhere. What do you think?
-
I suggest using the chain.pdb() approach. The remark records are not always reliable in this respect.
-
reporter - changed status to resolved
resolved with the new function clean.pdb()
- Log in to comment
If your pdbs object was created with 'rm.insert=FALSE' then fit.xyz() should respect this and give consistent output. Treating resno as a character is a bad idea.
I suspect that we don't need to include alternate atoms in the selection indices used within fit.xyz() but do we want their moved coordinated returned in xyz? Why would you actually want this?
Regardless perhaps we should add an option to atom.select() to drop alternates if they exist in the intersection of the selection string. (i.e. add a rm.insert=TRUE option to atom.select())?
We can then use this within fit.xyz() although this may be just 'kicking the can down the road' and not actually helping to solve your issue as I don't follow what you want to do eventually with these alternate coords.