PCA for user-defined binding site

Issue #379 resolved

skal24 created an issue 2016-07-22

Hi there,

I am trying to do a pca analysis for an ensemble of structures to know and cluster the conformations of their binding sites. I read the pdb file using the multi=TRUE option and defined the binding site residues using bsite.inds<-atom.select(pdb, "noh", resno=c(243,239,210, .....)) (Note: I need to use the "noh" and not just "CA")

I fitted the ensemble using, xyz<-fit.xyz(fixed=pdb$xyz[1,], mobile=pdb,fixed.inds=bsite.inds$xyz, mobile.inds=bsite.inds$xyz)

and did a pca using, pc<-pca.xyz(xyz[,bsite.inds$xyz])

In order to see the contribution of individual residues on each PCs I used, plot.bio3d(pc$au[,1], resno=bsite.inds$resno, ylab="PC1")

Hope, what I am doing is right till now.

The issue here is, I am unable to get the residue contribution based on the resno, i.e., the x-axis doesn't correspond to the residue no, but to the number of atoms(I guess). How can I get this done?

Thanks and Best Regards, Subha

Comments (21)

Xinqiu Yao
Hi,

The atom select object bsite.inds contains atomic indices only, which has no component such as '$resno'. You can create a subset pdb first and then provide the pdb for residue numbers in calling 'plot.bio3d()'. For example,
```
bsite <- atom.select(pdb, 'noh', resno=c(243, 239, 210, ...), value=TRUE) # Note the value=TRUE. 
plot.bio3d(pc$au[,1], resno=bsite, ylab="PC1")
```
Let me know if it works.
- 2016-07-25T14:30:42+00:00
Xinqiu Yao
You still need the bsite.inds that is generated with 'value=FALSE' for fitting and PCA. That means, you need two "atom.select" operations, one is for 'bsite.inds' as you did in the original code, and the other is for 'bsite' as I described above. Use 'bsite.inds' for fitting and PCA, and use 'bsite' for 'plot.bio3d()'.
- 2016-07-25T16:13:46+00:00
skal24 reporter
Hi Yao,

pdb<-read.pdb("allconf.pdb", multi=TRUE)

bsite.inds<-atom.select(pdb,'noh', resno=c(243,239,210,211,212,144,235,498,494,465,466,467,399,490,753,749,720,721,722,654,745,1008,1004,975,976,977,909,1000)

bsite<-atom.select(pdb,'noh', resno=c(243,239,210,211,212,144,235,498,494,465,466,467,399,490,753,749,720,721,722,654,745,1008,1004,975,976,977,909,1000), value=TRUE)

xyz<-fit.xyz(fixed=pdb$xyz[1,], mobile=pdb,fixed.inds<-bsite.inds$xyz, mobile.inds<-bsite.inds$xyz)

pc<-pca.xyz(xyz[,bsite.inds$xyz])

plot.bio3d(pc$au[,1], resno=bsite, ylab="PC1")

Warning message: In plotb3(...) : Length of input 'resno' does not equal the length of input 'x'; Ignoring 'resin'

The plot doesn't include the residue numbers
- 2016-07-25T16:15:17+00:00
Xinqiu Yao
Try plot.bio3d(pc$au[, 1], resno=bsite$resno, ylab='PC1'). The previous version assumes C-alpha PCA by default, and you are actually doing all-atom PCA.
- 2016-07-25T16:27:46+00:00
skal24 reporter
Hi Yao,

Doesn't work, the plot looks the same.

Thanks, Subha
- 2016-07-25T16:33:01+00:00
Xinqiu Yao
Forgot inserting 'atom'. What about plot.bio3d(pc$au[, 1], resno=bsite$atom$resno, ylab='PC1')?

Btw, I don't think such labeling could help to figure out residue wise contributions because the plot itself is atom wise anyway. An alternative way could be grouping atomic loadings by residue number and picking up the maximal value per residue for plotting residue wise loadings. For example,
```
au1 <- tapply(pc$au[, 1], bsite$atom$resno, max)
plot.bio3d(au1, resno=bsite)
```
- 2016-07-25T16:51:41+00:00
skal24 reporter
I used to calculate the mean of four atoms if I am using the backbone atoms for my PCA. I was wondering how to get this done for an all-atom plot. Using the max is a great suggestion. However, I am getting an error,

Error in matrix(colSums(rbind(gap.pos1, gap.pos2), na.rm = TRUE), ncol = ncol(x)) : invalid 'ncol' value (too large or NA)
- 2016-07-25T17:36:03+00:00
Xinqiu Yao
The idea is to group atoms by residue number (this assumes each residue has a unique residue number) and take either max or mean for each group. This is done with the tapply() function (See ?tapply for more details). In the above command 'mean' can be a substitute to the 'max' to have it calculate mean loading of each residue.

The error you've encountered looks weird. What command caused the error? Also, could you provide an example file (e.g. the pdb file) that I can reproduce the error?
- 2016-07-25T18:32:25+00:00
skal24 reporter
I got that error when I used,

au1 <- tapply(pc$au[, 1], bsite$atom$resno, max)

plot.bio3d(au1, resno=bsite)

I have attached a sample file.

Thanks for the support. Subha
- 2016-07-25T18:54:54+00:00
skal24 reporter
- attached sample.tar
- 2016-07-25T18:55:07+00:00
Xinqiu Yao
Okay. I've reproduced the error and it is the 'class' attribute of 'au1' that does not match the requirement of 'plot.bio3d()'. Simply adding au1 <- as.numeric(au1) after 'au1 <- tapply(...)' will solve the problem.
- 2016-07-25T19:13:10+00:00
skal24 reporter
Great! It's working.. Thanks a lot

Subha
- 2016-07-25T20:18:08+00:00
skal24 reporter
Just got another question,

Is it possible to get the closest structure from every pc space (and not just pc space 1:2) ?

Thanks
- 2016-07-25T20:51:16+00:00
Xinqiu Yao
Could you tell a bit more about what you are trying to do? What do you mean "closest structure"?
- 2016-07-27T01:19:51+00:00
skal24 reporter
I am trying to cluster the conformations that I have into groups based on the PCA analysis carried out for the binding site residues. The scree plot shows that only 53% of variance has been captured in the first three PCs. So, I would like cluster the structures from at least 5 PCs (below which we get less than 5% variance, so can be considered statistically insignificant).

My first question is how can I do this?

I presume this can be done by

hc <- hclust(dist(pc$z[,1:5])) (Please correct me , if I am wrong)

Once, clustering the PCs from first five PCs (based on an optimal value of k or h), I need to get the structures (or frames, from my multi pdb file) that has a minimum and maximum deviation in each cluster.

Going through previous discussions, I found that we can get the average of each group and identify the closest frame for each group using,

Calculate the mean of the first group and write to a PDB file

gp1.ave <- colMeans(xyz[grps==1,]) write.pdb(pdb=pdb, xyz=gp1.ave, file="AverageStructure_group1.pdb")

Find the closest frame (here we take the smallest RMSD) to the group1 average structure

gp1.min.ind <- which.min(rmsd(gp1.ave, xyz[grps==1,])) gp1.min.xyz <- xyz[grps==1,][gp1.min.ind,]

write.pdb(pdb=pdb, xyz=gp1.min.xyz, file="FrameRepresentative_group1.pdb")

However, this doesn't work for identifying closest structure from clusters obtained from 5PCs.

Sorry for the long description.

Thanks, Subha
- 2016-07-27T17:43:27+00:00
skal24 reporter
- attached Rplot-pc_bindingsite.pdf
- 2016-07-27T17:45:54+00:00
Xinqiu Yao
Yes, you are right of the commands for clustering using the first 5 PCs and subsequent choosing the closest frame to the mean in each group. In this case, the clustering is based on 5 PCs while choosing the representative frame is based on all PCs (RMSD is equivalent to the distance in space spanned by all PCs). This is actually a common practice and no need to worry about.

However, in a similar way you can still identify the representative in the same space in which the clustering is performed. What you need to do is to replace the xyz with pc$z[, 1:5] and use dist() instead of rmsd() in the above commands for calculating the mean and the distance between mean and every frame. For example,
```
gp1.ave <- colMeans(pc$z[grps==1,1:5])
dm <- as.matrix(dist(rbind(gp1.ave, pc$z[grps==1, 1:5]))
gp1.min.ind <- which.min(dm[1, 2:ncol(dm)]) 
gp1.min.xyz <- xyz[grps==1,][gp1.min.ind,]
write.pdb(pdb=pdb, xyz=gp1.min.xyz, file="FrameRepresentative_group1.pdb")
```
Of course, in this way you can't write the average structure for the group.
- 2016-07-27T18:55:07+00:00
skal24 reporter
Hi Yao,

Thanks a lot. However, I don't understand why I am unable to write the representative structures for group numbers that are larger than the number of PCs I used for clustering.

For example, When I perform clustering on the PC space 1:2 and divide them into say 5 clusters and try to write the representatives of the cluster 3, 4 and 5 using the above methods I get an error

Error in colMeans(xyz[grps == 3, ]) : 'x' must be an array of at least two dimensions

Can you pl explain why it is so?

Best Regards, Subha
- 2016-07-28T20:44:18+00:00
Xinqiu Yao
It looks the group #3 has only one member (type sum(grps==3) to see if it is 1). In this case, the syntax xyz[grps==3, ] reduces the xyz matrix to a vector, which caused 'colMeans()' to complain about the dimension. To solve it, use xyz[grps==3, ,drop=FALSE] instead.

Alternatively, you may just exclude these clusters with only one (or a number below a threshold) member from further analysis.
- 2016-07-28T21:36:21+00:00
skal24 reporter
Great! Thanks
- 2016-07-28T22:36:59+00:00
Xinqiu Yao
- changed status to resolved
- 2016-09-12T14:38:28+00:00
Log in to comment

Assignee: –

Type: task

Priority: minor

Status: resolved

Component: –

Version: –

Votes: 0

Watchers: 1

Comments (21)

Calculate the mean of the first group and write to a PDB file

Find the closest frame (here we take the smallest RMSD) to the group1 average structure