pca of multichain protein

Issue #527 resolved

Ginni created an issue 2017-11-07

I am new to R and need help with running the pca code for a multi-chain protein. I have a protein with 14 subunits, and I would like to align only certain chains (7-14) of the protein followed by principal component analysis.

Thank you.

Comments (12)

Xinqiu Yao
Briefly, you need to select a subset of atoms first and then use those atoms to fit structures and do pca. For example,
```
# select C-alpha atoms of chains 7-14
ch <- unique(pdb$atom$chain)
inds <- atom.select(pdb, 'calpha', chain=ch[7:14])

# fit structures based on the selected atoms. Suppose trajectory is 'xyz'
xyz.fitted <- fit.xyz(pdb$xyz, xyz, inds$xyz, inds$xyz)

# do pca on all C-alpha atoms
ca.inds <- atom.select(pdb, 'calpha')
pc <- pca(xyz.fitted[, ca.inds$xyz)
```
I haven't tested above codes. Let me know if you have any questions.

For beginners, I recommend go through bio3d tutorials, especially 'Beginning structure analysis' and 'Principal component analysis'. http://thegrantlab.org/bio3d/tutorials.

You can find links to learning materials of R here: http://thegrantlab.org/bio3d/user-guide
- 2017-11-07T21:46:24+00:00
Lars Skjærven
Note that for some multimers you will see that the chains are not in the same spatial order in the PDB file relative to each other, so you will have to rearrange them manually using trim.pdb and cat.pdb.
- 2017-11-08T20:36:48+00:00
Lars Skjærven
- marked as task
- 2017-11-09T19:53:47+00:00
Ginni reporter
Thank you very much for your help. I have a trivial question regarding the code above.

I have ~200 structures of a multimeric protein, which I want align on G-K chains. I am getting the structures using get.pdb as in the tutorial, creating raw.files object. When I try to pass it to 'ch' object, it gives me an error- Error in raw.files$atom : $ operator is invalid for atomic vectors

Thank you again for your help with this.
- 2017-11-10T17:33:28+00:00
Xinqiu Yao
Are the 200 structures from a MD simulation or experiment? The above codes assume your structures are from MD. If you are using X-ray structures, the situation becomes more complicated. As Lars said, you might need to rearrange chains before alignment.

Let me know what is your case.
- 2017-11-10T18:01:12+00:00
Ginni reporter
Thank you...These are X-ray and EM structures.
- 2017-11-10T18:16:35+00:00

Xinqiu Yao

You need to check whether chains are equivalent and stored in the same order across the 200 structures. Try following codes for the checking:

# Suppose you have stored all raw pdb files under 'raw_pdbs/'.
files <- list.files('raw_pdbs', '\\.pdb$', full.names=TRUE)
pdbs <- pdbaln(files)

The last line of codes should generate a file named 'aln.fa' containing sequence alignments of all the structures. Check the alignment with e.g. 'seaview' or your favorite alignment-visualization programs.

If the alignment looks okay, move on to fit structures and do PCA:

# First, select chains G-K
inds <- which(pdbs$chain[1, ] %in% c('G', 'H', 'I', 'J', 'K'))

# non-gap positions
gaps.res <- gap.inspect(pdbs$ali)
gaps.pos <- gap.inspect(pdbs$xyz)

# combine the chain-selection and non-gap positions.
inds <- intersect(inds, gaps.res$f.inds)

# fit structures based on the selected chains
xyz.fitted <- fit.xyz(pdbs$xyz[1, ], pdbs, atom2xyz(inds), atom2xyz(inds))

# do pca for all chains
pc <- pca(xyz.fitted[, gaps.pos$f.inds])

I haven't tested the codes. Let me know if you run into any problems.

2017-11-13T17:15:57+00:00

Lars Skjærven

But this will not account for difference in spatial rearrangements. Consider the following example:

files = c("5den", "5fgj", "5egq")
pdbs = pdbaln(files)
xyz = pdbfit(pdbs, outpath="pdbfit")

Although the 3 PDBs are of the same protein, they clearly don't superimpose.

What you need to do is to reassign the chain IDs so that they have the same spatial arrangement:

# we'll use 5den as template
get.pdb("5den")

# for 5fgj, swap chains C and D
p <- read.pdb("5fgj")
inds1 <- atom.select(p, chain="C")
inds2 <- atom.select(p, chain="D")
p$atom$chain[inds1$atom] = "D"
p$atom$chain[inds2$atom] = "C"

n <- cat.pdb(trim(p, chain="A"),
             trim(p, chain="B"),
             trim(p, chain="C"),
             trim(p, chain="D")) 
n$atom$eleno <- 1:nrow(n$atom)
write.pdb(n, "5fgj_rechain.pdb")

# for 5egq, swap chains C and D
p <- read.pdb("5egq")
inds1 <- atom.select(p, chain="C")
inds2 <- atom.select(p, chain="D")
p$atom$chain[inds1$atom] = "D"
p$atom$chain[inds2$atom] = "C"

n <- cat.pdb(trim(p, chain="A"),
             trim(p, chain="B"),
             trim(p, chain="C"),
             trim(p, chain="D")) 
n$atom$eleno <- 1:nrow(n$atom)
write.pdb(n, "5egq_rechain.pdb")


# re-read rechained PDBs
files <- c("5den.pdb", "5egq_rechain.pdb", "5fgj_rechain.pdb")
pdbs <- pdbaln(files)
pdbs$xyz = pdbfit(pdbs, outpath="pdbfit2")

# finally we can do PCA!
core = core.cmap(pdbs)
pdbs$xyz = pdbfit(pdbs, core, outpath="pdbfit2")
pc.xray = pca(pdbs, use.svd=TRUE)
mktrj(pc.xray)

Since this is surely not feasible for 200+ structures, we have to design another approach for this. That's one problem, but also aligning multichain PDBs might easily lead to alignment errors. e.g. in the above example the N terminal of one chain aligns with a residue from the C terminal of the adjacent chain. A workaround for that problem might be (given identical chains):

splitfiles <- pdbsplit(files)
pdbs0 <- pdbaln(splitfiles)

ali = NULL
for( i in seq(1, nrow(pdbs0$ali), by=nrow(pdbs0$ali)/length(files))) {
  f = c(pdbs0$ali[i,], pdbs0$ali[i+1,], pdbs0$ali[i+2,], pdbs0$ali[i+3, ])
  ali = rbind(ali, f)
}

ali = as.fasta(ali)
ali$id = files
gaps = gap.inspect(ali$ali)
ali$ali = ali$ali[, gaps$col < nrow(ali$ali)]
pdbs = read.fasta.pdb(aln = ali)

# redo PCA- 
core = core.cmap(pdbs)
pdbs$xyz = pdbfit(pdbs, core, outpath="pdbfit3")
pc.xray = pca(pdbs, use.svd=TRUE)
mktrj(pc.xray)

hmm..

2017-11-13T18:24:15+00:00

Ginni reporter
Thank you very much for your time and detailed explanation.

I arranged the chains in the pdbs such that they are similar in all the proteins and are also spatially equivalent and then tried the code above (modifying a few parts) and it works great.

Thank you again for your help.
- 2017-11-14T00:08:28+00:00
Lars Skjærven
- changed status to resolved
- 2017-11-16T08:28:39+00:00
Ginni reporter
Hi,

I have one more question regarding making the PDB format trajectory of the Principal Components. When I use the mktraj function, I get the trajectory file which includes CA atoms and residues are all labeled Ala. How can I get a trajectory with original residue number or identity ?

Thank you very much.
- 2017-11-16T21:44:52+00:00
Xinqiu Yao
You can provide a pdb object in mktrj() to use as a reference. Check ?mktrj, which gives detailed description.
- 2017-11-17T00:12:38+00:00
Log in to comment

Assignee: –

Type: task

Priority: minor

Status: resolved

Component: –

Version: –

Votes: 0

Watchers: 1