pca of multichain protein

Issue #527 resolved
Ginni created an issue

Hi

I am new to R and need help with running the pca code for a multi-chain protein. I have a protein with 14 subunits, and I would like to align only certain chains (7-14) of the protein followed by principal component analysis.

Thank you.

Comments (12)

  1. Xinqiu Yao

    Briefly, you need to select a subset of atoms first and then use those atoms to fit structures and do pca. For example,

    # select C-alpha atoms of chains 7-14
    ch <- unique(pdb$atom$chain)
    inds <- atom.select(pdb, 'calpha', chain=ch[7:14])
    
    # fit structures based on the selected atoms. Suppose trajectory is 'xyz'
    xyz.fitted <- fit.xyz(pdb$xyz, xyz, inds$xyz, inds$xyz)
    
    # do pca on all C-alpha atoms
    ca.inds <- atom.select(pdb, 'calpha')
    pc <- pca(xyz.fitted[, ca.inds$xyz)
    

    I haven't tested above codes. Let me know if you have any questions.

    For beginners, I recommend go through bio3d tutorials, especially 'Beginning structure analysis' and 'Principal component analysis'. http://thegrantlab.org/bio3d/tutorials.

    You can find links to learning materials of R here: http://thegrantlab.org/bio3d/user-guide

  2. Lars Skjærven

    Note that for some multimers you will see that the chains are not in the same spatial order in the PDB file relative to each other, so you will have to rearrange them manually using trim.pdb and cat.pdb.

  3. Ginni reporter

    Thank you very much for your help. I have a trivial question regarding the code above.

    I have ~200 structures of a multimeric protein, which I want align on G-K chains. I am getting the structures using get.pdb as in the tutorial, creating raw.files object. When I try to pass it to 'ch' object, it gives me an error- Error in raw.files$atom : $ operator is invalid for atomic vectors

    Thank you again for your help with this.

  4. Xinqiu Yao

    Are the 200 structures from a MD simulation or experiment? The above codes assume your structures are from MD. If you are using X-ray structures, the situation becomes more complicated. As Lars said, you might need to rearrange chains before alignment.

    Let me know what is your case.

  5. Xinqiu Yao

    You need to check whether chains are equivalent and stored in the same order across the 200 structures. Try following codes for the checking:

    # Suppose you have stored all raw pdb files under 'raw_pdbs/'.
    files <- list.files('raw_pdbs', '\\.pdb$', full.names=TRUE)
    pdbs <- pdbaln(files)
    

    The last line of codes should generate a file named 'aln.fa' containing sequence alignments of all the structures. Check the alignment with e.g. 'seaview' or your favorite alignment-visualization programs.

    If the alignment looks okay, move on to fit structures and do PCA:

    # First, select chains G-K
    inds <- which(pdbs$chain[1, ] %in% c('G', 'H', 'I', 'J', 'K'))
    
    # non-gap positions
    gaps.res <- gap.inspect(pdbs$ali)
    gaps.pos <- gap.inspect(pdbs$xyz)
    
    # combine the chain-selection and non-gap positions.
    inds <- intersect(inds, gaps.res$f.inds)
    
    # fit structures based on the selected chains
    xyz.fitted <- fit.xyz(pdbs$xyz[1, ], pdbs, atom2xyz(inds), atom2xyz(inds))
    
    # do pca for all chains
    pc <- pca(xyz.fitted[, gaps.pos$f.inds])
    

    I haven't tested the codes. Let me know if you run into any problems.

  6. Lars Skjærven

    But this will not account for difference in spatial rearrangements. Consider the following example:

    files = c("5den", "5fgj", "5egq")
    pdbs = pdbaln(files)
    xyz = pdbfit(pdbs, outpath="pdbfit")
    

    Although the 3 PDBs are of the same protein, they clearly don't superimpose.

    What you need to do is to reassign the chain IDs so that they have the same spatial arrangement:

    # we'll use 5den as template
    get.pdb("5den")
    
    # for 5fgj, swap chains C and D
    p <- read.pdb("5fgj")
    inds1 <- atom.select(p, chain="C")
    inds2 <- atom.select(p, chain="D")
    p$atom$chain[inds1$atom] = "D"
    p$atom$chain[inds2$atom] = "C"
    
    n <- cat.pdb(trim(p, chain="A"),
                 trim(p, chain="B"),
                 trim(p, chain="C"),
                 trim(p, chain="D")) 
    n$atom$eleno <- 1:nrow(n$atom)
    write.pdb(n, "5fgj_rechain.pdb")
    
    # for 5egq, swap chains C and D
    p <- read.pdb("5egq")
    inds1 <- atom.select(p, chain="C")
    inds2 <- atom.select(p, chain="D")
    p$atom$chain[inds1$atom] = "D"
    p$atom$chain[inds2$atom] = "C"
    
    n <- cat.pdb(trim(p, chain="A"),
                 trim(p, chain="B"),
                 trim(p, chain="C"),
                 trim(p, chain="D")) 
    n$atom$eleno <- 1:nrow(n$atom)
    write.pdb(n, "5egq_rechain.pdb")
    
    
    # re-read rechained PDBs
    files <- c("5den.pdb", "5egq_rechain.pdb", "5fgj_rechain.pdb")
    pdbs <- pdbaln(files)
    pdbs$xyz = pdbfit(pdbs, outpath="pdbfit2")
    
    # finally we can do PCA!
    core = core.cmap(pdbs)
    pdbs$xyz = pdbfit(pdbs, core, outpath="pdbfit2")
    pc.xray = pca(pdbs, use.svd=TRUE)
    mktrj(pc.xray)
    

    Since this is surely not feasible for 200+ structures, we have to design another approach for this. That's one problem, but also aligning multichain PDBs might easily lead to alignment errors. e.g. in the above example the N terminal of one chain aligns with a residue from the C terminal of the adjacent chain. A workaround for that problem might be (given identical chains):

    splitfiles <- pdbsplit(files)
    pdbs0 <- pdbaln(splitfiles)
    
    ali = NULL
    for( i in seq(1, nrow(pdbs0$ali), by=nrow(pdbs0$ali)/length(files))) {
      f = c(pdbs0$ali[i,], pdbs0$ali[i+1,], pdbs0$ali[i+2,], pdbs0$ali[i+3, ])
      ali = rbind(ali, f)
    }
    
    ali = as.fasta(ali)
    ali$id = files
    gaps = gap.inspect(ali$ali)
    ali$ali = ali$ali[, gaps$col < nrow(ali$ali)]
    pdbs = read.fasta.pdb(aln = ali)
    
    # redo PCA- 
    core = core.cmap(pdbs)
    pdbs$xyz = pdbfit(pdbs, core, outpath="pdbfit3")
    pc.xray = pca(pdbs, use.svd=TRUE)
    mktrj(pc.xray)
    

    hmm..

  7. Ginni reporter

    Thank you very much for your time and detailed explanation.

    I arranged the chains in the pdbs such that they are similar in all the proteins and are also spatially equivalent and then tried the code above (modifying a few parts) and it works great.

    Thank you again for your help.

  8. Ginni reporter

    Hi,

    I have one more question regarding making the PDB format trajectory of the Principal Components. When I use the mktraj function, I get the trajectory file which includes CA atoms and residues are all labeled Ala. How can I get a trajectory with original residue number or identity ?

    Thank you very much.

  9. Xinqiu Yao

    You can provide a pdb object in mktrj() to use as a reference. Check ?mktrj, which gives detailed description.

  10. Log in to comment