How to use Bio3D to carry out a cluster and Principal Components analysis (PCA) of DNA structures

Issue #168 resolved
Former user created an issue

I am interesting to use Bio3D to carry out a cluster and Principal Components analysis (PCA)of some pdb files3D DNA structures. I have several problems when I use the program for DNA structures. My question is: Can I use Bio3D for the analysis of DNA structures or this program is only working with protein structures? I would like to run the PCA and cluster analysis of DNA structures which the PDB codes are : 2MO2, 2MO7, 2LIA, 2L7D, 2KAL, 2K0V, 2Z2H, 2Z2H, 2NPW, 2NQ0, 2NQ0, 2NQ1, 2NQ4, 2HKB. I want to select only the first structure of the NMR file and X-ray for the PCA and analysis and the atoms that I would like to analyze in all structures are: a) the carbons in the bases. b) the nitrogens in the bases c) phospates I am interesting to know the steps to use this program for DNA structures. Thank you very much for your help

Comments (7)

  1. Lars Skjærven

    Hi, This is fully doable with bio3d, but unfortunately it will require a bit more coding than what it would for proteins. The main reason is that the pdbaln() function is not designed for nucleic acids at the moment. Thus, you'll have to go over each pdb file and fetch the corresponding coordinates.

    # fetch pdbs
    pdbcodes = c("2MO2", "2MO7", "2LIA")
    files = get.pdb(pdbcodes)
    
    # matrix to store xyz coordinates
    xyz <- NULL
    
    # for pdb 1
    pdb = read.pdb(files[1])
    sele = atom.select(pdb, "noh", resno=1:2)
    xyz <- rbind(xyz, pdb$xyz[1, sele$xyz])
    
    # for pdb 2
    pdb = read.pdb(files[2])
    sele = atom.select(pdb, "noh", resno=1:2)
    xyz <- rbind(xyz, pdb$xyz[1, sele$xyz])
    
    # for pdb 3
    #pdb = read.pdb(files[3])
    #sele = atom.select(pdb, "noh", resno=1:2)
    #xyz <- rbind(xyz, pdb$xyz[1, sele$xyz])
    
    #etc ...
    
    # superimpose
    xyz = fit.xyz(fixed = xyz[1,], mobile=xyz)
    
    # PCA
    pc = pca.xyz(xyz)
    
    mktrj.pca(pc, pc=1, file="pc1.pdb", 
                   resno=pdb$atom$resno[sele$atom], 
                   elety=pdb$atom$elety[sele$atom],
                   resid=pdb$atom$resid[sele$atom],
    )
    

    This is obviously only for the two first PDBs in your list and only the two first residues, but you get the picture. Use atom.select() to retrieve only the atoms you are interested in (i.e. all pdbs doesn't contain the same sequence).

    Hope this helps! Lars

  2. Rafa del Villar

    Thank you very much for your answer!! I was trying to use your code with this example where I have selected only the guanines of these DNAs file to carry out the PCA analysis. Every is working OK until I try to save the PCA result as a PDB file using the code

    mktrj.pca(pc, pc=1, file="pc1.pdb", 
              resno=pdb$atom$resno[sele$atom], 
              elety=pdb$atom$elety[sele$atom],
              resid=pdb$atom$resid[sele$atom],
    )
    

    This is the error that I got

    Error in write.pdb(xyz = coor, file = file, ...) : 
      write.pdb: the lengths of all input vectors != 'length(xyz)/3'.
    

    But when I used the code

    mktrj.pca(pc, pc=1, file="pc1.pdb")
    

    I get a PDB file like this:

    MODEL 1 ATOM 1 CA ALA 1 9.344 -15.030 14.086 1.00 0.00
    ATOM 2 CA ALA 2 9.650 -15.839 13.823 1.00 0.00
    ATOM 3 CA ALA 3 9.070 -14.059 13.564 1.00 0.00
    ………

    I do not know what the problem is in order to write a PDB file. How can I get a correct PDBfile?

    In addition, I would like to know how I can write the labels (PDB ID) of the DNA structures in the plot of PC1 vs PC2.

    ######Input code
    
    # fetch pdbs
    pdbcodes = c("2MO2", "2MO7", "2LIA", "2L7D") 
    files = get.pdb(pdbcodes)
    # matrix to store xyz coordinates
    xyz <- NULL
    for (n in 1:4)
    {
      pdb = read.pdb(files[n])
      sele = atom.select(pdb, "noh", resid="DG")
      xyz <- rbind(xyz, pdb$xyz[1, sele$xyz])
    
    }
    
    # superimpose
    xyz = fit.xyz(fixed = xyz[1,], mobile=xyz)
    
    # PCA
    pc = pca.xyz(xyz)
    #plot PCA results
    plot(pc)
    
    mktrj.pca(pc, pc=1, file="pc1.pdb", 
              resno=pdb$atom$resno[sele$atom], 
              elety=pdb$atom$elety[sele$atom],
              resid=pdb$atom$resid[sele$atom],
    )
    

    ****output of mktrj.pca(pc, pc=1, file="pc1.pdb", resno=pdb$atom$resno[sele$atom], elety=pdb$atom$elety[sele$atom], resid=pdb$atom$resid[sele$atom],) function is

    Error in write.pdb(xyz = coor, file = file, ...) : 
      write.pdb: the lengths of all input vectors != 'length(xyz)/3'.
    
  3. Lars Skjærven

    Hi Rafa, Note the warnings appearing in your for-loop.

    Warning messages:
    1: In rbind(xyz, pdb$xyz[1, sele$xyz]) :
      number of columns of result is not a multiple of vector length (arg 2)
    2: In rbind(xyz, pdb$xyz[1, sele$xyz]) :
      number of columns of result is not a multiple of vector length (arg 2)
    3: In rbind(xyz, pdb$xyz[1, sele$xyz]) :
      number of columns of result is not a multiple of vector length (arg 2)
    

    The problem is that you select 176, 132, 126, and 132 atoms for the four PDBs, and then attempt to merge the XYZ coordinates of these into one matrix. As I mentioned in the previous reply you need to make sure that you select the same number of atoms for each PDB. That was why I didn't use a for-loop in my example, but wrote the atom selection tailored for each PDB. atom.select(pdb, "noh", resid="DG") selects all atoms within residues with name DG - which would be find if the PDBs had equal number of DG residues. Lars

  4. Barry Grant

    Assumed to be resolved, please re-open if required. We have a long-term, low priority ToDo item for being more friendly to nucleic acid containing structures.

  5. Log in to comment