Missing Residues in PCA Analysis

Issue #951 resolved
Christos Efthymiou created an issue

Hi, I am running Bio3D on a supercomputer to perform analysis of some simulations. In particular, I am doing PCA analysis on a DCD file which combines several wildtype and mutant simulations. I used catdcd to write each dcd file into just its backbone and then used catdcd again to combine all of the simulations into a single dcd file. This is the code I am using right now:

# Load the required libraries
library(bio3d)

# Set the file paths
dcd <- "all_combined.dcd"
pdb <- "backbone.pdb"

# Read the trajectory and reference structure
traj <- read.dcd(dcd)
ref <- read.pdb(pdb)

# Select CA atoms
ca.inds <- atom.select(ref, elety = "CA")

# Superpose trajectory onto the reference structure
xyz <- fit.xyz(fixed = ref$xyz, mobile = traj, fixed.inds = ca.inds$xyz, mobile.inds = ca.inds$xyz)

# Perform PCA analysis
pc <- pca.xyz(xyz[, ca.inds$xyz])
print(pc)
plot(pc, col=bwr.colors(nrow(xyz)) )
write.csv(pc$z[,1:3], file="MyOutputFile.csv", quote=FALSE)

hc <- hclust(dist(pc$z[,1:2]))
grps <- cutree(hc, k=2)
plot(pc, col=grps)

plot.bio3d(pc$au[,1], ylab="PC1 (A)", xlab="Residue Position", typ="l")
points(pc$au[,2], typ="l", col="blue")

p1 <- mktrj.pca(pc, pc=1, b=pc$au[,1], file="pc1.pdb")
p2 <- mktrj.pca(pc, pc=2,b=pc$au[,2], file="pc2.pdb")

Mostly everything runs as expected, however when I look at the third plot, it seems that some residues are missing from the trajectory:

Checking the error log from the supercomputer, I see the following:

Warning messages:
1: In readChar(trj, 80) : truncating string with embedded nuls
2: In readChar(trj, 80) : truncating string with embedded nuls
3: In dcd.header(trj, verbose) :
Check DCD header data is correct, particulary natom

However, if I load the same dcd file into VMD, I don’t see any missing residues and the number of atoms matches what I expected from the index file I created for the backbone. How can I resolve this issue? I cannot attach the DCD file here as it is too large.

Thank you for any advice!

Comments (13)

  1. Xinqiu Yao

    Hi,

    Can you check the dimensions of traj and ref by simply typing their names and return? Do they have the same number of atoms?

    Also, can you locate the residues that you suspect have missing data? If so, check if the coordinates of those residues are correct.

  2. Christos Efthymiou reporter

    Yes, I checked the dimensions of traj and ref and they have the same number of atoms:

    Total Frames#: 45000
    Total XYZs#: 5034, (Atoms#: 1678)

    [1]  31.814  -1.189  -10.996  <...>  0  0  0  [226530000]
    
    • attr: Matrix DIM = 45000 x 5034

    Call: read.pdb(file = pdb)

    Total Models#: 1
    Total Atoms#: 1678, XYZs#: 5034 Chains#: 2 (values: A E)

     Protein Atoms#: 1678  (residues/Calpha atoms#: 419)
     Nucleic acid Atoms#: 0  (residues/phosphate atoms#: 0)
    
     Non-protein/nucleic Atoms#: 0  (residues: 0)
     Non-protein/nucleic resid values: [ none ]
    

    The only reason I am suspecting there are missing residues is because of the straight horizontal lines in the plot above like residues 110-150 and because of the error message from above. How would I check the coordinates of those residues and make sure they are correct?

  3. Xinqiu Yao

    Yes, I do see the horizontal lines, too. You can print out the values of pc$au to locate the residues. For example, just print pc$au[100:150, 3] and find out which residues have identical values. Suppose one of them is 120, check traj of the same residue: traj[, ca.inds$xyz][, atom2xyz(120)]. You can probably plot the coordinates as lines to visually check.

  4. Christos Efthymiou reporter

    Could you please provide more guidance on how to plot the coordinates to check if they are the same? I am very new to R and I am not sure how to do this properly.

  5. Christos Efthymiou reporter

    I tried to figure this out, and this is the code that I used:

    # Extract PCA coordinates for a specific range (e.g., residues 100 to 150)
    residues_range <- 100:150
    pca_coords <- pc$au[residues_range, 3]
    
    # Find residues with identical PCA coordinates
    residues_with_identical_coords <- residues_range[duplicated(pca_coords)]
    
    # Plot the trajectory coordinates for the identified residues
    par(mfrow = c(1, 1))  # Reset the plot layout
    
    for (residue_index in residues_with_identical_coords) {
      coordinates <- traj[, ca.inds$xyz[residue_index]]
      plot(coordinates, type = "l", main = paste("Residue", residue_index), xlab = "Frame", ylab = "Coordinate")
    }
    

    Here is a link to the PDF of the plots that were generated (https://drive.google.com/file/d/1AONRFf7ic9bAfCpxhtZEZcB4qMoDfiHC/view?usp=sharing), but here are a couple examples:

    I am not entirely sure how to interpret these line graphs or how to fix the issue. Thanks for your help!

  6. Xinqiu Yao

    First of all, make sure the code ca.inds$xyz[residue_index]is what you want. In bio3d, ‘xyz’ is saved as triplets (each for ‘x', ‘y’, and ‘z’ coordinate) and so the length of “ca.inds$xyz” is three times longer than the range of residue numbers. To get the proper 'xyz’ index, you need to call the atom2xyz()function, which converts “residue_index” to the corresponding xyz. For example, use ca.inds$xyz[atom2xyz(residue_index)].

    Please correct that before we dig into the result.

  7. Christos Efthymiou reporter

    I changed the code to the following (editing line 12):

    # Extract PCA coordinates for a specific range (e.g., residues 100 to 150)
    residues_range <- 100:150
    pca_coords <- pc$au[residues_range, 3]
    
    # Find residues with identical PCA coordinates
    residues_with_identical_coords <- residues_range[duplicated(pca_coords)]
    
    # Plot the trajectory coordinates for the identified residues
    par(mfrow = c(1, 1))  # Reset the plot layout
    
    for (residue_index in residues_with_identical_coords) {
      coordinates <- traj[, ca.inds$xyz[atom2xyz(residue_index)]]
      plot(coordinates, type = "l", main = paste("Residue", residue_index), xlab = "Frame", ylab = "Coordinate")
    }
    

    But now the coordinate plots are empty:

    Is there something else I should fix in the code?

    I am also a bit confused as to how there can be missing residues in the PCA analysis when the dimensions of the trajectory and pdb file are correct. Wouldn’t that suggest there must be an error in the portion of the code for performing PCA, not that there is an issue with the files or the formats? The error warning recognizes that the number of atoms it finds does not match the header of the dcd file, but it still pulls all the atoms.

  8. Xinqiu Yao

    Are there any error or warning messeages besides the one regarding DCD?

    Also, instead of making a plot, can you simply print out the coordinate (for just one residue) and see if values in the trajectory are fine? For example, simply type traj[, ca.inds$xyz[atom2xyz(residues_with_identical_coords[1])]] and hit return. The empty plot may suggest something wrong in the DCD…

  9. Christos Efthymiou reporter

    I typed exactly what you suggested above and see the following:

             [,1] [,2] [,3]
        [1,]    0    0    0
        [2,]    0    0    0
        [3,]    0    0    0
        [4,]    0    0    0
        [5,]    0    0    0
        [6,]    0    0    0
        [7,]    0    0    0
        [8,]    0    0    0
        [9,]    0    0    0
       [10,]    0    0    0
       [11,]    0    0    0
       [12,]    0    0    0
       [13,]    0    0    0
       [14,]    0    0    0
       [15,]    0    0    0
       [16,]    0    0    0
       [17,]    0    0    0
       [18,]    0    0    0
       [19,]    0    0    0
       [20,]    0    0    0
    

    That continues for all the frames. I assume this means the DCD file does in fact have a problem where these residues do not have coordinates in the trajectory?

    I have double checked some of the output files when I was writing the trajectories to just their backbone and then combining them, and I see the following:

    Info) Dynamically loaded 2 plugins in directory:
    Info) /shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/molfile
    *** Error in `/shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/bin/catdcd5.1/catdcd': double free or corruption (!prev): 0x00000000010d1140 ***
    ======= Backtrace: =========
    /lib64/libc.so.6(+0x81329)[0x2af553da0329]
    /shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/bin/catdcd5.1/catdcd(atan+0x1b93)[0x40a743]
    /lib64/libc.so.6(__libc_start_main+0xf5)[0x2af553d41555]
    /shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/bin/catdcd5.1/catdcd(fmod+0x8a)[0x408e9a]
    ======= Memory map: ========
    00400000-005ab000 r-xp 00000000 67:70864 144115274109512152              /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/bin/catdcd5.1/catdcd
    006aa000-006bd000 rw-p 001aa000 67:70864 144115274109512152              /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/plugins/LINUXAMD64/bin/catdcd5.1/catdcd
    006bd000-006c7000 rw-p 00000000 00:00 0
    010c3000-01100000 rw-p 00000000 00:00 0                                  [heap]
    3f8a200000-3f8a220000 r-xp 00000000 67:70864 144115274109512127          /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/libexpat.so.0
    3f8a220000-3f8a320000 ---p 00020000 67:70864 144115274109512127          /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/libexpat.so.0
    3f8a320000-3f8a322000 rw-p 00020000 67:70864 144115274109512127  
    /shared/ucl/apps/vmd/1.9.3/text-only/lib/libexpat.so.0
    2af5530d4000-2af5530f6000 r-xp 00000000 08:03 148055                     /usr/lib64/ld-2.17.so
    2af5530f6000-2af5530fb000 rw-p 00000000 00:00 0
    2af55310e000-2af55310f000 rw-p 00000000 00:00 0
    2af55310f000-2af55319b000 r-xp 00000000 67:70864 144115274109512138      /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/libsqlite3.so.0
    2af55319b000-2af55329a000 ---p 0008c000 67:70864 144115274109512138      /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/libsqlite3.so.0
    2af55329a000-2af55329e000 rw-p 0008b000 67:70864 144115274109512138      /lustre/shared/ucl/apps/vmd/1.9.3/text-only/lib/libsqlite3.so.0
    2af55329e000-2af5532a4000 rw-p 00000000 00:00 0
    2af5532f5000-2af5532f6000 r--p 00021000 08:03 148055                     /usr/lib64/ld-2.17.so
    2af5532f6000-2af5532f7000 rw-p 00022000 08:03 148055                     /usr/lib64/ld-2.17.so
    2af5532f7000-2af5532f8000 rw-p 00000000 00:00 0
    2af5532f8000-2af5532fa000 r-xp 00000000 08:03 148247                     /usr/lib64/libdl-2.17.so
    2af5532fa000-2af5534fa000 ---p 00002000 08:03 148247                     /usr/lib64/libdl-2.17.so
    2af5534fa000-2af5534fb000 r--p 00002000 08:03 148247                     /usr/lib64/libdl-2.17.so
     2af5534fb000-2af5534fc000 rw-p 00003000 08:03 148247                     /usr/lib64/libdl-2.17.so
    2af5534fc000-2af5535e7000 r-xp 00000000 67:70864 144115273706872383      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libstdc++.so.6.0.20
    2af5535e7000-2af5537e7000 ---p 000eb000 67:70864 144115273706872383      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libstdc++.so.6.0.20
    2af5537e7000-2af5537ef000 r--p 000eb000 67:70864 144115273706872383      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libstdc++.so.6.0.20
    2af5537ef000-2af5537f1000 rw-p 000f3000 67:70864 144115273706872383      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libstdc++.so.6.0.20
    2af5537f1000-2af553806000 rw-p 00000000 00:00 0
    2af553806000-2af553907000 r-xp 00000000 08:03 148444                     /usr/lib64/libm-2.17.so
    2af553907000-2af553b06000 ---p 00101000 08:03 148444                     /usr/lib64/libm-2.17.so
    2af553b06000-2af553b07000 r--p 00100000 08:03 148444                     /usr/lib64/libm-2.17.so
    2af553b07000-2af553b08000 rw-p 00101000 08:03 148444                     /usr/lib64/libm-2.17.so
    2af553b08000-2af553b1e000 r-xp 00000000 67:70864 144115273706872320      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libgcc_s.so.1
    2af553b1e000-2af553d1d000 ---p 00016000 67:70864 144115273706872320      /lustre/shared/ucl/apps/gcc/4.9.2/lib64/libgcc_s.so.1
    2af553d1d000-2af553d1e000 r--p 00015000 67:70864 144115273706872320      /lustre
    /shared/ucl/apps/gcc/4.9.2/lib64/libgcc_s.so.1
    2af553d1e000-2af553d1f000 rw-p 00016000 67:70864 144115273706872320      /lustre
    /shared/ucl/apps/gcc/4.9.2/lib64/libgcc_s.so.1
    2af553d1f000-2af553ee3000 r-xp 00000000 08:03 148170                     /usr/li
    b64/libc-2.17.so
    2af553ee3000-2af5540e2000 ---p 001c4000 08:03 148170                     /usr/li
    b64/libc-2.17.so
    2af5540e2000-2af5540e6000 r--p 001c3000 08:03 148170                     /usr/li
    b64/libc-2.17.so
    2af5540e6000-2af5540e8000 rw-p 001c7000 08:03 148170                     /usr/li
    b64/libc-2.17.so
    2af5540e8000-2af5540ed000 rw-p 00000000 00:00 0
    2af5540ed000-2af554104000 r-xp 00000000 08:03 148564                     /usr/li
    b64/libpthread-2.17.so
    2af554104000-2af554303000 ---p 00017000 08:03 148564                     /usr/li
    b64/libpthread-2.17.so
    2af554303000-2af554304000 r--p 00016000 08:03 148564                     /usr/li
    b64/libpthread-2.17.so
    2af554304000-2af554305000 rw-p 00017000 08:03 148564                     /usr/lib64/libpthread-2.17.so
    2af554305000-2af554309000 rw-p 00000000 00:00 0
    2af558000000-2af558021000 rw-p 00000000 00:00 0
    2af558021000-2af55c000000 ---p 00000000 00:00 0
    7ffe2293b000-7ffe2295f000 rw-p 00000000 00:00 0                          [stack]
    7ffe22976000-7ffe22978000 r-xp 00000000 00:00 0                          [vdso]
    ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]
    CatDCD 5.1
    Reading indices from file 'indexfile'
    dcdplugin) detected standard 32-bit DCD file of native endianness
    dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
    Opening file 'all_combined.dcd' for writing.
    dcdplugin) detected standard 32-bit DCD file of native endianness
    dcdplugin) CHARMM format DCD file (also NAMD 2.1 and later)
    Opened file 'MD_backbone_Wildtype_1.dcd' for reading.
    

    Any ideas on what this error means or how to solve it? Catdcd ends up running and joining the files anyway, but I assume that error at the beginning has something to do with the issues I am having.

  10. Christos Efthymiou reporter

    I resolved the issue, there was some problem with the merged dcd file. I made it again and it seems to be working now even though I did the same thing. Thanks!

  11. Log in to comment