Error in reading netcdf file + Memory issues with fit.xyz()

Issue #77 resolved
rohit_arora created an issue

Hello, I have two questions: 1) I am trying to read a large trajectory file (50000 frames) generated by Amber12. The original format of the file is mdcrd, but since there is no function in bio3d to read this format, I converted it to netcdf format using ptraj. But whenever I try to read this file with the read.ncdf() function, I get the following:

 traj <- read.ncdf("trj.netcdf", time=TRUE, verbose=TRUE)
[1] "Reading file trj.netcdf"
[1] "Produced by program: cpptraj"
[1] "File conventions AMBER version 1.0"
[1] "Frames: 50001"
[1] "Atoms: 4542"
Error: cannot allocate vector of size 5.1 Gb

Unfortunately, there is no multicore option in read.ncdf() (or is there?), so I cannot seem to read big netcdf files. I have tried restarting R and even used the at.sel option to select only backbone atoms. Is there any workaround for this?
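
For what it's worth, the 5.1 Gb in the error message matches the size of the full coordinate matrix R would need to hold the trajectory, i.e. frames x (3 x atoms) double-precision values:

 # rough arithmetic for the allocation in the error message
 frames <- 50001
 atoms <- 4542
 frames * atoms * 3 * 8 / 1024^3  # bytes as 8-byte doubles, roughly 5.1 GiB

So the failure is simply the whole-trajectory matrix not fitting in RAM, independent of any multicore setting.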

2) Due to the first issue, and as discussed in the documentation, I converted my trajectory file into a dcd file using VMD, which reads fine. But now when I use fit.xyz() to fit my trajectory to a pdb structure (backbone atoms only), it takes forever and the system hangs, even with the maximum number of cores (ncore=8). Ultimately I get the error:

Error in fork() : Unable to fork R

Are there optimal values for ncore and nseg.scale that would make this work? Or would I be better off running it on a cluster with many more processors?

System/R Session info:

> sessionInfo(package="bio3d")
R version 3.0.2 (2013-09-25)
Platform: x86_64-redhat-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C             
 [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8    
 [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8   
 [7] LC_PAPER=en_US.utf8       LC_NAME=C                
 [9] LC_ADDRESS=C              LC_TELEPHONE=C           
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C      

attached base packages:
character(0)

other attached packages:
[1] bio3d_2.0-1

loaded via a namespace (and not attached):
 [1] base_3.0.2          BH_1.51.0-4         bigmemory.sri_0.1.2
 [4] datasets_3.0.2      graphics_3.0.2      grDevices_3.0.2    
 [7] methods_3.0.2       ncdf_1.6.6          parallel_3.0.2     
[10] stats_3.0.2         tools_3.0.2         utils_3.0.2        
[13] XML_3.98-1.1       

I have already spent over 2 days trying to fix these issues, so any help would be much appreciated!

Thank you

Best regards

Rohit

Comments (7)

  1. Lars Skjærven

    Hi Rohit, these big trajectory files can certainly be a problem, since R will attempt to read the entire trajectory into memory. I would therefore consider reducing the size of the input trajectory (e.g. you could divide it into several parts, or do

    strip '* &! @CA'
    

    in ptraj?).

    Alternatively, as you suggested, use the at.sel argument. See the examples below:

    trj <- read.ncdf("14_production/01_classA/hei.nc")
    Loading required package: ncdf
    [1] "Reading file 14_production/01_classA/hei.nc"
    [1] "Produced by program: sander"
    [1] "File conventions AMBER version 1.0"
    [1] "Frames: 15000"
    [1] "Atoms: 30744"
    Error: cannot allocate vector of size 10.3 Gb
    
    pdb <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
    sele <- atom.select(pdb, "calpha")
    trj <- read.ncdf("14_production/01_classA/hei.nc", at.sel=sele)
    dim(trj)
    [1] 15000   387
    
    xyz <- fit.xyz(trj[1,], trj)
    

    Does setting nseg.scale > 1 not help?

    See also the arguments first and last:

    trj = read.ncdf("14_production/01_classA/hei.nc", first=1, last=10)
    dim(trj)
    [1]    10 92232
    

    Hope it helps. L

  2. Xinqiu Yao

    Hi Rohit,

    To your first question:

    read.ncdf() doesn't support multicore (neither ncore nor nseg.scale works). As Lars suggested, you may need to reduce the file size before reading it into R, or use the at.sel option. at.sel possibly didn't work on your first attempt because read.ncdf() basically reads the whole data and then applies the atom selection. So, one way is to split the single big file into several smaller files and then read all of them at once with the command:

    trj <- read.ncdf(c(files_names), at.sel=sele)
    

    The second question:

    It usually uses more RAM with ncore > 1, which possibly caused the system to hang in your case. Have you tried ncore=1? Alternatively, you can set nseg.scale > 1 to split the data beforehand, e.g. ncore=8 and nseg.scale=20.
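
    As a sketch only ("top.pdb" and "trj.dcd" are placeholder file names):

    pdb <- read.pdb("top.pdb")
    sele <- atom.select(pdb, "backbone")
    trj <- read.dcd("trj.dcd")
    # try single-core first; if it works but is slow, raise ncore
    # and split the data with nseg.scale, e.g. ncore=8, nseg.scale=20
    xyz <- fit.xyz(fixed=pdb$xyz, mobile=trj,
                   fixed.inds=sele$xyz, mobile.inds=sele$xyz,
                   ncore=1)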

    Btw, I don't think it would help to run R on a cluster, because the multicore support in Bio3D cannot parallelize across CPU cores on different nodes.

    Let me know if you still have problems. Good luck!

  3. rohit_arora reporter

    Dear Lars and Xin-Qiu,

    Thank you both for your very useful comments.

    Lars:

    Your first suggestion:

    pdb <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
    sele <- atom.select(pdb, "calpha")
    trj <- read.ncdf("14_production/01_classA/hei.nc", at.sel=sele)
    dim(trj)
    [1] 15000   387
    

    This still doesn't work :/ Same error. What are your system specs? That might be the difference.

    Your second suggestion:

    trj = read.ncdf("14_production/01_classA/hei.nc", first=1, last=10)
    dim(trj)
    [1]    10 92232
    

    This seems like an interesting proposal, but could you please tell me how to combine all the frames into one single trajectory matrix after they have been read in multiple parts? R is not really known for its loops, so that might not be the best idea. We tried something similar, making lists of the trajectory matrix and working with chunks in the list for various functions, which also output lists. But from the viewpoint of structural biology that might not be correct, especially when calculating the average structure, etc. The resulting RMSD plots also seemed a little off. Perhaps I can take a look at this again. Do let me know what you think about this?

    Xin-Qiu

    This solution does seem rather elegant, if only you could please tell me the syntax for passing multiple files to the command you provided. Say I have 5 files: traj1.netcdf, traj2.netcdf, ..., traj5.netcdf.

    My apologies if my questions sound naive, but I am not exactly a pro in R. It's only Bio3D that has compelled me to work with R, and quite frankly it has been a good learning experience so far!

    Thanks again to both of you for your help.

    Regards Rohit

  4. Xinqiu Yao

    Hi Rohit,

    The command is very simple. Suppose your PDB file used for the MD simulation (having exactly the same number of atoms as the trajectory file) is "top.pdb":

    pdb <- read.pdb("top.pdb")
    sele <- atom.select(pdb, "calpha") # choose CA atoms only
    # sele <- atom.select(pdb, "backbone") # Or choose backbone atoms
    
    files <- paste("traj", 1:5, ".netcdf", sep="")
    trj <- read.ncdf(files, at.sel=sele)
    

    Note: check the location of your files to make sure they are in the same directory where you started R. Otherwise, you need to add the path to both the PDB and trajectory files, e.g. files <- file.path(your_path, files)
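
    Putting it together, assuming the files live in a directory "my_traj" (a placeholder path):

    path <- "my_traj"
    files <- file.path(path, paste("traj", 1:5, ".netcdf", sep=""))
    pdb <- read.pdb(file.path(path, "top.pdb"))
    sele <- atom.select(pdb, "calpha")
    trj <- read.ncdf(files, at.sel=sele)
    dim(trj)  # rows: frames from all five files; columns: 3 x number of CA atoms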

  5. Lars Skjærven

    you could do something wicked like:

    library(bio3d)
    
    # read in PDB (same atom numbering as netcdf file)
    pdb <- read.pdb("11_build_TIP4P/prot_sol.amb.pdb")
    sele <- atom.select(pdb, "calpha")
    
    # allocate vector and matrix for rmsd and fitted CA crds
    all.rmsd <- rep(NA, 1500)
    ca.trj <- matrix(NA, nrow=1500, ncol=length(sele$xyz))
    
    step <- 10
    for (i in 1:150) {
      first <- (i-1)*step + 1
      last <- first + step - 1
      trj <- read.ncdf("14_production/01_classA/hei.nc", first=first, last=last)
    
      all.rmsd[first:last] <- rmsd(pdb$xyz[sele$xyz], trj[, sele$xyz], fit=TRUE)
      ca.trj[first:last, ] <- fit.xyz(pdb$xyz[sele$xyz], trj[, sele$xyz])
    }
    
    rd <- rmsd(pdb$xyz[sele$xyz], ca.trj)
    
    trj2 <- read.ncdf("14_production/01_classA/hei.nc",
                      at.sel=sele, first=1, last=1500)
    rd2 <- rmsd(pdb$xyz[sele$xyz], trj2, fit=TRUE)
    
    identical(rd, rd2)
    [1] TRUE
    

    with a low "step" you'll avoid some heavy memory usage.
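
    The same idea generalizes to any trajectory length, e.g. a sketch for a 50001-frame file ("trj.netcdf" is a placeholder name; pdb and sele as above). The min() call handles a last chunk shorter than step, and drop=FALSE keeps a one-frame chunk as a matrix:

    nframes <- 50001
    step <- 500
    ca.trj <- matrix(NA, nrow=nframes, ncol=length(sele$xyz))
    for (first in seq(1, nframes, by=step)) {
      last <- min(first + step - 1, nframes)
      trj <- read.ncdf("trj.netcdf", first=first, last=last)
      ca.trj[first:last, ] <- fit.xyz(pdb$xyz[sele$xyz],
                                      trj[, sele$xyz, drop=FALSE])
    }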

  6. rohit_arora reporter

    Dear Lars and Xin-Qiu, my sincere apologies for responding with much delay; I have been busy with other projects. The last 'wicked' solution provided by Lars works perfectly! Thank you very much for your help.
