get.pdb() hangs with gzip=TRUE and ncore=X

Issue #49 resolved
Barry Grant created an issue

I have just observed the following potential bug when using both gzip and ncore options to speed up get.pdb().

# Find open and closed states for lysozyme...
library(bio3d)
pdb <- read.pdb("1hel")

blast <- blast.pdb( pdbseq(pdb) )
hits <- plot(blast, cutoff=160)

##- N.B. Hangs after initial download for unknown reason and needs to be killed...
raw.files <- get.pdb(hits$pdb.id, path ="lys_pdbs", gzip=TRUE, ncore=8)

##- However this works fine
unlink("lys_pdbs/*")
raw.files <- get.pdb(hits$pdb.id, path ="lys_pdbs", gzip=TRUE)

##- This also works fine...
unlink("test/*")
raw.files <- get.pdb(hits$pdb.id, path ="test", ncore=8)


## Also works fine
raw.files <- get.pdb(hits$pdb.id[1:100], path ="test", gzip=TRUE, ncore=8)

## This exits with missing downloads...
unlink("test/*")
raw.files <- get.pdb(hits$pdb.id[201:500], path ="test", gzip=TRUE, ncore=8)
## Warning messages:
## 1: In get.pdb(hits$pdb.id[201:500], path = "test", gzip = TRUE, ncore = 8) :
##  ids should be standard 4 character PDB-IDs: trying first 4 characters...
## 2: In mclapply(1:length(pdb.files), function(k) { :
##  scheduled cores 582 encountered errors in user code, all values of the jobs will be affected

all(file.exists(raw.files))
## FALSE

sum(!file.exists(raw.files))
##[1] 90 # lots of missing files...
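
As a possible workaround for the missing downloads above, the files that are absent after a parallel run could be re-fetched one at a time. This is only a sketch: `retry_missing()` and its `fetch` argument are hypothetical names, not part of bio3d, and the caller decides what `fetch` actually does.

```r
# Hypothetical helper (not part of bio3d): re-fetch, one at a time, any
# file that is still absent after a parallel download.
# `fetch` is a caller-supplied function taking one ID and one target path.
retry_missing <- function(ids, files, fetch) {
  stopifnot(length(ids) == length(files))
  missing <- which(!file.exists(files))
  for (k in missing) {
    # Turn a failure for one ID into a warning so the loop carries on.
    tryCatch(
      fetch(ids[k], files[k]),
      error = function(e) {
        warning("retry failed for ", ids[k], ": ", conditionMessage(e))
      }
    )
  }
  # Return the IDs whose files are still missing after the serial pass.
  ids[!file.exists(files)]
}

# Assumed usage with the objects from this report (serial, no ncore):
# still.missing <- retry_missing(hits$pdb.id, raw.files,
#   function(id, file) get.pdb(id, path = dirname(file), gzip = TRUE))
```

Passing the download step in as a function keeps the helper independent of how get.pdb() names its output files.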

Comments (3)

  1. Xinqiu Yao

    I ran into a similar problem, but the error occurs quite randomly: sometimes all files are downloaded, and sometimes the job finishes with files missing (70 in my case). I suspect this is due to how the PDB server responds (it may have rules on multithreaded downloads), but I am not sure. I will look into it in more detail...

  2. Xinqiu Yao

    I think ncore=8 is still too large for the PDB server. I tried ncore=4 and it seems to work fine.

    system.time(raw.files <- get.pdb(hits$pdb.id, path ="lys_pdbs", gzip=TRUE, ncore=4))
    #   user  system elapsed
    #  1.560   3.383  28.186
    
    all(file.exists(raw.files))
    # TRUE
    
    unlink("lys_pdbs/*")
    system.time(raw.files <- get.pdb(hits$pdb.id, path ="lys_pdbs", gzip=TRUE))
    #   user  system elapsed
    #  1.645   3.287  94.904
    
    all(file.exists(raw.files))
    # TRUE
    

    So, the maximum ncore is now set to 4. Let me know if there is still a problem; if so, we may consider removing the parallel part...

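
The cap on ncore described in the second comment could look roughly like the following. This is a minimal sketch, not the actual bio3d change: `cap_ncore()` and `max_ncore` are hypothetical names, and the ceiling of 4 is taken from the comment.

```r
# Hypothetical helper (not the bio3d implementation): limit the number of
# parallel download workers to a fixed ceiling.
cap_ncore <- function(ncore, max_ncore = 4) {
  ncore <- as.integer(ncore)
  stopifnot(length(ncore) == 1, !is.na(ncore), ncore >= 1)
  if (ncore > max_ncore) {
    # Tell the user that the request was reduced rather than failing.
    warning("ncore reduced from ", ncore, " to ", max_ncore)
    ncore <- as.integer(max_ncore)
  }
  ncore
}

# Assumed usage: a request for 8 workers is reduced to 4 with a warning.
# n <- cap_ncore(8)
```

Warning instead of stopping keeps existing calls with a larger ncore working while still limiting the load on the server.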