InitialModel can't run on a single GPU, gives non-informative error

Issue #20 resolved
Juraj Ahel created an issue

Run from the GUI (RELION built from commit 9a025627f881), submitted via Slurm.

Running the same command with 3 GPUs instead worked fine.

GPU:

[Attached screenshot: Screen Shot 08-28-18 at 05.57 PM.PNG]

Command:

`which relion_refine` --o InitialModel/job022/run --sgd_ini_iter 50 --sgd_inbetween_iter 200 --sgd_fin_iter 50 --sgd_write_iter 10 --sgd_ini_resol 35 --sgd_fin_resol 15 --sgd_ini_subset 100 --sgd_fin_subset 500 --sgd  --denovo_3dref --i Import/job015/particles.star --ctf --K 1 --sym C1 --flatten_solvent  --zero_mask  --dont_combine_weights_via_disc --pool 30 --pad 2  --particle_diameter 350 --oversampling 1 --healpix_order 1 --offset_range 6 --offset_step 4 --j 1 --gpu ""

Error below:

ERROR: no CUDA-capable device is detected in /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/ml_optimiser.cpp at line 1128 (error-code 38)
srun: error: cn-29: task 0: Exited with exit code 1
in: /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/acc/cuda/cuda_settings.h, line 67
=== Backtrace  ===
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x66) [0x4396e6]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine() [0x450820]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine(_ZN11MlOptimiser10initialiseEv+0x49) [0x47a4d9]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine(main+0x33) [0x429eb3]
/lib64/libc.so.6(__libc_start_main+0xf5) [0x2aaab71ddc05]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine() [0x42c1df]
==================
ERROR: 

A GPU-function failed to execute.

 If this occured at the start of a run, you might have GPUs which
are incompatible with either the data or your installation of relion.
If you 

    -> INSTALLED RELION YOURSELF: if you e.g. specified -DCUDA_ARCH=50
       and are trying ot run on a compute 3.5 GPU (-DCUDA_ARCH=3.5), 
       this may happen.

    -> HAVE MULTIPLE GPUS OF DIFFERNT VERSIONS: relion needs GPUS with
       at least compute 3.5. You may be trying to use a GPU older than
       this. If you have multiple generations, try specifying --gpu <X>
       with X=0. Then try X=1 in a new run, and so on. The numbering of
       GPUs may not be obvious from the driver or intuition. For a list
       of GPU compute generations, see 

       en.wikipedia.org/wiki/CUDA#Version_features_and_specifications

    -> ARE USING DOUBLE-PRECISION GPU CODE: relion was been written so
       as to not require this, and may thus have unforeseen requirements
       when run in this mode. If you think it is nonetheless necessary,
       please consult the developers with this error.

If this occurred at the middle or end of a run, it might be that

    -> YOUR DATA OR PARAMETERS WERE UNEXPECTED: execution on GPUs is 
       subject to many restrictions, and relion is written to work within
       common restraints. If you have exotic data or settings, unexpected
       configurations may occur. See also above point regarding 
       double precision.
If none of the above applies, please report the error to the relion
developers at    github.com/3dem/relion/issues

Comments (2)

  1. Björn Forsberg

    Your CUDA runtime is not seeing your GPU. The driver seems OK, since nvidia-smi reports the GPU, but for some reason the runtime doesn't see it. If you can get any process to run on it, RELION will as well. I'd try a reboot, if you haven't already.

  2. Juraj Ahel reporter

    Indeed, in the end it turned out to be an error in the submission script: it wasn't allocating a GPU to the job. Many thanks!
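
    For anyone hitting the same error, a guard near the top of the submission script can surface a missing GPU allocation before relion_refine starts. The following is only a sketch under assumptions: the function name gpu_allocated is hypothetical, and it assumes the cluster's Slurm GPU setup exports CUDA_VISIBLE_DEVICES for jobs that were granted a GPU (and leaves it unset, empty, or set to NoDevFiles otherwise). Sites that configure Slurm differently may need another check.

```shell
#!/bin/sh
# Sketch of a pre-flight check for a Slurm submission script.
# Assumption: Slurm exports CUDA_VISIBLE_DEVICES when a GPU was allocated;
# an unset, empty, or "NoDevFiles" value is treated as "no GPU granted".
# gpu_allocated is a hypothetical helper name, not part of RELION or Slurm.

gpu_allocated() {
    case "${CUDA_VISIBLE_DEVICES:-}" in
        ""|NoDevFiles)
            # Nothing was handed to this job.
            return 1
            ;;
        *)
            # At least one device index or UUID is listed.
            return 0
            ;;
    esac
}

# Intended use inside the submission script, before the RELION command:
#
#   if ! gpu_allocated; then
#       echo "No GPU allocated to this job; check the GPU request (e.g. --gres=gpu:1)." >&2
#       exit 1
#   fi
```

    This only inspects what the scheduler handed to the job; it does not prove that the CUDA runtime can actually open the device, so it complements rather than replaces the checks suggested above.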