GPU error with relion_refine_mpi (3D auto-refine)
Issue #21
resolved
Run from the GUI via Slurm, using a build from commit 9a025627f881 (double precision on CPU, single precision on GPU).
The command fails regardless of the number of GPUs.
All GPUs are the same model, Tesla P100, and the CUDA version is 9.1.85.
The same job with the same data ran successfully, at least up to iteration 2, when using the CPU only.
in: /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/acc/cuda/cuda_settings.h, line 81
in: /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/acc/cuda/cuda_settings.h, line 81
slave 2 encountered error: === Backtrace ===
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x66) [0x43ae26]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0xec) [0x5c331c]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_Z11_threadMainPv+0x3e) [0x48066e]
/lib64/libpthread.so.0(+0x7e25) [0x2aaaaacd6e25]
/lib64/libc.so.6(clone+0x6d) [0x2aaab72b434d]
==================
ERROR:
A GPU-function failed to execute.
Comments (3)
-
Are you sure you are running on a node which has GPUs? There should be a mapping of threads to GPUs in the initial output - what does it say?
-
reporter Hi Bjoern,
it's highly likely it was an error on my side, most probably I ran it on the wrong node. Everything is working now. Sorry, I should have reported back.
Best,
Juraj
-
reporter - changed status to resolved
bug not reproducible
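The question raised in the comments, whether the job's node actually has GPUs, can be checked from inside the same Slurm allocation before launching the refinement. The following is a minimal sketch, not part of RELION: it assumes `nvidia-smi` is on the PATH of GPU nodes and that `nvidia-smi -L` prints one `GPU <index>: <name> (UUID: ...)` line per device. The helper names `parse_gpu_list` and `visible_gpus` are hypothetical.

```python
import os
import re
import shutil
import socket
import subprocess

# Assumed line format of `nvidia-smi -L`, e.g.
# "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: (.+?)\)\s*$")


def parse_gpu_list(text):
    """Return (index, name) pairs parsed from `nvidia-smi -L` style output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def visible_gpus():
    """List GPUs visible on this node; empty if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(
        ["nvidia-smi", "-L"], capture_output=True, text=True
    )
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    # Print the node name so a job that landed on the wrong node is obvious.
    print("host:", socket.gethostname())
    # Slurm typically sets this variable when GPUs are requested for the job.
    print("CUDA_VISIBLE_DEVICES:",
          os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    gpus = visible_gpus()
    if not gpus:
        print("no GPUs visible on this node")
    for index, name in gpus:
        print(f"GPU {index}: {name}")
```

Running this as the first step of the job script, on the same node and with the same resource request as the failing job, would show right away whether the job was scheduled onto a node without GPUs, which is what the reporter suspects happened here.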