GPU error with relion_refine_mpi (3D auto-refine)

Issue #21 resolved
Juraj Ahel created an issue

Run from the GUI via Slurm, using a build from commit 9a025627f881. Double precision CPU, single GPU.

The command fails regardless of the number of GPUs.

All GPUs are the same model, Tesla P100, and the CUDA version is 9.1.85.

The same job with the same data ran successfully at least to iteration 2 when using CPU.

in: /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/acc/cuda/cuda_settings.h, line 81
in: /software/build-tmp/RELION/ja180825_3.0_beta-9a025627f881/foss-2017a-CUDA-9.1.85/scheres-relion-3.0_beta-9a025627f881/src/acc/cuda/cuda_settings.h, line 81
slave 2 encountered error: === Backtrace  ===
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_ZN11RelionErrorC2ERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEES7_l+0x66) [0x43ae26]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_Z36globalThreadExpectationSomeParticlesR14ThreadArgument+0xec) [0x5c331c]
/software/171020/software/relion/ja180825_3.0_beta-9a025627f881-foss-2017a-cuda-9.1.85/bin/relion_refine_mpi(_Z11_threadMainPv+0x3e) [0x48066e]
/lib64/libpthread.so.0(+0x7e25) [0x2aaaaacd6e25]
/lib64/libc.so.6(clone+0x6d) [0x2aaab72b434d]
==================

ERROR: 

A GPU-function failed to execute.

Comments (3)

  1. Björn Forsberg

    Are you sure you are running on a node which has GPUs? There should be a mapping of threads to GPUs in the initial output - what does it say?

  2. Juraj Ahel reporter

    Hi Bjoern,

    It's highly likely it was an error on my side; most probably I ran it on the wrong node. Everything is working now. Sorry, I should have reported back.

    Best,

    Juraj

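Since the failure here was traced to running on a node without GPUs, a quick pre-flight check inside the Slurm job can catch this before relion_refine_mpi starts. The following is only a minimal sketch, not part of RELION: the function names `parse_gpu_list` and `visible_gpus` are hypothetical, and it assumes `nvidia-smi` is on the PATH of GPU nodes and that `nvidia-smi -L` prints one `GPU <index>: <name> (UUID: ...)` line per device.

```python
import os
import re
import shutil
import subprocess

# Matches lines such as "GPU 0: Tesla P100-PCIE-16GB (UUID: GPU-...)"
GPU_LINE = re.compile(r"^GPU (\d+): (.+?) \(UUID: ")


def parse_gpu_list(text):
    """Return [(index, name), ...] parsed from `nvidia-smi -L` style output."""
    gpus = []
    for line in text.splitlines():
        match = GPU_LINE.match(line.strip())
        if match:
            gpus.append((int(match.group(1)), match.group(2)))
    return gpus


def visible_gpus():
    """List GPUs visible on this node; empty if nvidia-smi is missing or fails."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True, text=True)
    if result.returncode != 0:
        return []
    return parse_gpu_list(result.stdout)


if __name__ == "__main__":
    # Often set by the scheduler when GPUs are requested (assumption about the site setup).
    print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES", "<unset>"))
    gpus = visible_gpus()
    if not gpus:
        print("No NVIDIA GPU visible on this node")
    for index, name in gpus:
        print(f"GPU {index}: {name}")
```

Running this at the top of the submission script, on the same node as the refinement, shows whether any device is visible at all. It complements, rather than replaces, the thread-to-GPU mapping that RELION prints in its initial output.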