simfactory rev 1677 fails to start on lonestar
It seems as if
module load TACC cuda cuda_SDK
fails with a warning, after which simfactory does not continue. My suspicion is that cuda is not available for OpenMPI (the currently selected MPI stack), the module command then returns an error, and simfactory aborts when the env command does not succeed. The output I get is:
==> qc0_cuda.err <==
+ cd /work/00945/rhaas/ET_trunk
+ /work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0
Inactive Modules:
1) cuda 2) cuda_SDK
Lmod Warning: Did not find: cuda cuda_SDK
Try: "module spider cuda cuda_SDK"
==> qc0_cuda.out <==
TACC: Setting memory limits for job 842190 to unlimited KB
TACC: Dumping job script:
--------------------------------------------------------------------------------
#! /bin/bash
#$ -A TG-PHY100033
#$ -q normal
#$ -r n
#$ -l h_rt=0:15:00
#$ -pe 2way 36
#$
#$ -V
#$ -N qc0_cuda-0000
#$ -M rhaas
#$ -m abe
#$ -o /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.out
#$ -e /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.err
set -x
cd /work/00945/rhaas/ET_trunk
/work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0
--------------------------------------------------------------------------------
TACC: Done.
Simulation name: qc0_cuda
Running simulation qc0_cuda
Mon Nov 5 23:03:27 CST 2012
Simfactory Done at date: 0
TACC: Cleaning up after job: 842190
TACC: Done.
Removing this command from envsetup lets me run. However, since I do not run CUDA (the commit claims OpenCL, which also seems wrong), would someone who uses OpenCL/CUDA on lonestar like to suggest an alternative?
Comments (9)
-
-
- changed milestone to ET_2012_11
- removed comment
-
reporter - removed comment
I do not get this error on the head node, only in the envsetup that runs on the staging node. In particular, if I add a "module avail" to envsetup then the output does not contain cuda:
envsetup = source /etc/profile.d/tacc_modules.sh && module avail && module unload mvapich2 && module unload openmpi && module load TACC cuda cuda_SDK
Is it possible that we should not try to change the modules on the compute nodes but instead only on the head node before submission/compilation?
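One way to make the CUDA part of envsetup non-fatal on nodes that lack the module would be to guard the load. This is only a hypothetical sketch; the `module` shell function below is a stub standing in for Lmod so the guard logic is runnable on its own, and on lonestar the real command would be used instead:

```shell
#!/bin/sh
# Hypothetical guard for envsetup: only request cuda/cuda_SDK when
# `module avail` actually lists cuda on the current node.
# `module` here is a stub standing in for Lmod, so the sketch is
# self-contained; it deliberately reports no cuda module.
module() {
    case "$1" in
        avail) echo "mvapich2 openmpi" ;;   # no cuda on this node
        load)  shift; echo "loading: $*" ;;
    esac
}

if module avail 2>&1 | grep -qw cuda; then
    LOADED="TACC cuda cuda_SDK"
else
    LOADED="TACC"   # fall back: skip the CUDA modules
fi
module load $LOADED
```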
-
- removed comment
We need to load the same modules all the time, to ensure that the executable has access to the same shared libraries. Since some shared libraries may depend on other shared libraries, it is otherwise very difficult to ensure that the right shared libraries are found.
I suggest not loading CUDA by default, and adding a second option list that loads CUDA as well. Maybe CUDA is only available on some of the compute nodes, and one needs to request these explicitly when submitting a job.
-
reporter - changed status to resolved
- removed comment
ok. I'll add a comment to lonestar.ini on how to enable cuda. Right now we always seem to have exactly one machine.ini file per supported machine, not several depending on the chosen options. The error does indeed go away if I submit to the gpu queue (--queue gpu).
-
- changed status to open
- marked as
- removed milestone
- removed comment
Not quite; e.g. on Datura (AEI) we also have multiple option lists, depending on whether the GPU is used or not. I suggest adding a second option list "lonestar-gpu", which would allow people to switch back and forth without modifying the option list and submit script.
-
reporter - removed comment
Alright. I created a copy of lonestar.ini as lonestar-gpu.ini with cuda enabled and the default queue changed to gpu.
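For reference, the split might look roughly like this; the key names and values below are a sketch based on this thread, not a verified copy of the rev 1843 file:

```ini
; lonestar-gpu.ini -- hypothetical sketch of the deltas relative to
; lonestar.ini, inferred from this thread (not copied from rev 1843)
[lonestar-gpu]
queue    = gpu
envsetup = source /etc/profile.d/tacc_modules.sh && module unload mvapich2 && module unload openmpi && module load TACC cuda cuda_SDK
```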
-
reporter - changed status to resolved
- removed comment
Split lonestar.ini in simfactory rev 1843.
-
reporter - changed status to closed
- edited description
It is possible to load openmpi and cuda together. However,
module load TACC
loads mvapich2. So, if we want to use openmpi we have to explicitly unload mvapich2 after loading TACC.
However, I never managed to reproduce the 'Did not find: cuda cuda_SDK' error, so I leave further testing to Roland.
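The required ordering can be sketched as follows; the `module` function is again a stub that mimics the behaviour described above (loading TACC implicitly pulls in mvapich2), not the real Lmod command:

```shell
#!/bin/sh
# Stub mimicking the Lmod behaviour described above: loading TACC
# implicitly loads mvapich2 as well. Not the real `module` command.
MODULES=""
module() {
    case "$1" in
        load)   MODULES="$MODULES $2"
                if [ "$2" = "TACC" ]; then MODULES="$MODULES mvapich2"; fi ;;
        unload) MODULES=$(echo " $MODULES " | sed "s/ $2 / /") ;;
    esac
}

module load TACC          # implicitly loads mvapich2 as well
module unload mvapich2    # must come *after* loading TACC
module load openmpi
module load cuda
echo "active modules:$MODULES"
```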