simfactory rev 1677 fails to start on lonestar

Issue #1168 closed

Roland Haas created an issue 2012-11-05

It seems as if

module load TACC cuda cuda_SDK

fails with the warning "Lmod Warning: Did not find: cuda cuda_SDK" (full output below), after which simfactory fails to continue. My suspicion is that cuda is not available for OpenMPI (which is the selected MPI stack), so the module command returns an error, and simfactory aborts if the env command does not succeed. The output I get is:

==> qc0_cuda.err <==
+ cd /work/00945/rhaas/ET_trunk
+ /work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0
Inactive Modules:
  1) cuda     2) cuda_SDK

Lmod Warning: Did not find: cuda cuda_SDK

Try: "module spider cuda cuda_SDK"

==> qc0_cuda.out <==
TACC: Setting memory limits for job 842190 to unlimited KB
TACC: Dumping job script:
--------------------------------------------------------------------------------
#! /bin/bash
#$ -A TG-PHY100033
#$ -q normal
#$ -r n
#$ -l h_rt=0:15:00
#$ -pe 2way 36
#$ 
#$ -V
#$ -N qc0_cuda-0000
#$ -M rhaas
#$ -m abe
#$ -o /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.out
#$ -e /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.err
set -x
cd /work/00945/rhaas/ET_trunk
/work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0 
--------------------------------------------------------------------------------
TACC: Done.
Simulation name: qc0_cuda
Running simulation qc0_cuda
Mon Nov  5 23:03:27 CST 2012
Simfactory Done at date: 0
TACC: Cleaning up after job: 842190
TACC: Done.

Removing this command from envsetup lets me run. However, since I do not run CUDA (the commit claims OpenCL, which also seems wrong), could someone who uses OpenCL/CUDA on lonestar suggest an alternative?

Comments (9)

Frank Löffler

It is possible to load openmpi and cuda:

$ module list
Currently Loaded Modules:
  1) mkl/10.3      4) Linux            7) intel/11.1     10) openmpi/1.4.3
  2) TACC          5) cluster          8) gzip/1.3.12    11) cuda/5.0
  3) TACC-paths    6) cluster-paths    9) tar/1.22       12) cuda_SDK/5.0

However, module load TACC loads mvapich2:

$ module load TACC cuda cuda_SDK
$ module list
Currently Loaded Modules:
  1) mkl/10.3      4) Linux            7) intel/11.1      10) tar/1.22
  2) TACC          5) cluster          8) mvapich2/1.6    11) cuda/5.0
  3) TACC-paths    6) cluster-paths    9) gzip/1.3.12     12) cuda_SDK/5.0

So, if we want to use openmpi, we have to explicitly unload mvapich2 after loading TACC.

However, I never managed to get a 'Did not find: cuda cuda_SDK' error, so I leave further testing to Roland.

2012-11-05T23:38:52+00:00
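The module sequence described in the comment above can be sketched as a small shell function. This is only a sketch under assumptions: the `module` function must already be defined in the shell, the OpenMPI module is assumed to be named `openmpi` (as in the first listing), and the function name `select_openmpi_with_cuda` is made up for illustration.

```shell
# Sketch: replace the default MPI stack with OpenMPI, then load CUDA.
# Assumes the `module` shell function is already available in this shell.
# The function stops at the first step that fails and returns its status.
select_openmpi_with_cuda() {
    module load TACC &&            # per the listing above, this also loads mvapich2
    module unload mvapich2 &&      # drop the default MPI stack
    module load openmpi &&         # select OpenMPI instead
    module load cuda cuda_SDK      # fails where these modules are not visible
}
```

Calling `select_openmpi_with_cuda && module list` would then show the resulting module set, provided every step succeeded.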

Frank Löffler
- changed milestone to ET_2012_11
- 2012-11-05T23:40:11+00:00
Roland Haas reporter
I do not get this error on the head node, only in the envsetup that runs on the staging node. In particular, if I add a "module avail" to envsetup then the output does not contain cuda:
```
envsetup        = source /etc/profile.d/tacc_modules.sh && module avail && module unload mvapich2 && module unload openmpi && module load TACC cuda cuda_SDK
```
Is it possible that we should not try to change the modules on the compute nodes but instead only on the head node before submission/compilation?
- 2012-11-06T10:04:18+00:00
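To see on which nodes the cuda modules are actually visible, a check along the following lines could be dropped into a job script. It is a rough sketch, not part of simfactory: the function name `cuda_module_visible` is hypothetical, and it assumes that `module avail` prints its listing on stderr and names the module in the form `cuda/<version>`, as in the listings above.

```shell
# Sketch: report whether a cuda module is visible on the current node.
# Assumes the `module` shell function is already defined.
cuda_module_visible() {
    # Merge stderr into stdout so the listing can be searched either way.
    module avail cuda 2>&1 | grep -q 'cuda/'
}

# Example use (not executed here):
#   if cuda_module_visible; then echo "cuda visible on $(hostname)"; fi
```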
Erik Schnetter
We need to load the same modules all the time, to ensure that the executable has access to the same shared libraries. Since some shared libraries may depend on other shared libraries, it is otherwise very difficult to ensure that the right shared libraries are found.

I suggest not loading CUDA by default, and adding a second option list that loads CUDA as well. Maybe CUDA is only available on some of the compute nodes, and one needs to explicitly request these when submitting a job.
- 2012-11-06T10:07:20+00:00
Roland Haas reporter
- changed status to resolved
OK. I'll add a comment to lonestar.ini on how to enable cuda. Right now we always seem to have precisely one machine.ini file per supported machine, not several depending on the options chosen. The error does indeed go away if I submit to the gpu queue (--queue gpu).
- 2012-11-06T10:38:42+00:00
Erik Schnetter
- changed status to open
- marked as
- removed milestone
Not quite; e.g. on Datura (AEI) we also have multiple option lists, depending on whether the GPU is used or not. I suggest adding a second option list "lonestar-gpu", which would allow people to switch back and forth without modifying the option list and submit script.
- 2012-11-06T10:57:19+00:00
Roland Haas reporter
Alright. I created a copy of lonestar.ini as lonestar-gpu.ini, with cuda enabled and the default queue changed to gpu.
- 2012-11-06T11:07:28+00:00
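With two machine files, switching between CPU and GPU runs comes down to choosing the machine entry at submit time. The helper below is a hypothetical illustration only: the name `machine_for_target` is invented, and it assumes the machine entries are called `lonestar` and `lonestar-gpu`, matching the two .ini file names.

```shell
# Hypothetical helper: map a target ("cpu" or "gpu") to a simfactory machine entry.
# The entry names are assumed to match lonestar.ini and lonestar-gpu.ini.
machine_for_target() {
    case "$1" in
        cpu) echo "lonestar" ;;
        gpu) echo "lonestar-gpu" ;;
        *)   echo "unknown target: $1" >&2; return 1 ;;
    esac
}

# Example use (not executed here; the remaining arguments depend on the simulation):
#   simfactory/bin/sim submit qc0_cuda --machine="$(machine_for_target gpu)"
```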
Roland Haas reporter
- changed status to resolved
Split lonestar.ini in simfactory rev 1843.
- 2012-11-06T11:30:55+00:00
Roland Haas reporter
- changed status to closed
- edited description
- 2019-02-21T19:52:18+00:00

Assignee: Erik Schnetter

Type: bug

Priority: minor

Status: closed

Component: SimFactory

Milestone: –

Version: –
