simfactory rev 1677 fails to start on lonestar

Issue #1168 closed
Roland Haas created an issue

It seems as if

module load TACC cuda cuda_SDK

fails with a warning (see the qc0_cuda.err output below), after which simfactory fails to continue. My suspicion is that cuda is not available for OpenMPI (which is the selected MPI stack), so the module command returns an error, and simfactory aborts because the envsetup command does not succeed. The output I get is:

==> qc0_cuda.err <==
+ cd /work/00945/rhaas/ET_trunk
+ /work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0
Inactive Modules:
  1) cuda     2) cuda_SDK

Lmod Warning: Did not find: cuda cuda_SDK

Try: "module spider cuda cuda_SDK"

==> qc0_cuda.out <==
TACC: Setting memory limits for job 842190 to unlimited KB
TACC: Dumping job script:
--------------------------------------------------------------------------------
#! /bin/bash
#$ -A TG-PHY100033
#$ -q normal
#$ -r n
#$ -l h_rt=0:15:00
#$ -pe 2way 36
#$ 
#$ -V
#$ -N qc0_cuda-0000
#$ -M rhaas
#$ -m abe
#$ -o /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.out
#$ -e /scratch/00945/rhaas/simulations/qc0_cuda/output-0000/qc0_cuda.err
set -x
cd /work/00945/rhaas/ET_trunk
/work/00945/rhaas/ET_trunk/simfactory/bin/sim run qc0_cuda --machine=lonestar --restart-id=0 
--------------------------------------------------------------------------------
TACC: Done.
Simulation name: qc0_cuda
Running simulation qc0_cuda
Mon Nov  5 23:03:27 CST 2012
Simfactory Done at date: 0
TACC: Cleaning up after job: 842190
TACC: Done.

Removing this command from envsetup lets me run. However, since I do not run CUDA (the commit claims OpenCL, which also seems wrong), could someone who uses OpenCL/CUDA on lonestar suggest an alternative?


Comments (9)

  1. Frank Löffler

    It is possible to load openmpi and cuda:

    $ module list
    Currently Loaded Modules:
      1) mkl/10.3      4) Linux            7) intel/11.1     10) openmpi/1.4.3
      2) TACC          5) cluster          8) gzip/1.3.12    11) cuda/5.0
      3) TACC-paths    6) cluster-paths    9) tar/1.22       12) cuda_SDK/5.0
    

    However, module load TACC loads mvapich2:

    $ module load TACC cuda cuda_SDK
    $ module list
    Currently Loaded Modules:
      1) mkl/10.3      4) Linux            7) intel/11.1      10) tar/1.22
      2) TACC          5) cluster          8) mvapich2/1.6    11) cuda/5.0
      3) TACC-paths    6) cluster-paths    9) gzip/1.3.12     12) cuda_SDK/5.0
    

    So, if we want to use openmpi, we have to explicitly unload mvapich2 after loading TACC.

    However, I never managed to get a 'Did not find: cuda cuda_SDK' error, so I leave further testing to Roland.
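
The ordering issue described in this comment can be sketched as a small shell function. This is a sketch only, assuming an Lmod-style `module` command is already defined in the shell; the function name `load_openmpi_cuda` is hypothetical and not part of simfactory.

```shell
# Hypothetical helper (not part of simfactory): load TACC first, then
# replace the MPI stack it pulls in, then load the CUDA modules.
# Assumes a `module` command (e.g. from Lmod) is defined in this shell.
load_openmpi_cuda() {
    # TACC loads mvapich2 as a side effect, so it has to come first.
    module load TACC || return 1
    # Unload the MPI stack that TACC selected.
    module unload mvapich2 || return 1
    # Load the desired MPI stack together with the CUDA modules.
    module load openmpi cuda cuda_SDK || return 1
}
```

Unloading mvapich2 before `module load TACC` would not help, because TACC would load it again.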

  2. Roland Haas reporter

    I do not get this error on the head node, only in the envsetup that runs on the staging node. In particular, if I add a "module avail" to envsetup then the output does not contain cuda:

    envsetup        = source /etc/profile.d/tacc_modules.sh && module avail && module unload mvapich2 && module unload openmpi && module load TACC cuda cuda_SDK
    

    Is it possible that we should not try to change the modules on the compute nodes but instead only on the head node before submission/compilation?
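
One way to make this check explicit is to test a `module avail` listing before loading anything. The helper below is a hypothetical sketch (the name `modules_listed` is not part of Lmod or simfactory); it only inspects text on stdin, so it can be tried on any node. It assumes plain module names without regular-expression metacharacters.

```shell
# Hypothetical helper: succeed only if every named module appears in a
# `module avail`- or `module list`-style listing supplied on stdin.
modules_listed() {
    local listing name
    listing=$(cat)
    for name in "$@"; do
        # Split the listing into one token per line, then look for
        # "name" or "name/version" at the start of a token, so that
        # "cuda" does not match "cuda_SDK/5.0".
        if ! printf '%s\n' "$listing" | tr -s ' \t' '\n' \
                | grep -Eq "^${name}(/|\$)"; then
            return 1
        fi
    done
    return 0
}
```

On a node with Lmod it might be used as `module avail 2>&1 | modules_listed cuda cuda_SDK && module load cuda cuda_SDK`; the `2>&1` is there on the assumption that the listing is written to stderr.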

  3. Erik Schnetter

    We need to load the same modules all the time, to ensure that the executable has access to the same shared libraries. Since some shared libraries may depend on other shared libraries, it is otherwise very difficult to ensure that the right shared libraries are found.

    I suggest not loading CUDA by default, and adding a second option list that loads CUDA as well. Maybe CUDA is only available on some of the compute nodes, and one needs to explicitly request these when submitting a job.

  4. Roland Haas reporter
    • changed status to resolved

    OK. I'll add a comment to lonestar.ini on how to enable cuda. Right now we always seem to have precisely one machine.ini file per supported machine, not several depending on the options chosen. The error does indeed go away if I submit to the gpu queue (--queue gpu).

  5. Erik Schnetter
    • changed status to open
    • removed milestone

    Not quite; e.g. on Datura (AEI) we also have multiple option lists, depending on whether the GPU is used or not. I suggest adding a second option list "lonestar-gpu", which would allow people to switch back and forth without modifying the option list and submit script.

  6. Roland Haas reporter

    Alright. Created a copy of lonestar.ini in lonestar-gpu.ini with cuda enabled and the default queue changed to gpu.
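
The copy step might look roughly like the following. This is a sketch under assumptions, not the actual commit: it assumes the machine file has a `[lonestar]` section header and `nickname = ...` and `queue = ...` lines, and the function name `make_gpu_ini` is hypothetical. Enabling cuda in `envsetup` is left as a manual edit, since the exact line in the file is not shown in this thread.

```shell
# Hypothetical helper: derive a GPU variant of a machine ini file.
# Assumed layout (not confirmed by this thread): a "[lonestar]" section
# header plus "nickname = ..." and "queue = ..." entries.
make_gpu_ini() {
    # $1: source ini file, $2: destination ini file
    sed -e 's/^\[lonestar\]/[lonestar-gpu]/' \
        -e 's/^\(nickname[[:space:]]*=[[:space:]]*\)lonestar[[:space:]]*$/\1lonestar-gpu/' \
        -e 's/^\(queue[[:space:]]*=[[:space:]]*\).*$/\1gpu/' \
        "$1" > "$2"
}
```

For example, `make_gpu_ini lonestar.ini lonestar-gpu.ini` leaves the original file untouched and writes the modified copy next to it.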
