Inconsistency in Simfactory submission scripts

Issue #2482 new
Steven R. Brandt created an issue

Note that the following are all SLURM machines. Most of them configure --ntasks-per-node=@NODE_PROCS@, which seems to match the simfactory documentation.

MarconiA3 uses @(@NUM_PROCS@/@NODES@)@, which ought to be the same as @NODE_PROCS@. However, sciama and comet seem to use a wrong value (@PPN@ and @PPN_USED@, respectively) which might only work some of the time.

repos/simfactory2/mdb/submitscripts/minerva.sub:5: #SBATCH --ntasks-per-node=@NODE_PROCS@
repos/simfactory2/mdb/submitscripts/holodeck.sub:6: #SBATCH --ntasks-per-node=@NODE_PROCS@
repos/simfactory2/mdb/submitscripts/sciama.sub:5: #SBATCH --tasks-per-node=@PPN@
repos/simfactory2/mdb/submitscripts/comet.sub:7: #SBATCH --ntasks-per-node @PPN_USED@
repos/simfactory2/mdb/submitscripts/draco.sub:6: #SBATCH --ntasks-per-node=@NODE_PROCS@
repos/simfactory2/mdb/submitscripts/marconiA3.sub:8: #SBATCH --ntasks-per-node @(@NUM_PROCS@/@NODES@)@
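
To make the difference concrete, here is a rough illustration (my reading of the variable names, not taken verbatim from the simfactory documentation: @NUM_PROCS@ = total MPI ranks, @NODE_PROCS@ = MPI ranks per node, @PPN@/@PPN_USED@ = cores per node) of what these templates would expand to for a hypothetical hybrid job with 2 nodes, 24 cores per node, and 6 OpenMP threads per rank:

# Hypothetical job: 2 nodes, 24 cores/node, 6 OpenMP threads per MPI rank
#   @NODES@ -> 2, @NUM_PROCS@ -> 8, @NODE_PROCS@ -> 4, @PPN@ / @PPN_USED@ -> 24
#SBATCH --ntasks-per-node=4     # minerva, holodeck, draco (@NODE_PROCS@)
#SBATCH --tasks-per-node=24     # sciama (@PPN@)
#SBATCH --ntasks-per-node 24    # comet (@PPN_USED@)
#SBATCH --ntasks-per-node 4     # marconiA3 (@(@NUM_PROCS@/@NODES@)@, assuming @NUM_SMT@=1)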

Comments (15)

  1. Steven R. Brandt reporter

    Correction, @(@NUM_PROCS@/@NODES@)@ is only @NODE_PROCS@ when @NUM_SMT@=1. Maybe that’s being assumed?

  2. Erik Schnetter

    NUM_SMT was introduced late in the game. On many systems it is always 1. It is quite likely that there are old machine files that don’t take NUM_SMT into account, or which were copied to create newer machine files that then have the same problem.

  3. Roland Haas

    SLURM’s notion of a “task” is site specific. E.g. for Comet the SDSC docs https://www.sdsc.edu/support/user_guides/comet.html#running give the following for a hybrid MPI/OpenMP job:

    #!/bin/bash
    #SBATCH --job-name="hellohybrid"
    #SBATCH --output="hellohybrid.%j.%N.out"
    #SBATCH --partition=compute
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=24
    #SBATCH --export=ALL
    #SBATCH -t 01:30:00
    
    #This job runs with 2 nodes, 24 cores per node for a total of 48 cores.
    # We use 8 MPI tasks and 6 OpenMP threads per MPI task
    
    export OMP_NUM_THREADS=6
    ibrun --npernode 4 ./hello_hybrid 
    

    i.e. here --ntasks-per-node is the number of threads that will be started per node, not the number of MPI ranks.
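
    To spell out the arithmetic from the SDSC example (my reading of it, not an official statement): with --ntasks-per-node counting cores/threads, the number of MPI ranks per node follows from dividing by the OpenMP thread count, which is exactly what the --npernode 4 argument to ibrun encodes:

    # Numbers from the SDSC hybrid example above
    # cores per node requested:    --ntasks-per-node=24
    # OpenMP threads per MPI rank: OMP_NUM_THREADS=6
    # MPI ranks per node:          24 / 6 = 4   (hence "ibrun --npernode 4")
    # MPI ranks in total:          2 nodes * 4 = 8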

    On Stampede2, on the other hand, the docs at https://portal.tacc.utexas.edu/user-guides/stampede2#job-scripts have a hybrid script that reads:

    #SBATCH -J myjob           # Job name
    #SBATCH -o myjob.o%j       # Name of stdout output file
    #SBATCH -e myjob.e%j       # Name of stderr error file
    #SBATCH -p skx-normal      # Queue (partition) name
    #SBATCH -N 10              # Total # of nodes 
    #SBATCH -n 40              # Total # of mpi tasks
    #SBATCH -t 01:30:00        # Run time (hh:mm:ss)
    #SBATCH --mail-user=username@tacc.utexas.edu
    #SBATCH --mail-type=all    # Send email at begin and end of job
    #SBATCH -A myproject       # Allocation name (req'd if you have more than 1)
    

    where -n is the short sbatch (https://slurm.schedmd.com/sbatch.html) option for --ntasks, but TACC explicitly documents it as counting MPI ranks. In TACC’s table https://portal.tacc.utexas.edu/user-guides/stampede2#table6 they also state for --ntasks-per-node: “This is MPI tasks per node.”

    So the inconsistency in the simfactory files is driven by inconsistency among the clusters themselves.

  4. Steven R. Brandt reporter

    @Roland Haas I don’t think so. The data you quote from Comet is for a hybrid MPI/OpenMP job. If you look at the basic OpenMP job, you will find:

    Basic OpenMP Job
    #!/bin/bash
    #SBATCH --job-name="hello_openmp"
    #SBATCH --output="hello_openmp.%j.%N.out"
    #SBATCH --partition=compute
    #SBATCH --nodes=1
    #SBATCH --ntasks-per-node=24
    #SBATCH --export=ALL
    #SBATCH -t 01:30:00
    
    #SET the number of openmp threads
    export OMP_NUM_THREADS=24
    
    #Run the job using mpirun_rsh
    ./hello_openmp 
    

  5. Steven R. Brandt reporter

    Sorry, my last response didn’t come out right. What I think happens is this:

    If we set -N 2 -n 6 --ntasks-per-node=4, then SLURM sets SLURM_TASKS_PER_NODE=4(x2) and SLURM_NTASKS=6, and it’s up to the application to make sense of it.

    MPICH runs 8 processes, 4 on each of the 2 nodes (the number of nodes in SLURM_NODELIST). Of course, one is free to pass explicit parameters to MPI to override these values.

    What I think we want simfactory to do is let MPICH (or OpenMPI) interpret the environment variables as they see fit. That means --ntasks-per-node should be @NUM_PROCS@, I believe.
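
    For reference, this is what I mean by passing explicit parameters to MPI instead of relying on the SLURM variables (just a sketch, not tested on the clusters in question; ./a.out stands in for the real executable):

    # MPICH (Hydra launcher): 8 ranks in total, 4 per node
    mpiexec -n 8 -ppn 4 ./a.out

    # OpenMPI: 8 ranks in total, 4 per node
    mpirun -np 8 --map-by ppr:4:node ./a.out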

  6. Roland Haas

    That seems incorrect to me. @NUM_PROCS@ is the total number of MPI ranks, while --ntasks-per-node is a per-node quantity (no matter whether it counts MPI ranks or threads/cores).

  7. Steven R. Brandt reporter

    Sorry @Roland Haas, I meant to type @NODE_PROCS@. 😛 So I am, apparently, adding to the confusion.

  8. Roland Haas

    Responding to https://bitbucket.org/einsteintoolkit/tickets/issues/2482/inconsistency-in-simfactory-submission#comment-59534195: I picked a hybrid job that uses OpenMP and MPI because, for a pure MPI job, the number of threads/cores is the same as the number of MPI ranks, so there would be no difference in the --ntasks-per-node value no matter whether it counted MPI ranks or threads.

    For the hybrid job shown, though, there are 4 MPI ranks per node (2 nodes, 8 ranks total), but --ntasks-per-node is set to 24 and not to 4, which would be @NODE_PROCS@.

    Hence my statement that not all SLURM clusters use SLURM’s --ntasks-per-node option (or, by extension, “tasks”) in the same way, and thus one must expect inconsistencies between the submit and run scripts for different SLURM clusters. These inconsistencies are induced by the clusters themselves being different.
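
    Side by side, for the same hypothetical 2-node job with 8 MPI ranks and 6 OpenMP threads per rank, the two documented conventions give different submit lines (this is just a condensed restatement of the two examples quoted above, not something I have re-tested):

    # Comet convention: --ntasks-per-node counts cores/threads
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=24    # 4 ranks * 6 threads per node
    # the rank layout then goes to the launcher: ibrun --npernode 4

    # Stampede2 convention: -n counts MPI ranks
    #SBATCH -N 2                    # nodes
    #SBATCH -n 8                    # total MPI ranks, i.e. 4 per node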

  9. Steven R. Brandt reporter

    I’m a little unclear about what is different about the clusters. In particular, I’d be interested in knowing if the behavior of this script is different. I expect that SLURM sets its environment variables the same way everywhere and that mpirun (I used MPICH) will interpret those variables in a consistent way. The script below prints a hostname 8 times, 4 from one node and 4 from another. Does this basic behavior change by cluster?

    #!/bin/bash
    #SBATCH -N 2 -n 6
    #SBATCH --partition=checkpt
    #SBATCH --ntasks-per-node=4
    
    env | grep SLURM > slurm-env.txt
    # SLURM_TASKS_PER_NODE has format NUM1(xNUM2)
    # SLURM_NTASKS is the product of NUM1 and NUM2
    # SLURM_NNODES is NUM2
    # SLURM_JOB_NUM_NODES is NUM2
    TASKS_PER_NODE=$(echo ${SLURM_TASKS_PER_NODE} | cut -f1 -d\()
    echo "SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
    echo "TASKS_PER_NODE=$TASKS_PER_NODE"
    echo "SLURM_NTASKS=$SLURM_NTASKS"
    echo "SLURM_JOB_NUM_NODES=$SLURM_JOB_NUM_NODES"
    
    mpirun hostname
    

  10. Roland Haas

    The first issue with this script would be that on the two clusters (Stampede2 and Comet) you are not supposed to use mpirun but instead their own ibrun tool. So there’s one difference.

    This tool then potentially (who knows, it’s a black box) looks at what --tasks-per-node is set to and does … something.

    Basically: if the docs say to do “A”, then you have to do “A” and not “B”, even if “B” is what you would do on the cluster you are testing this on.
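
    If you want to repeat your experiment on Comet, here is a sketch of what the equivalent test might look like there (assuming ibrun accepts --npernode exactly as in the SDSC example quoted above; I have not run this myself):

    #!/bin/bash
    #SBATCH --partition=compute
    #SBATCH --nodes=2
    #SBATCH --ntasks-per-node=24    # cores per node, per the SDSC docs
    #SBATCH -t 00:05:00

    env | grep SLURM > slurm-env.txt
    # 4 MPI ranks per node, as in the SDSC hybrid example
    ibrun --npernode 4 hostname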

  11. Roland Haas

    That said, I am very happy to add you to my allocations on Comet and Stampede2 if you want to try to dig into what may be going on and how those two differ from each other and from a “vanilla” SLURM setup. The files in simfactory right now look the way they do because, over a couple of ET releases, they were found to (mostly) work, so I was not going to change them just to make them look more similar to each other or to how I would think a standard SLURM setup should work.

  12. Steven R. Brandt reporter

    I would be interested in knowing what ibrun does in this script (if it is used in place of mpirun). Docs are great, but they can be wrong, out of date, or misleading.

  13. Roland Haas

    I added your sbrandt XSEDE account to my allocations on Bridges, Stampede2 and Comet.

    TACC uses two-factor authentication of their own making. You have to first obtain a password for your TACC account from https://portal.tacc.utexas.edu/password-reset, which will offer to set up TACC's two-factor authentication. You can try using TACC's own app or the Google Authenticator for Android or iOS (only tested with Android so far). Whenever you use simfactory you must (well, if you want to use the Skylake nodes and not the Knights Landing ones) pass a --machine stampede2-skx option to it to ensure that the “Skylake” section of Stampede2 is used and not the “Knights Landing” one. Alternatively, you can create a file $HOME/.hostname on Stampede2 and put the text stampede2-skx in it; this will cause simfactory to use the Skylake section by default. If you are on multiple XSEDE allocations on Stampede2 you can use the --allocation option of simfactory to select which one to charge.

    You may have to wait 1-2 business days for your $HOME to be created; logging in earlier may result in you being dropped into / with a message along the lines of “you do not exist” if you are unlucky. This fixes itself once $HOME exists.

  14. Roland Haas

    The thing about documentation is that if one does not follow it and things don’t work then one can hardly complain.
