Running Sunrise batch jobs

When processing simulations with Sunrise, large numbers of batch jobs need to be set up, started, and monitored, which can be a chore. There is a tool called mcrxrun that makes this easier.

Prerequisites

mcrxrun assumes that you have a directory with simulation snapshots that you want to process, and that those snapshots are named snapshot_nnn, possibly with a .hdf5 suffix. It also assumes that the parameters for the sfrhist program are in two stub files, "sfrhist.stub" and "makegrid.stub" (they just have to be present; they can be symlinks or empty), that those for the mcrx program are in "mcrx.stub", and that those for the broadband program are in "broadband.stub".
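
As a quick sanity check, a directory is ready when it contains the snapshot files and all four stub files. Here is a minimal sketch (plain Python, not part of mcrxrun itself) that lists the snapshots found and reports any stub files that are missing:

    # Illustration only: verify that the current directory has what mcrxrun
    # expects, i.e. snapshot_nnn files (with or without .hdf5) and four stubs.
    import glob
    import os

    stubs = ["sfrhist.stub", "makegrid.stub", "mcrx.stub", "broadband.stub"]
    snapshots = sorted(glob.glob("snapshot_*"))
    missing = [s for s in stubs if not os.path.exists(s)]

    print("snapshots found: %s" % snapshots)
    print("missing stub files: %s" % (missing if missing else "none"))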

Usage

mcrxrun has seven commands. Running it without arguments produces this help:

[pjonsson@iliadaccess02 ~]$ mcrxrun
mcrxrun commands:
  create        Creates job files
  start <n>     Starts jobs (or job # n)
  status        Reports job status
  restart       Restarts failed or held jobs
  hold          Holds jobs in queue
  kill          Kills jobs in queue
  cleanup       Removes all mcrxrun files
  • create -- Looks through the current directory for snapshot files and, for each of them, creates batch job files and config files. The config files read the stub files and specify the names of the input/output files. The convention is that snapshot_123.hdf5, when processed by sfrhist, produces grid_123.fits, which when processed by mcrx produces mcrx_123.fits, which when processed by broadband generates broadband_123.fits (see the sketch after this list). The batch job files are created for your selected batch system; see the configuration section below.
  • start -- Starts the jobs by submitting the sfrhist jobs to the batch system. Without an argument, all snapshots are started, but you can also say start 0 1 3 6 77 and only those will be started. Any jobs already started are ignored.
  • status -- Prints a status report showing which of the jobs have completed, are running, or have failed. It looks something like this:

    cfe2.pjonsson $ mcrxrun status
    jobs status :
    0: snapshot_0010: sfrhist: C ,  makegrid: - ,  mcrx: C , postprocess: C 
    1: snapshot_0018: sfrhist: C ,  makegrid: - ,  mcrx: C , postprocess: C 
    2: snapshot_0024: sfrhist: C ,  makegrid: - ,  mcrx: C , postprocess: - 
    3: snapshot_0051: sfrhist: C ,  makegrid: - ,  mcrx: C , postprocess: C 
    4: snapshot_0059: sfrhist: C ,  makegrid: - ,  mcrx: C , postprocess: C 
    5: snapshot_0095: sfrhist: C ,  makegrid: - ,  mcrx: F!, postprocess: - 
    

    As you can see, it shows a matrix of all the snapshots in the directory crossed with the three jobs needed to complete each one (makegrid is a holdover and no longer used). The status of each job is indicated by a letter: "-" means not started, "C" completed, "F!" failed, and (not shown) "Q" queued and "R" running. This makes it easy to get an immediate overview of how a run is progressing.

  • restart -- When a job has failed for some reason, you can restart it this way. Without an argument, it restarts all failed jobs, but you can also explicitly list job numbers.

  • hold -- This does a batch job "hold" on the specified jobs, if supported. Good if you realize you screwed something up and don't want the jobs to start before you're done.
  • kill -- Kills any queued or running batch jobs. As usual, specify the number if you want to kill a specific one.
  • cleanup -- Deletes the generated job and config files and the .mcrxrun directory that contains the job data. Once you do this, all history is lost.
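
To make the naming convention used by create concrete, here is a small sketch (an illustration only, not code from mcrxrun) of the output files produced for a given snapshot number:

    # Illustration only: the file names the three stages produce for a
    # snapshot number, following the convention described under "create".
    def stage_outputs(snapnum):
        return {
            "sfrhist":   "grid_%s.fits" % snapnum,
            "mcrx":      "mcrx_%s.fits" % snapnum,
            "broadband": "broadband_%s.fits" % snapnum,
        }

    # snapshot_123.hdf5 -> grid_123.fits -> mcrx_123.fits -> broadband_123.fits
    print(stage_outputs("123"))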

Setting up

The code is in Patrik's python repository on BitBucket, in the module mcrxrun.py. To run it from the command line as shown above, add the python directory to your PYTHONPATH and then make an executable file called mcrxrun containing:

#!/n/sw/hpc-001/python-2.7.1/bin/python

import mcrxrun
mcrxrun.commandline()
where the first line should obviously point to your Python. I'm pretty sure 2.5 or higher is required.

The configuration of your batch system should be done in a file called local_mcrxrun.py. Here's an example:

[pjonsson@iliadaccess02 python]$ more local_mcrxrun.py
import lsf as batch_system

snapshot_file_base = "snapshot"
queue_class = ""
job_type = ""
serial_job_type = ""
queue= "keck"

sfrhist_limit = 0.5
makegrid_limit = 0.5
mcrx_limit = 24
pp_limit = 2.5

ncpu_keyword=""
sfrhist_ncpus="8"
makegrid_ncpus="8"
mcrx_ncpus="8"
pp_ncpus="1"

job_status_line=1
job_status_column=2
The first line specifies which batch system you use; there are modules for LSF, PBS, SGE, LoadLeveler, and Slurm. The remaining variables define things like the names of the queues, the wall clock limits of the jobs, and the number of CPUs to request.

Because there are variations in installations and versions of the batch systems, the outputs look different. Variables like job_status_line tell the script how to parse the output from the batch system's job status command. To see exactly how these variables are used, look in the module for the corresponding batch system (e.g. lsf.py for the LSF example above).
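
As a rough sketch of how these two variables might be applied (the authoritative logic is in the batch system modules; the indices are assumed here to be Python-style, zero-based, as the LSF example above suggests), the parsing boils down to picking one line and one whitespace-separated column of the status output:

    # Hypothetical sketch: pick the job state out of the output of a status
    # command (e.g. bjobs or qstat) using job_status_line and job_status_column.
    def parse_job_state(status_output, status_line, status_column):
        lines = status_output.strip().splitlines()
        fields = lines[status_line].split()
        return fields[status_column]

    # With the LSF settings above (line 1, column 2), this skips the bjobs
    # header line and returns the STAT field, e.g. "PEND", "RUN" or "DONE".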

Under the hood

Sometimes the script can't do what you want. Most often, a job has completed but for some reason you want to rerun it without starting the entire directory over. In these cases, you need to look in the .mcrxrun directory where the script stores its files. For each started job, there will be a file called something like mcrx-snapshot_005-1332-complete. This indicates that the mcrx job for snapshot 005 has completed. (These files are written by the batch job files that mcrxrun create generates.) If you want to run the job over, just delete this file. For started but not completed jobs, there will be files called "...-jobid" instead, containing the batch system job ID for that job.
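
For example, to make mcrxrun forget that the mcrx job for snapshot 005 has completed, you could remove its marker by hand; a minimal sketch (the numeric part of the marker name varies, hence the glob):

    # Illustration only: delete the "-complete" marker so the mcrx stage of
    # snapshot 005 is no longer considered finished and can be run again.
    import glob
    import os

    for marker in glob.glob(".mcrxrun/mcrx-snapshot_005-*-complete"):
        os.remove(marker)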

Example configuration files

In addition to the LSF example above, which is from the Harvard Odyssey cluster, here are some of my local_mcrxrun.py files on different systems. These should at least serve to get you started.

  • PBS on NASA Columbia

    import pbs as batch_system
    
    snapshot_file_base = "snapshot"
    queue_class = "regular"
    queue = "normal"
    job_type = "parallel"
    serial_job_type = "serial"
    
    sfrhist_limit = 1.5
    makegrid_limit = 1.5
    mcrx_limit = 8
    pp_limit = 1.0
    
    ncpu_keyword="ncpus"
    sfrhist_ncpus="8"
    makegrid_ncpus="8"
    mcrx_ncpus="12"
    pp_ncpus="4"
    
    job_status_line=-1
    job_status_column=7
    

  • PBS on PSC Blacklight

    import pbs as batch_system
    
    snapshot_file_base = "snapshot"
    queue_class = "regular"
    queue = "normal"
    job_type = "parallel"
    serial_job_type = "serial"
    
    sfrhist_limit = 1.5
    makegrid_limit = 1.5
    mcrx_limit = 8
    pp_limit = 1.0
    
    ncpu_keyword="ncpus"
    sfrhist_ncpus="8"
    makegrid_ncpus="8"
    mcrx_ncpus="12"
    pp_ncpus="4"
    
    job_status_line=-1
    job_status_column=7
    

  • PBS on SDSC Trestles

    import pbs as batch_system
    
    snapshot_file_base = "snapshot"
    queue_class = "regular"
    queue = "normal"
    job_type = "parallel"
    serial_job_type = "serial"
    
    sfrhist_limit = 1.5
    makegrid_limit = 1.5
    mcrx_limit = 8
    pp_limit = 1.0
    
    ncpu_keyword="ncpus"
    sfrhist_ncpus="8"
    makegrid_ncpus="8"
    mcrx_ncpus="12"
    pp_ncpus="4"
    
    job_status_line=-1
    job_status_column=7
    

  • Slurm on Governator

    import slurm as batch_system
    
    snapshot_file_base = "snapshot"
    queue_class = "batch"
    job_type = "parallel"
    serial_job_type = "serial"
    queue= "batch"
    
    sfrhist_limit = 0.5
    makegrid_limit = 0.5
    mcrx_limit = 24
    pp_limit = 2.5
    
    ncpu_keyword="ncpus"
    sfrhist_ncpus="1"
    makegrid_ncpus="4"
    mcrx_ncpus="8"
    pp_ncpus="1"
    

  • SGE on TACC Longhorn

    import sge as batch_system
    
    snapshot_file_base = "snapshot"
    queue_class = "regular"
    queue = "normal"
    job_type = "gpgpu"
    serial_job_type = "data"
    
    sfrhist_limit = 1.5
    makegrid_limit = 1.5
    mcrx_limit = 6
    pp_limit = 1.0
    
    ncpu_keyword="ncpus"
    sfrhist_ncpus="8"
    makegrid_ncpus="8"
    mcrx_ncpus="8"
    pp_ncpus="8"
    
    job_status_line=0
    job_status_column=4
    
