Running Sunrise batch jobs
When processing simulations with Sunrise, large numbers of batch jobs need to be set up, started, and monitored, which can be a chore. There is a tool called `mcrxrun` that makes this easier.
Prerequisites
`mcrxrun` assumes that you have a directory with simulation snapshots that you want to process, and that those snapshots are named `snapshot_nnn`, possibly with a `.hdf5` suffix. It also assumes that your parameters for the sfrhist program are in two stub files, "sfrhist.stub" and "makegrid.stub" (they just have to be present; they can be symlinks or empty), those for the mcrx program in "mcrx.stub", and those for the broadband program in "broadband.stub".
Usage
`mcrxrun` has seven commands. Running it without one produces this help:

```
[pjonsson@iliadaccess02 ~]$ mcrxrun
mcrxrun commands:
  create     Creates job files
  start <n>  Starts jobs (or job # n)
  status     Reports job status
  restart    Restarts failed or held jobs
  hold       Holds jobs in queue
  kill       Kills jobs in queue
  cleanup    Removes all mcrxrun files
```
- create -- Looks through the current directory for snapshot files and, for each of them, creates batch job files and config files. The config files read the stub files and specify the names of the input/output files. The convention is that snapshot_123.hdf5, when processed by sfrhist, produces grid_123.fits, which when processed by mcrx produces mcrx_123.fits, which when processed by broadband generates broadband_123.fits. The batch job files are created for your selected batch system; see the configuration section below.
- start -- Starts the jobs by submitting the sfrhist jobs to the batch system. Without an argument, all snapshots are started, but you can also say `start 0 1 3 6 77` and only those will be started. Any jobs already started are ignored.
- status -- Prints a status report showing which of the jobs have completed, are running, or have failed. It looks something like this:
```
cfe2.pjonsson $ mcrxrun status
jobs status :
0: snapshot_0010: sfrhist: C , makegrid: - , mcrx: C , postprocess: C
1: snapshot_0018: sfrhist: C , makegrid: - , mcrx: C , postprocess: C
2: snapshot_0024: sfrhist: C , makegrid: - , mcrx: C , postprocess: -
3: snapshot_0051: sfrhist: C , makegrid: - , mcrx: C , postprocess: C
4: snapshot_0059: sfrhist: C , makegrid: - , mcrx: C , postprocess: C
5: snapshot_0095: sfrhist: C , makegrid: - , mcrx: F!, postprocess: -
```
As you can see, it shows a matrix of all the snapshots in the directory against the three jobs needed to complete each one (makegrid is a holdover and is not used anymore). The status of each job is indicated by a letter: "-" means not started, "C" completed, "F!" failed, and (not shown here) "Q" queued and "R" running. This makes it easy to get an immediate overview of how a run is progressing.
- restart -- When a job has failed for some reason, you can restart it this way. Without an argument, it restarts all failed jobs, but you can also explicitly list job numbers.
- hold -- This does a batch job "hold" on the specified jobs, if supported. Good if you realize you screwed something up and don't want the jobs to start before you're done.
- kill -- Kills any queued or running batch jobs. As usual, specify the number if you want to kill a specific one.
- cleanup -- Deletes the generated job and config files and the `.mcrxrun` directory that contains the job data. Once you do this, all history is lost.
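The file-naming convention that create follows (snapshot_123.hdf5 produces grid_123.fits, then mcrx_123.fits, then broadband_123.fits) can be sketched as a small helper. This is illustrative only and not mcrxrun's own code:

```python
import re


def output_names(snapshot):
    """Map a snapshot file name to the output names of each stage,
    following the snapshot_nnn -> grid/mcrx/broadband_nnn.fits
    convention described above."""
    m = re.match(r"snapshot_(\d+)(?:\.hdf5)?$", snapshot)
    if m is None:
        raise ValueError("not a snapshot file: %s" % snapshot)
    n = m.group(1)
    return {"sfrhist": "grid_%s.fits" % n,
            "mcrx": "mcrx_%s.fits" % n,
            "broadband": "broadband_%s.fits" % n}
```

For example, `output_names("snapshot_123.hdf5")["mcrx"]` gives `mcrx_123.fits`.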
Setting up
The code is in Patrik's python repository on BitBucket, in the module mcrxrun.py. To run it from the command line as shown above, put the python directory in your PYTHONPATH and then just make an executable file called `mcrxrun` containing:

```python
#!/n/sw/hpc-001/python-2.7.1/bin/python
import mcrxrun
mcrxrun.commandline()
```
The configuration of your batch system should be done in a file called `local_mcrxrun.py`. Here's an example:

```
[pjonsson@iliadaccess02 python]$ more local_mcrxrun.py
import lsf as batch_system
snapshot_file_base = "snapshot"
queue_class = ""
job_type = ""
serial_job_type = ""
queue = "keck"
sfrhist_limit = 0.5
makegrid_limit = 0.5
mcrx_limit = 24
pp_limit = 2.5
ncpu_keyword = ""
sfrhist_ncpus = "8"
makegrid_ncpus = "8"
mcrx_ncpus = "8"
pp_ncpus = "1"
job_status_line = 1
job_status_column = 2
```
Because there are variations in installations and versions of the batch systems, their outputs look different. Variables like `job_status_line` tell the script how to parse the output of the batch system's job status command. To see exactly how these variables are used, look in the module for the corresponding batch system (e.g. lsf.py for LSF above).
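As a sketch of how `job_status_line` and `job_status_column` might be applied, assuming whitespace-separated fields (the real parsing lives in the batch-system modules such as lsf.py):

```python
def parse_job_status(output, job_status_line, job_status_column):
    """Extract the status field from a batch system's job-status output.

    job_status_line selects which non-empty line of the output holds
    the job's row (negative values count from the end, as in the PBS
    examples below), and job_status_column selects the
    whitespace-separated field on that line. Illustrative sketch only.
    """
    lines = [l for l in output.splitlines() if l.strip()]
    return lines[job_status_line].split()[job_status_column]
```

With the LSF settings above (`job_status_line = 1`, `job_status_column = 2`), this would pick the "STAT" field out of the second line of typical `bjobs` output.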
Under the hood
Sometimes the script can't do what you want. Most often, what happens is that a job has completed but for some reason you want to rerun it without starting the entire directory over. In these cases, you need to look in the `.mcrxrun` directory where the script stores its files. For each started job, there will be a file called something like `mcrx-snapshot_005-1332-complete`. This indicates that the mcrx job for snapshot 005 has completed. (These files are created by the job files that `mcrxrun create` generates.) If you want to run the job over, just delete this file. For started but not yet completed jobs, there will be files called "...-jobid" instead, containing the batch system job ID for that job.
Example configuration files
In addition to the LSF example above, which is from the Harvard Odyssey cluster, here are some of my `local_mcrxrun.py` files on different systems. These should at least serve to get you started.
- PBS on NASA Columbia

```
import pbs as batch_system
snapshot_file_base = "snapshot"
queue_class = "regular"
queue = "normal"
job_type = "parallel"
serial_job_type = "serial"
sfrhist_limit = 1.5
makegrid_limit = 1.5
mcrx_limit = 8
pp_limit = 1.0
ncpu_keyword = "ncpus"
sfrhist_ncpus = "8"
makegrid_ncpus = "8"
mcrx_ncpus = "12"
pp_ncpus = "4"
job_status_line = -1
job_status_column = 7
```
- PBS on PSC Blacklight

```
import pbs as batch_system
snapshot_file_base = "snapshot"
queue_class = "regular"
queue = "normal"
job_type = "parallel"
serial_job_type = "serial"
sfrhist_limit = 1.5
makegrid_limit = 1.5
mcrx_limit = 8
pp_limit = 1.0
ncpu_keyword = "ncpus"
sfrhist_ncpus = "8"
makegrid_ncpus = "8"
mcrx_ncpus = "12"
pp_ncpus = "4"
job_status_line = -1
job_status_column = 7
```
- PBS on SDSC Trestles

```
import pbs as batch_system
snapshot_file_base = "snapshot"
queue_class = "regular"
queue = "normal"
job_type = "parallel"
serial_job_type = "serial"
sfrhist_limit = 1.5
makegrid_limit = 1.5
mcrx_limit = 8
pp_limit = 1.0
ncpu_keyword = "ncpus"
sfrhist_ncpus = "8"
makegrid_ncpus = "8"
mcrx_ncpus = "12"
pp_ncpus = "4"
job_status_line = -1
job_status_column = 7
```
- Slurm on Governator

```
import slurm as batch_system
snapshot_file_base = "snapshot"
queue_class = "batch"
job_type = "parallel"
serial_job_type = "serial"
queue = "batch"
sfrhist_limit = 0.5
makegrid_limit = 0.5
mcrx_limit = 24
pp_limit = 2.5
ncpu_keyword = "ncpus"
sfrhist_ncpus = "1"
makegrid_ncpus = "4"
mcrx_ncpus = "8"
pp_ncpus = "1"
```
- SGE on TACC Longhorn

```
import sge as batch_system
snapshot_file_base = "snapshot"
queue_class = "regular"
queue = "normal"
job_type = "gpgpu"
serial_job_type = "data"
sfrhist_limit = 1.5
makegrid_limit = 1.5
mcrx_limit = 6
pp_limit = 1.0
ncpu_keyword = "ncpus"
sfrhist_ncpus = "8"
makegrid_ncpus = "8"
mcrx_ncpus = "8"
pp_ncpus = "8"
job_status_line = 0
job_status_column = 4
```