Issues with internal checkpointing

Using guidance from John, I created a production workflow at SLAC utilizing the internal (i.e., not using condor) checkpointing facility of phosim. The motivation was to solve the problem of long-running phoSim jobs which required more time than offered either by the SLAC or NERSC batch systems (5 days at SLAC, 2 days at NERSC in the 'shared queue'). Thinking that a modest number of checkpoints would suffice even for the longest jobs (to be confirmed), I selected eight (8) checkpoints. The command file used the two directives, e.g.,:

checkpointtotal 8

checkpointcount 8

What I discovered is that phoSim does not divide up the processing per checkpoint in anything close to equal time periods. The processing of checkpoint #7 is particularly problematic for certain simulation conditions (possibly due to very bright stars). Approximately 15% of the current production is stuck at this point: jobs run out of time. This production will not run successfully at SLAC or at NERSC.

The following spreadsheet documents a handful of randomly selected failing and successful jobs:

https://docs.google.com/spreadsheets/d/16Qpt-qg2HwF_zdaK1bxSil3RI7m-1u4mIPzxyBsHOZo/edit?usp=sharing

For each stanza (corresponding to a single visit of a single sensor), note the "CPU Time" column and note the vast variation in time required for each checkpoint segment.

What is the solution? Due to phoSim's I/O load at startup, it is not practical to arbitrarily increase the number of checkpoints. Even at eight, the I/O load for thousands of checkpointing jobs becomes overwhelming.

PhoSim checkpointing needs a better way to divide the work of one visit into more equally-sized chunks between checkpoints.

Comments (5)