Issues with internal checkpointing

Issue #12 closed
Thomas Glanzman created an issue

Using guidance from John, I created a production workflow at SLAC using phoSim's internal (i.e., not condor-based) checkpointing facility. The motivation was to solve the problem of long-running phoSim jobs that require more time than is offered by either the SLAC or NERSC batch systems (5 days at SLAC, 2 days at NERSC in the 'shared queue'). Thinking that a modest number of checkpoints would suffice even for the longest jobs (to be confirmed), I selected eight (8) checkpoints. The command file used these two directives:

checkpointtotal 8

checkpointcount 8
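
My understanding (worth checking against the phoSim documentation) is that checkpointtotal fixes the number of checkpoints for a visit, while checkpointcount selects which segment a given invocation executes, so the workflow runs the same visit once per segment. A minimal Python sketch of such a driver follows; the file names, segment numbering, and phosim.py arguments are illustrative assumptions, not taken from the actual production code:

```python
import shutil, subprocess

N_CHECKPOINTS = 8  # matches 'checkpointtotal 8' above

# Assuming segments are numbered 0..N_CHECKPOINTS; verify against the docs.
for segment in range(N_CHECKPOINTS + 1):
    cmdfile = f"commands_seg{segment}.txt"
    shutil.copy("commands_base.txt", cmdfile)  # hypothetical shared physics commands
    with open(cmdfile, "a") as out:
        out.write(f"checkpointtotal {N_CHECKPOINTS}\n")
        out.write(f"checkpointcount {segment}\n")
    # In the real workflow each segment is a separate batch job chained
    # on the previous one; here the segments simply run back to back.
    subprocess.run(["python", "phosim.py", "instance_catalog",
                    "-c", cmdfile], check=True)
```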

What I discovered is that phoSim does not divide the processing into anything close to equal time segments between checkpoints. The processing of checkpoint #7 is particularly problematic under certain simulation conditions (possibly due to very bright stars). Approximately 15% of the current production is stuck at this point: the jobs run out of time. As it stands, this production will not run successfully at either SLAC or NERSC.

The following spreadsheet documents a handful of randomly selected failing and successful jobs:

https://docs.google.com/spreadsheets/d/16Qpt-qg2HwF_zdaK1bxSil3RI7m-1u4mIPzxyBsHOZo/edit?usp=sharing

For each stanza (corresponding to a single visit of a single sensor), note the "CPU Time" column and the vast variation in the time required for each checkpoint segment.
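
The numbers below are invented for illustration, but a few lines of Python of this shape make the imbalance in such figures obvious:

```python
# Hypothetical per-segment CPU times (seconds) for one visit/sensor,
# shaped like a spreadsheet stanza; the values are illustrative only.
segment_cpu = [1200, 1500, 900, 1100, 2400, 3100, 8600, 95000, 1300]

mean = sum(segment_cpu) / len(segment_cpu)
worst = max(segment_cpu)
print(f"mean segment: {mean:,.0f}s  worst: {worst:,.0f}s  "
      f"imbalance: {worst / mean:.1f}x")
# Equal division would give an imbalance near 1x; a hot segment like
# checkpoint #7 here dominates the wall time and blows the batch limit.
```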

What is the solution? Because each checkpoint segment repeats phoSim's heavy startup I/O, it is not practical to increase the number of checkpoints arbitrarily. Even at eight, the aggregate I/O load from thousands of checkpointing jobs becomes overwhelming.

PhoSim checkpointing needs a better way to divide the work of one visit into more equally-sized chunks between checkpoints.
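
For illustration only (this is not how phoSim divides work today), a more equal split would mean partitioning by estimated cost rather than by a fixed count. A greedy partition over hypothetical per-source photon counts shows the idea:

```python
def balanced_chunks(photon_counts, n_chunks):
    """Greedy partition of per-source photon counts into n_chunks of
    roughly equal total cost (illustrative; not phoSim's actual scheme).
    Returns one list of source indices per chunk."""
    order = sorted(range(len(photon_counts)),
                   key=lambda i: photon_counts[i], reverse=True)
    chunks = [[] for _ in range(n_chunks)]
    totals = [0] * n_chunks
    for i in order:                      # place the largest source first ...
        k = totals.index(min(totals))    # ... into the lightest chunk
        chunks[k].append(i)
        totals[k] += photon_counts[i]
    return chunks

# A very bright star ends up in a chunk by itself, while the faint
# sources share the remaining chunks:
print(balanced_chunks([9_000_000, 500, 800, 400, 900, 600], 3))
# -> [[0], [4, 1, 3], [2, 5]]
```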

Comments (4)

  1. John Peterson

    the multi-threading release in v3.6 resolves this (there is no I/O increase, just a reduction in wall time).

  2. Thomas Glanzman reporter

    Hi John,

    I think I hear you saying that the "internal check-pointing" mechanism is hereby obsolete and no longer supported?

  3. John Peterson

    i think it probably is obsolete given the multi-threading ability now. but we can still support it, if it is still useful for you or anyone. in fact, i expect that if checkpointing is still useful, it will be useful in a "mild" form where jobs are just divided into 2, 4, or 8 chunks for safety reasons. then the points you raised on this thread about not being able to go to arbitrarily large numbers of checkpoints won't be relevant. so for the moment, i'd say use it "as is" if you like, but it's probably not necessary and an extra hassle for your workflow.

  4. Thomas Glanzman reporter

    I agree that the internal check-pointing is obsolete given the wide variation in processing time between check-points. R.I.P.

    Initial tests with dmtcp check-pointing have, however, been quite promising.
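
    For anyone heading down the same path, the wrapper is roughly of this shape. A sketch only: the dmtcp option names are standard but worth verifying against the installed version, and the phosim invocation is an illustrative assumption:

    ```python
    import glob, subprocess

    images = glob.glob("ckpt_*.dmtcp")
    if images:
        # A previous attempt timed out after writing a checkpoint: resume.
        # (dmtcp also writes a dmtcp_restart_script.sh that can be used
        # instead of calling dmtcp_restart directly.)
        subprocess.run(["dmtcp_restart"] + images, check=True)
    else:
        # First attempt: run phosim under dmtcp, checkpointing hourly so
        # a job killed at the wall-clock limit loses at most an hour.
        subprocess.run(["dmtcp_launch", "--interval", "3600",
                        "python", "phosim.py", "instance_catalog"],
                       check=True)
    ```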
