Checkpointing + new cleanup procedure

Issue #344 closed
anonymous created an issue

Attached to this ticket is the patch to fix checkpointing (which Ian has already reviewed) plus code implementing the new cleanup procedure.

The new cleanup procedure is thus:

  1. Automatic cleanup of every simulation is now gone.
  2. Any command that creates a new restart (submit, user-initiated run) calls restartlib.CleanupSimulation on that specific simulation, and CleanupSimulation only attempts cleanup if it finds an active restart for that simulation.
  3. sim cleanup without any arguments cleans up all simulations. If you specify a simulation, it cleans up only that one.
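
The three rules above can be sketched as follows. This is only an illustration under stated assumptions: `cleanup_simulation`, `cleanup_all`, `submit`, and the dictionary-based restart records are hypothetical stand-ins, not the actual `restartlib.CleanupSimulation` code from the patch.

```python
# Hypothetical sketch of the cleanup rules; not the actual restartlib code.
# A simulation is modelled as a list of restart records, each a dict with an
# "active" flag (an assumption made for this illustration).

def cleanup_simulation(restarts):
    """Clean up one simulation; do nothing unless an active restart exists.

    Returns the number of restarts that were marked inactive.
    """
    active = [r for r in restarts if r["active"]]
    if not active:
        return 0  # rule 2: no active restart, so no cleanup is attempted
    for r in active:
        r["active"] = False  # stand-in for the real cleanup work
    return len(active)


def cleanup_all(simulations, name=None):
    """Mimic `sim cleanup`: all simulations by default, or only `name`."""
    targets = [name] if name is not None else sorted(simulations)
    return {sim: cleanup_simulation(simulations[sim]) for sim in targets}


def submit(simulations, name):
    """Mimic a command that creates a new restart (for example, submit)."""
    restarts = simulations.setdefault(name, [])
    cleanup_simulation(restarts)       # rule 2: only this simulation is cleaned up
    restarts.append({"active": True})  # the newly created restart
```

In this sketch nothing runs automatically (rule 1): cleanup happens only when `submit` or `cleanup_all` is called explicitly.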

Still to be implemented: deciding when cleanup of all simulations should happen, and cleanup when a simulation finishes. These need to be done via cron jobs or some other method that hasn't been figured out yet. Please svn up to get the latest revision (I committed a bunch of very trivial code changes), then apply this patch and add any comments on it to this ticket.


Comments (9)

  1. Ian Hinder
    • removed comment

    This is a good idea, and I will test it out when I get a bit of time. Note that I didn't look at the code for the checkpoint/recovery portion of the patch; I only tested that it solved the problem I was having.

  2. Barry Wardell
    • removed comment

    I'm currently using this patch and things seem to work pretty well so far. Both submitting and presubmitting a job work fine and cleanup seems to do what it should. My testing hasn't been very extensive so far though. I'll continue using this patch and will report back after a bit more usage.

  3. Ian Hinder
    • removed comment

    I have been using the previous patch which only fixed up checkpointing, so I can only comment on that. I haven't tried this one. With the previous patch, recovery works if the job is not currently in the queue. If the job *is* in the queue, and is running, the new job is chained to the old one. If the job in the queue is only in the Q state, however, it does not get detected and the new job is also put into the Q state, not the H state. It might be that the logic for whether to chain or queue the job depends on whether it has run or not, which is not correct, I think.

    Barry: can you check if the following four cases work correctly with this patch?

    1. If no jobs for that simulation are in the queue;
    2. If a job is in the queue in the Q state;
    3. If a job is in the queue in the R state;
    4. If a job is in the queue in the H state (should only be the case if 2. is true as well, but best to check).
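
As a rough way to state the behaviour being tested, the four cases can be written as a small function. This assumes the intended rule is that a new job is held (H) whenever any job for the simulation is already in the queue, and queued (Q) otherwise; `expected_new_job_state` is a hypothetical name and not part of the patch.

```python
# Hypothetical helper: the state a newly submitted job is expected to get,
# given the states of jobs already queued for the same simulation.

def expected_new_job_state(existing_states):
    """Return "H" (held, chained) if any job is queued, else "Q"."""
    known = {"Q", "R", "H"}
    unknown = set(existing_states) - known
    if unknown:
        raise ValueError(f"unknown queue states: {sorted(unknown)}")
    return "H" if existing_states else "Q"


cases = {
    "1: no jobs in the queue": [],
    "2: one job in Q": ["Q"],
    "3: one job in R": ["R"],
    "4: jobs in Q and H": ["Q", "H"],
}
for label, states in cases.items():
    print(label, "->", expected_new_job_state(states))
```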

    My own tests are on Kraken, which can have long turn-around times, so you might want to test somewhere else like Damiana or Datura. If it works on those machines, then the code is probably OK, and at worst the Kraken machine entry might need to be fixed up.

  4. Barry Wardell
    • removed comment

    Replying to [comment:3 hinder]:

    > Barry: can you check if the following four cases work correctly with this patch?
    >
    > 1. If no jobs for that simulation are in the queue;

    This works using sim create-submit ...

    > 2. If a job is in the queue in the Q state;

    This seems to work fine. The job from 1 is in the qw state. Using sim submit adds a second job in the hq state.

    > 3. If a job is in the queue in the R state;

    This also works fine.

    > 4. If a job is in the queue in the H state (should only be the case if 2. is true as well, but best to check)

    This does not work as it should. Submitting a new job puts it into the hq state, but with a dependency on the currently running job, not on the existing job that is in the hq state. So I have three jobs in the queue:

    0000 in the run state
    0001 in the hq state with dependency on 0000
    0002 in the hq state with dependency on 0000
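
The chaining expected in case 4 can be sketched like this. It assumes that restart IDs sort in submission order and that a new job should depend on the most recently submitted job still in the queue; `pick_dependency` is a hypothetical name, not a function from the patch.

```python
# Hypothetical sketch: choose which queued job a new job should depend on.

def pick_dependency(queued_job_ids):
    """Return the ID of the last job in the chain, or None if the queue is empty."""
    if not queued_job_ids:
        return None
    return max(queued_job_ids)  # assumes zero-padded IDs sort in submission order


# With 0000 running and 0001 held, job 0002 should wait on 0001, not 0000.
print(pick_dependency(["0000", "0001"]))
```

Under these assumptions the jobs form a single chain (0002 after 0001 after 0000) rather than two held jobs both waiting on 0000.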

  5. Barry Wardell
    • removed comment

    All four cases are now working for me with the latest svn version (and without needing this patch).
