automatically resubmit run if terminated due to walltime
This patch adds code to Datura's run script that detects whether the run was terminated because its walltime ran out. In that case it resubmits the job automatically.
This is an alternative to presubmission that is often more convenient for the user when simulations fail. Presubmission itself would benefit from Cactus returning a failure code (which simfactory would need to forward) when termination is triggered by an error.
The pull request is here: https://bitbucket.org/simfactory/simfactory2/pull-requests/9/datura-automatically-resubmit-if/diff
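As a rough illustration only (the actual detection logic lives in the pull request and is not reproduced here), a walltime check could compare the elapsed run time against the requested limit. The function name, the safety margin, and the heuristic itself are assumptions made for this sketch:

```python
def terminated_by_walltime(start_time, end_time, walltime_seconds, margin_seconds=300):
    """Guess whether a job ended because its walltime ran out.

    Hypothetical heuristic, not taken from the patch: the job counts as
    walltime-terminated if it ran for at least the walltime limit minus
    a safety margin. All times are in seconds.
    """
    elapsed = end_time - start_time
    return elapsed >= walltime_seconds - margin_seconds
```

A run script could call this after the simulation executable returns, for example `terminated_by_walltime(t0, time.time(), 24 * 3600)`, and only consider resubmission when it returns `True`.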
Comments (9)
-
reporter - changed title to automatically resubmit run if terminated due to walltime
- removed comment
-
- removed comment
This needs to be combined with a check whether presubmission was used (or presubmission needs to be disabled for Datura for the time being).
-
- removed comment
One issue that might come up is that if machines handle this differently, users will be surprised either way: runs are not automatically resubmitted on other machines, yet runs might go on 'forever' on Datura. On other machines users can count on using up only one walltime cycle, regardless of how long a run the parfile requests (assuming it is long enough). Forgetting that on Datura can get costly over time.
Would a prominent warning on Datura be a good solution?
-
- removed comment
People only read a warning if it's the last line of output of a run that is aborted. And even then they often don't.
-
- removed comment
That's true for Cactus simulations. I hope it's not true for output shown right after submitting a job using simfactory.
-
reporter - removed comment
I agree with all suggestions:
- a test for "Done" must be there
- this must only happen when the job is submitted with a '--auto-resubmit' option
- Cactus should return a useful return value and simfactory should take care to propagate this to the queuing system
- datura must not be the only system that behaves differently from all others
Also, in particular, warning messages that are not the last line will not be heeded (in fact, no warning will be heeded as long as the run proceeds).
Based on the discussion, the idea seems like a good one, though (SpEC, for example, and various private simfactory-like systems have operated like this for a long time, and this behaviour is more convenient than presubmission when a failure occurs).
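The conditions listed above could be combined roughly as follows. This is only a sketch under stated assumptions: the function names are hypothetical, the "Done" test assumes the last non-empty line of a completed run's output starts with "Done", and the exit-code test assumes Cactus and simfactory already forward a non-zero code on errors, which is a wish in this ticket rather than existing behaviour.

```python
def run_finished(log_text):
    """Assumption: a completed run ends its output with a line starting with 'Done'."""
    lines = [ln.strip() for ln in log_text.splitlines() if ln.strip()]
    return bool(lines) and lines[-1].startswith("Done")


def should_resubmit(auto_resubmit, exit_code, log_text, hit_walltime):
    """Decide whether to resubmit; every condition must hold."""
    if not auto_resubmit:        # user did not pass '--auto-resubmit' at submit time
        return False
    if exit_code != 0:           # run failed with an error: never loop on failures
        return False
    if run_finished(log_text):   # simulation reached its requested end
        return False
    return hit_walltime          # only resubmit after a walltime termination
```

Keeping the opt-in flag as the first test means the default behaviour stays the same on every machine, so Datura would not behave differently unless the user asks for it.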
-
- removed comment
I also agree that this would be nice-to-have (assuming it works) on all machines. Specifying the length of a simulation in the par-file should be enough. Checkpointing due to wall time should be made as transparent as possible, so I really appreciate the effort in that direction.
-
reporter - changed status to open
- removed comment
We want this to be selectable at submit time.