automatically resubmit run if terminated due to walltime

Issue #1868 open
Roland Haas created an issue

This patch adds code to datura's run script that detects whether the run was terminated because its walltime ran out. In that case it resubmits the job automatically.

This is an alternative to presubmission and is often more convenient for the user when simulations fail. Presubmission itself would benefit from Cactus returning a failure code (which simfactory would need to forward) when termination is triggered by an error.
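The patch itself is not reproduced in this issue, so the following is only a minimal sketch of one way a run script could decide that a job stopped because its walltime ran out: compare the elapsed run time with the requested walltime, allowing a safety margin. The function name, its parameters, and the 5-minute margin are assumptions made for illustration and are not taken from simfactory or from the actual patch.

```python
from datetime import datetime, timedelta


def terminated_by_walltime(start, end, walltime, margin=timedelta(minutes=5)):
    """Guess whether a job stopped because its walltime was used up.

    Hypothetical helper: `start` and `end` are datetimes recorded by the
    run script, `walltime` is the requested allocation as a timedelta, and
    `margin` (an assumed value) covers checkpointing and shutdown time.
    """
    elapsed = end - start
    # A job that ran for (almost) the full allocation most likely hit the limit.
    return elapsed >= walltime - margin


if __name__ == "__main__":
    began = datetime(2016, 1, 1, 0, 0)
    ended = datetime(2016, 1, 1, 23, 58)
    print(terminated_by_walltime(began, ended, timedelta(hours=24)))
```

A heuristic like this cannot tell a clean walltime shutdown from a crash that happens to occur near the end of the allocation, which is one reason a meaningful return code from Cactus would help.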

The pull request is here:


Comments (9)

  1. Erik Schnetter

    This needs to be combined with a check whether presubmission was used (or presubmission needs to be disabled for Datura for the time being).

  2. Frank Löffler

    One issue that might come up: if machines handle this differently, users will be surprised either way. Runs are not automatically resubmitted on other machines, while on Datura runs might go on 'forever'. On other machines users can count on using up only one walltime cycle, regardless of how long a run the parfile requests (assuming it is long enough). Forgetting that on Datura can get costly over time.

    Would a prominent warning on Datura be a good solution?

  3. Erik Schnetter

    People only read a warning if it's the last line of output of a run that is aborted. And even then they often don't.

  4. Frank Löffler

    That's true for Cactus simulations. I hope it's not right after submitting a job using simfactory.

  5. Roland Haas reporter

    I agree with all suggestions:

    1. a test for "Done" must be there
    2. this must only happen when the job is submitted with a '--auto-resubmit' option
    3. Cactus should return a useful return value and simfactory should take care to propagate this to the queuing system
    4. datura must not be the only system that behaves differently from all others
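The first three points above can be combined into a single decision, sketched here under stated assumptions. Only the `--auto-resubmit` option name comes from this thread. The function names, the boolean inputs, and the convention that exit code 0 means a clean shutdown are hypothetical and do not describe simfactory's actual interface.

```python
import argparse


def should_resubmit(auto_resubmit, hit_walltime, saw_done, exit_code):
    """Decide whether a finished job should be queued again.

    Hypothetical inputs: `saw_done` is True if the output contained the
    "Done" marker, `exit_code` is what Cactus returned (0 is assumed to
    mean a clean shutdown), `hit_walltime` comes from a separate check.
    """
    if not auto_resubmit:
        return False  # point 2: only when the user opted in
    if saw_done:
        return False  # point 1: the simulation has finished
    if exit_code != 0:
        return False  # point 3: an error ended the run, do not loop
    return hit_walltime


def parse_args(argv):
    """Parse the options of a hypothetical run-script wrapper."""
    parser = argparse.ArgumentParser(description="hypothetical run-script wrapper")
    parser.add_argument("--auto-resubmit", action="store_true",
                        help="resubmit the job if it ran out of walltime")
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args(["--auto-resubmit"])
    print(should_resubmit(args.auto_resubmit, hit_walltime=True,
                          saw_done=False, exit_code=0))
```

Keeping the decision in one pure function makes it straightforward to apply the same rule on every machine, which addresses the fourth point.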

    Also, in particular, warning messages that are not the last line will not be heeded (in fact, no warning will be heeded as long as the run proceeds).

    Based on the discussion, the idea seems like a good one though (SpEC, for example, and various private simfactory-like systems have operated like this for a long time, and this behaviour is more convenient than presubmission when a failure occurs).

  6. Frank Löffler

    I also agree that this would be nice-to-have (assuming it works) on all machines. Specifying the length of a simulation in the par-file should be enough. Checkpointing due to wall time should be made as transparent as possible, so I really appreciate the effort in that direction.
