unsuccessful qsub not recognized / submit succeeds for finished simulation

Issue #334 closed
Frank Löffler created an issue

python version:

I submitted a simulation with 'sim submit', but the corresponding qsub failed because of a wrong number of procs/node (philip cluster). I changed the number given on the command line and ran 'sim submit' again, this time successfully. Several things happened which I think could be handled better:

  • the unsuccessful qsub was not detected during the new submit - it attempted a restart and didn't simply clean up the unsuccessful submit
  • when trying the restart, it went ahead and queued the job, but this later failed when run with "cannot rerun a restart that has been finished". This could have been caught earlier - without the wait time in the queue.

Keyword:

Comments (7)

  1. Erik Schnetter

    I don't quite understand what happened, but here is what I think should have happened:

    1. Simfactory should detect that the qsub fails, and should output an error message. Since at this time the restart has already been created, it continues to exist (it is not deleted). The restart is in the "active, finished" state.

    2. The next submit cleans up the previous (unsuccessful) restart (as usual), then creates a new restart (as usual). If there are checkpoint files, they are still present in the previous (unsuccessful) restart, and those are used.
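A minimal sketch of the first step, assuming the failure can be recognized from the exit status of the submission command (an assumption; the thread does not say how simfactory should detect it). The name `try_submit` is hypothetical, and a small Python child process stands in for a failing qsub so that the example runs without a queueing system.

```python
import subprocess
import sys

def try_submit(argv):
    """Run a submission command and return (ok, message).

    Hypothetical helper: a non-zero exit status is treated as a failed
    submission and turned into an error message instead of being ignored.
    """
    result = subprocess.run(argv, capture_output=True, text=True)
    if result.returncode != 0:
        return False, "Error: submission command exited with status %d: %s" % (
            result.returncode,
            result.stderr.strip(),
        )
    return True, result.stdout.strip()

# Stand-in for a failing qsub: writes an error to stderr, exits non-zero.
failing_qsub = [
    sys.executable,
    "-c",
    "import sys; sys.stderr.write('qsub: Unknown queue'); sys.exit(172)",
]

ok, message = try_submit(failing_qsub)
print(ok, message)
```

The restart directory is left alone here; the sketch only reports the error, which matches the idea that the failed restart continues to exist.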

  2. Frank Löffler reporter

    In this case I don't really care about the restart still being around - nothing had happened yet, no simulation was ever even submitted, no data could be there other than what simfactory created itself. I can imagine that it might be useful to have these data around for debugging, but otherwise it's useless.

    It created a new restart, queued a new job, but all I got from the output when it started (on stderr) was "cannot rerun a restart that has been finished". I would expect the job to be run then.

  3. Erik Schnetter

    Yes, the failed restart is useless. However, there is currently no facility in simfactory to delete restarts. This is potentially a dangerous operation, in particular if something goes wrong (and the wrong restart is deleted). I view the failed restart as more of an eyesore than a problem, since it doesn't really get in the way.

    Yes, the following restart should have started just fine.

    By the way, simfactory should check that the number of processors is consistent with the machine description, and should refuse wrong processor counts. Of course, an error in the MDB can mean that simfactory doesn't detect this. Do you want to open a bug report for this as well?
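A rough sketch of such a consistency check. The function name, the dictionary keys `ppn` and `nodes`, and the example values are all hypothetical stand-ins for a machine description entry, not simfactory's actual MDB format.

```python
def check_processor_request(procs, ppn_used, machine):
    """Return a list of problems; an empty list means the request looks consistent.

    `machine` is a hypothetical machine description with the number of
    cores per node ("ppn") and the number of nodes ("nodes").
    """
    problems = []
    if not 1 <= ppn_used <= machine["ppn"]:
        problems.append(
            "procs/node %d is outside the range 1..%d" % (ppn_used, machine["ppn"])
        )
    max_procs = machine["ppn"] * machine["nodes"]
    if not 1 <= procs <= max_procs:
        problems.append(
            "total procs %d is outside the range 1..%d" % (procs, max_procs)
        )
    return problems

# Illustrative values only, not the real description of any cluster.
machine = {"ppn": 8, "nodes": 100}

for problem in check_processor_request(16, 12, machine):
    print("Error:", problem)
```

Running this before the submit script is written would reject the request up front, provided the machine description itself is correct.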

  4. Roland Haas

    I seem to have similar problems on kraken when qsub fails (in my case the queue was wrong). Simfactory's log contains:

    [LOG:2012-03-14 15:10:48] self.submit(submitScript)::Executing submission command: /opt/torque/2.5.7/bin/qsub /lustre/scratch/rhaas/simulations/cactustest/output-0000/SIMFACTORY/SubmitScript
    [LOG:2012-03-14 15:10:48] self.makeActive()::Simulation cactustest with restart-id 0 has been made active
    [LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::received raw output: qsub: Unknown queue MSG=cannot locate queue
    [LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::
    [LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::using submitRegex: (\d+[.]nid[0-9]*)
    [LOG:2012-03-14 15:10:50] self.submit(submitScript)::After searching raw output, it was determined that the job_id is: -1
    [LOG:2012-03-14 15:10:50] self.submit(submitScript)::If this is -1, that means the regex did NOT match anything. No job_id means no control.

    Trying the qsub command listed, I find that its return value is 172. No error was reported in the original simfactory call.
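To illustrate why the job id ends up as -1: the submitRegex from the log cannot match the error text that qsub printed. The sketch below reuses the regex and the raw output from the log. The function name `extract_job_id` and the sample of a successful output are hypothetical, and reporting an error on a missing match is a suggestion, not simfactory's current behavior.

```python
import re

# Regex copied from the log above.
SUBMIT_REGEX = r"(\d+[.]nid[0-9]*)"

def extract_job_id(raw_output, pattern=SUBMIT_REGEX):
    """Return the job id found in the submit output, or None if nothing matched."""
    match = re.search(pattern, raw_output)
    return match.group(1) if match else None

# Raw output taken from the log above.
failed_output = "qsub: Unknown queue MSG=cannot locate queue"
# Hypothetical example of what a successful submission might print.
ok_output = "1234567.nid00016"

for raw in (failed_output, ok_output):
    job_id = extract_job_id(raw)
    if job_id is None:
        # No job id means the submission did not go through: report it.
        print("Error: submission failed, no job id in output: %r" % raw)
    else:
        print("Submitted job", job_id)
```

Treating a missing match as an error, rather than continuing with -1, would surface the failed qsub at submit time.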
