unsuccessful qsub not recognized / submit succeeds for finished simulation

Erik Schnetter

removed comment

I don't quite understand what happened, but here is what I think should have happened:

1. Simfactory should detect that the qsub fails, and should output an error message. Since at this time the restart has already been created, it continues to exist (it is not deleted). The restart is in the "active, finished" state.

2. The next submit cleans up the previous (unsuccessful) restart (as usual), then creates a new restart (as usual). If there are checkpoint files, they are still present in the previous (unsuccessful) restart, and those are used.

2011-03-09T12:15:17+00:00

Frank Löffler reporter

removed comment

In this case I don't really care about the restart still being around: nothing had happened yet, no simulation was ever even submitted, and no data could be there other than what simfactory created itself. I can imagine that it might be useful to keep this data around for debugging, but otherwise it is useless.

It created a new restart, queued a new job, but all I got from the output when it started (on stderr) was "cannot rerun a restart that has been finished". I would expect the job to be run then.

2011-03-09T12:49:43+00:00

Erik Schnetter

removed comment

Yes, the failed restart is useless. However, there is currently no facility in simfactory to delete restarts. This is potentially a dangerous operation, in particular if something goes wrong (and the wrong restart is deleted). I view the failed restart as more of an eyesore than a problem, since it doesn't really get in the way.

Yes, the following restart should have started just fine.

By the way, simfactory should check that the number of processors is consistent with the machine description, and should refuse wrong processor counts. Of course, an error in the MDB can mean that simfactory doesn't detect this. Do you want to open a bug report for this as well?
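A minimal sketch of such a consistency check, in Python. The `MachineLimits` fields and the `check_processor_count` helper are hypothetical names chosen for illustration; they are not taken from simfactory or from the actual MDB format.

```python
from dataclasses import dataclass


@dataclass
class MachineLimits:
    """Hypothetical subset of a machine description; field names are assumptions."""
    nodes: int           # number of nodes available on the machine
    cores_per_node: int  # number of cores on each node


def check_processor_count(procs, threads, machine):
    """Return a list of problems with the request; an empty list means it looks consistent."""
    if procs < 1 or threads < 1:
        return ["processor and thread counts must be positive"]
    problems = []
    if procs % threads != 0:
        problems.append(
            f"{procs} processors is not a multiple of {threads} threads per process")
    if threads > machine.cores_per_node:
        problems.append(
            f"{threads} threads per process exceeds {machine.cores_per_node} cores per node")
    total_cores = machine.nodes * machine.cores_per_node
    if procs > total_cores:
        problems.append(
            f"{procs} processors exceeds the {total_cores} cores on the machine")
    return problems
```

A submit command could refuse to continue whenever the returned list is non-empty. As noted above, a check like this is only as good as the machine description it reads.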

2011-03-09T14:12:08+00:00

Roland Haas

removed comment

I seem to have similar problems on kraken when qsub fails (in my case the queue was wrong). Simfactory's log contains:

[LOG:2012-03-14 15:10:48] self.submit(submitScript)::Executing submission command: /opt/torque/2.5.7/bin/qsub /lustre/scratch/rhaas/simulations/cactustest/output-0000/SIMFACTORY/SubmitScript
[LOG:2012-03-14 15:10:48] self.makeActive()::Simulation cactustest with restart-id 0 has been made active
[LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::received raw output: qsub: Unknown queue MSG=cannot locate queue
[LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::
[LOG:2012-03-14 15:10:50] job_id = self.extractJobId(output)::using submitRegex: (\d+[.]nid[0-9]*)
[LOG:2012-03-14 15:10:50] self.submit(submitScript)::After searching raw output, it was determined that the job_id is: -1
[LOG:2012-03-14 15:10:50] self.submit(submitScript)::If this is -1, that means the regex did NOT match anything. No job_id means no control.

Trying the qsub command listed, I find that its return value is 172. No error was reported in the original simfactory call.
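The check that appears to be missing here could be sketched as follows. This is not simfactory's code: only the regex is taken from the log, while `SubmitError`, `extract_job_id` and `submit` are hypothetical names used for illustration. The idea is to treat both a nonzero exit status and an unmatched job id as hard errors, instead of silently continuing with a job id of -1.

```python
import re
import subprocess

# Regex as quoted in the log; everything else in this sketch is illustrative.
SUBMIT_REGEX = r"(\d+[.]nid[0-9]*)"


class SubmitError(RuntimeError):
    """Raised when the submission command fails or reports no job id."""


def extract_job_id(output, pattern=SUBMIT_REGEX):
    """Return the first job id matched in `output`, or None if nothing matches."""
    match = re.search(pattern, output)
    return match.group(1) if match else None


def submit(command, pattern=SUBMIT_REGEX):
    """Run `command` (an argument list) and return the job id it reports."""
    result = subprocess.run(command, capture_output=True, text=True)
    output = (result.stdout + result.stderr).strip()
    # A nonzero exit status (172 in the case above) means the submission failed.
    if result.returncode != 0:
        raise SubmitError(
            f"submission command exited with status {result.returncode}: {output}")
    # Even on success, refuse to continue without a job id.
    job_id = extract_job_id(output, pattern)
    if job_id is None:
        raise SubmitError(f"no job id found in submission output: {output!r}")
    return job_id
```

A caller would catch `SubmitError`, print the message to the screen, and leave the restart marked as failed rather than active.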

2012-03-14T14:17:58+00:00

Roland Haas

removed comment

This should be fixed in https://bitbucket.org/simfactory/simfactory2/pull-requests/5/rhaas-warn_to_screen/diff

2015-08-13T05:04:55+00:00

Roland Haas

changed status to resolved
removed comment

2015-08-18T03:35:46+00:00

Roland Haas

edited description
changed status to closed

2019-02-21T20:24:44+00:00
