When a simulation consists of multiple chained jobs, the failure of one job is likely to lead to the failure of subsequent jobs. Possible reasons for failure of a job include:
- Running out of disk quota;
- An error in the code;
- A numerical problem;
- A problem with the cluster.
Of these, only the last might be recoverable simply by running the next job in the chain. Even then, a job that starts immediately is likely to fail as well, because the cluster problem may not have resolved itself yet.
To avoid wasting CPU hours on the remaining jobs in the chain, I think simfactory should hold or remove the subsequent chained jobs when a job fails. Removing them would probably be the simpler option, and users can always run "submit" on them again to restart them.
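
To make the proposal concrete, here is a minimal sketch of the "remove" variant. It is not simfactory code: `ChainedJob`, `remove_subsequent_jobs`, and the `cancel` callback are hypothetical names used only for illustration, and the callback stands in for whatever the queueing system uses to delete a queued job.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class ChainedJob:
    """One job in a chain (hypothetical structure, not simfactory's own)."""
    job_id: str         # ID assigned by the queueing system
    restart_index: int  # position in the chain, 0 = first job


def remove_subsequent_jobs(
    chain: Sequence[ChainedJob],
    failed_index: int,
    cancel: Callable[[str], None],
) -> List[str]:
    """Cancel every job that comes after the failed one.

    `cancel` is an assumed callback that removes a single job from the
    queue. Returns the IDs of the removed jobs in chain order.
    """
    removed = []
    for job in sorted(chain, key=lambda j: j.restart_index):
        if job.restart_index > failed_index:
            cancel(job.job_id)
            removed.append(job.job_id)
    return removed


if __name__ == "__main__":
    chain = [ChainedJob(str(1000 + i), i) for i in range(4)]
    # Pretend the second job (index 1) failed; print stands in for the
    # real queue-removal command.
    removed = remove_subsequent_jobs(chain, 1, lambda job_id: print("removing", job_id))
    print(removed)
```

Passing the cancellation step in as a callback keeps the sketch independent of any particular queueing system. A "hold" variant would have the same shape, with a callback that holds the job instead of removing it.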