If one restart of a simulation exits abnormally, e.g. because of a transient problem on a cluster, all subsequent restarts might run into the same problem. If we can distinguish terminations caused by internal problems (i.e. numerical or code-related) from those caused by external problems (e.g. MPI errors, filesystem issues), we can respond differently to each. Possible actions could be:
1. Continue as normal with the next restart;
2. Delay the next restart for a few hours, in the hope that the transient cluster problems are resolved;
3. Hold the next restart and notify the user by email that an unrecoverable error has occurred.
These actions could be communicated by exit codes (whether through official methods, or through an exit code file). Distinguishing between actions 2 and 3 (delay versus hold) could be achieved by regular expression matching on the standard output or standard error file, which would make the mechanism independent of Cactus: Cactus would only have to say "good" or "bad", and SimFactory could then decide whether "bad" means delay or hold based on some logic in its machine database.
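
The decision logic described above might look roughly like the following Python sketch. It is only an illustration under stated assumptions: the names `Action`, `TRANSIENT_PATTERNS` and `decide_next_restart` are hypothetical and not part of SimFactory, an exit code of 0 is assumed to mean "good", and the example patterns are generic operating system error messages standing in for whatever a real machine database entry would list.

```python
import re
from enum import Enum


class Action(Enum):
    CONTINUE = "continue"  # run the next restart as normal
    DELAY = "delay"        # retry later; the problem looks transient
    HOLD = "hold"          # stop the chain and notify the user


# Hypothetical per-machine patterns that indicate an external, transient
# problem. A real machine database entry would list the messages actually
# seen on that cluster.
TRANSIENT_PATTERNS = [
    re.compile(r"Stale file handle"),
    re.compile(r"Input/output error"),
    re.compile(r"Connection timed out"),
]


def decide_next_restart(exit_code, log_text, patterns=TRANSIENT_PATTERNS):
    """Map a "good"/"bad" exit code plus log contents to an action.

    Assumption: exit code 0 means "good"; anything else means "bad".
    """
    if exit_code == 0:
        return Action.CONTINUE
    # "bad": look for signs of an external problem in stdout/stderr.
    for pattern in patterns:
        if pattern.search(log_text):
            return Action.DELAY
    # No known transient signature, so treat the error as unrecoverable.
    return Action.HOLD
```

With these assumptions, `decide_next_restart(1, "write failed: Stale file handle")` would select the delay action, while a non-zero exit code with no matching log line would select hold. Defaulting to hold when nothing matches is a conservative choice, since it avoids resubmitting a job that fails for internal reasons.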