- changed status to open
- removed comment
Make TerminationTrigger listen to signals
This adds the ability to listen to signals, such as the ones that some queuing systems (SLURM for example, Torque/Moab in theory) can send before terminating a job. This can be used to implement interruptible jobs.
Pull request is https://bitbucket.org/cactuscode/cactusutils/pull-requests/12/make-terminationtrigger-listen-to-signals/diff
Keyword: TerminationTrigger
Comments (17)
-
reporter -
- removed comment
This ability sounds good in theory. Is there any system this works on, where it could be tested?
Also, I would suggest making SIGTERM the default, unless we know of systems where this would be a problem.
-
reporter - removed comment
This was tested by Ian Hinder on Minerva (AEI, SLURM), which unfortunately is not public. Most likely it would work on other SLURM-based systems (e.g. Stampede, Comet) as well.
I used the do-nothing option as the default since that is what most other Cactus thorns do: just enabling the thorn does not yet do anything. Changing the default to SIGTERM would change the default behaviour of TerminationTrigger, as there is no equivalent of a termination_from_file option. So if we change to SIGTERM as the default signal, I think we need to introduce a termination_from_signal option, which must default to false.
Even without the preemption support it can be useful to trap e.g. CTRL-C for a clean shutdown or as a faster alternative to a termination file.
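To illustrate the idea of trapping a signal for a clean shutdown, here is a minimal Python sketch (not the thorn's C code). The handler only records the request and the main loop checks it after each iteration, loosely analogous to requesting termination at the next iteration. The names `TerminationFlag` and `evolve` are hypothetical, and the sketch assumes a POSIX system.

```python
import signal


class TerminationFlag:
    """Records that a termination signal arrived; the main loop polls it."""

    def __init__(self):
        self.requested = False
        self.signum = None

    def handler(self, signum, frame):
        # Keep the handler tiny: only record the request.
        self.requested = True
        self.signum = signum


def evolve(flag, max_iterations, step):
    """Call step(i) until all iterations are done or termination is requested.

    The current iteration always completes before the loop stops.
    Returns the number of completed iterations.
    """
    completed = 0
    for i in range(max_iterations):
        step(i)
        completed += 1
        if flag.requested:
            break
    return completed
```

A caller would register the handler with `signal.signal(signal.SIGINT, flag.handler)` (or any other signal of interest) before entering the loop.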
-
- changed status to open
- removed comment
If it was tested by maintainers it doesn't matter much to me if the machine was public or not. Thanks Ian. And thanks Roland for the patch!
-
- removed comment
As I remember, it worked when I sent the signal to the mpiexec process directly, but not when I sent it to the controlling Python (simfactory) process. Apparently this is correct behaviour; simfactory needs to explicitly install a signal handler to catch it. Since that is the process SLURM would send the signal to, it seemed that there was still more work to be done in simfactory before this would be usable as intended. I didn't "review" the code in TerminationTrigger. I have now added comments to the pull request.
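For illustration only, a controlling Python process could pass a signal on to the process it launched roughly as follows. This is a generic sketch assuming a POSIX system, not simfactory's actual code; `launch_with_forwarding` is a hypothetical name.

```python
import signal
import subprocess
import sys


def launch_with_forwarding(cmd, signums=(signal.SIGTERM,)):
    """Start cmd and forward the given signals to it.

    Returns (proc, restore), where restore() reinstalls the previous handlers.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE, text=True)

    def forward(signum, frame):
        # Pass the signal on only while the child is still running.
        if proc.poll() is None:
            proc.send_signal(signum)

    previous = {s: signal.signal(s, forward) for s in signums}

    def restore():
        for s, old_handler in previous.items():
            # signal.signal() returns None for handlers not installed from Python.
            if old_handler is not None:
                signal.signal(s, old_handler)

    return proc, restore
```

Without such a handler, the default action for SIGTERM ends the controlling process, and the child never gets the chance to shut down cleanly in response to that signal.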
-
- removed comment
I wouldn't hold back the commit to TerminationTrigger just because Simfactory needs further changes. It would already be nice to have when not using simfactory.
-
- removed comment
Yes I agree; sorry I didn't make that clear in the comment.
-
reporter - changed status to open
- removed comment
I updated the pull request:
- multiple signals are now supported
- signal numbers are supported
- test suites were added
- cleaned up some schedule statements
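As a rough illustration of accepting several signals given either by name or by number, the following Python sketch turns a whitespace-separated list into signal numbers. The input format and the name `parse_signals` are assumptions made for this example; the thorn's actual parameter syntax is defined in the pull request, not here.

```python
import signal


def parse_signals(spec):
    """Turn a string such as "SIGTERM 10" into a list of signal numbers.

    Tokens may be signal names (with or without the "SIG" prefix, in any
    case) or plain decimal numbers. Duplicates are dropped, order is kept.
    Unknown names raise KeyError.
    """
    result = []
    for token in spec.split():
        if token.isdigit():
            signum = int(token)
        else:
            name = token.upper()
            if not name.startswith("SIG"):
                name = "SIG" + name
            signum = int(signal.Signals[name])
        if signum not in result:
            result.append(signum)
    return result
```

Plain numbers are passed through unchecked, which also covers signals that have no symbolic name on a given platform.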
-
reporter - removed comment
Replying to [comment:10 rhaas]:
I updated the pull request:
- multiple signals are now supported
- signal numbers are supported
- test suites were added
- cleaned up some schedule statements
One more (forced) update:
- fix some typos that made asserts no-ops
- work around bug in testsuite system
-
- removed comment
Added a question concerning an error condition in the testsuite case, but it otherwise looked fine. Didn't test though.
-
reporter - removed comment
(Mostly the same comment as on Bitbucket.)
I think there may be a bit of confusion. TerminationTrigger calls CCTK_TerminateNext which is a clean exit and exits with exit code 0 to the OS. An Abort would be a call to CCTK_Abort or at least CCTK_Error and would return (the former for sure, the latter I'd hope so) a non-zero exit code to the OS.
The test suite checks for termination (which is a successful exit via CCTK_TerminateNext) and checks that termination was triggered via TerminationTrigger by inspecting a grid scalar that is set to 1 when TerminationTrigger requests a termination. An Abort due to an error would be caught by the test suite system as a non-zero exit code of the Cactus executable.
Having said that though, I found that I can actually test more by not skipping the call to CCTK_TerminateNext during the test suite so that the test suite also tests if CCTK_TerminateNext would actually terminate. I have pushed an updated version of the code and test suite.
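The distinction between a clean termination and an abort can be sketched from the outside as a check on the process exit code. This Python example is a generic illustration, not the Cactus test suite system; `classify_run` is a hypothetical helper.

```python
import subprocess
import sys


def classify_run(cmd):
    """Run cmd and classify how it ended based on its exit code.

    Exit code 0 is treated as a clean termination; anything else
    (including death by a signal, which subprocess reports as a
    negative return code) is treated as an abort.
    """
    returncode = subprocess.run(cmd).returncode
    return "clean" if returncode == 0 else "abort"
```

For example, `classify_run([sys.executable, "-c", "import sys; sys.exit(3)"])` returns `"abort"`. A check like this cannot tell why a run ended cleanly, which is why the test suite additionally inspects the grid scalar set by TerminationTrigger.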
-
reporter - removed comment
Ok to apply (after the release is fine with me, though having it before may be neat since my favorite cluster may support job termination via signals soon)?
-
reporter - removed comment
This is becoming more interesting again (for me) since BlueWaters will (soon) support receiving signals some time before the OS kills a job due to out-of-walltime events.
So having someone pick up the review would be great (after the release).
-
- changed status to open
- removed comment
I think this has had enough eyes on it now, and I don't think it's going to break anything serious, so please apply.
-
reporter - changed status to resolved
- removed comment
Applied as git hash 870896504f97f0caa44fab39aa094fbd4c2411b3 "TerminationTrigger: test that run actually terminated" of cactusutils.