race condition when using "check program"

Issue #19 on hold
Michael Bakker created an issue

When trying to implement the "check program" parameter in my setup, I experienced strange behaviour that is probably attributable to a general race condition. When my "check program" script enters a failure state, it takes two restart actions to return to a functional state, although the service has been back up since the first restart.

I set up a small test environment that also includes my monit and init script logfiles (see the attached zip). There you can see that after the failure state has been entered, monit executes the status and restart commands more or less at the same time. At this point a race condition occurs: the restart process takes longer than the status process, so the status process still returns a non-functional state. On the next run of the restart and status commands, luckily no race condition occurs, although it could well take even more restart commands to reach a functional state again.

I have implemented my actual check(s) in another way for now, but I would really like to use the "check program" feature for this kind of situation.

Comments (5)

  1. Tildeslash repo owner

    The scenario is real: this behaviour is a feature of the current "check program" implementation (this test doesn't behave the same way as other check types).

    The problem is that "check program" is always one cycle behind reality. This is due to a design limitation of the current Monit test scheduler: to avoid blocking the validation engine with check program execution (whose runtime can vary), we execute the program in one cycle, let it finish in the background, and collect and evaluate its exit status in the next cycle. If the status indicates failure, the action is performed AND the check program is started again at the end of that cycle, so that its exit status can be collected in the following cycle.

    In your case the script in the "exec" action fixes the problem slowly. Although the "check program" is executed after the "exec" action, it finishes faster than the service recovery and therefore captures the state in which the service was still down.
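
    To make the timing concrete, here is a minimal sketch of the "one cycle behind" behaviour described above. It is a toy model, not Monit's actual scheduler: the function name `simulate`, its parameters, and the assumption that every restart completes before the next cycle starts are all made up for illustration.

```python
def simulate(restart_durations, check_duration, cycles, initially_healthy=False):
    """Toy model of a scheduler whose program check lags one cycle behind.

    restart_durations: seconds each successive restart needs (hypothetical values,
                       must be non-empty; the last value is reused if needed)
    check_duration:    seconds after launch at which the check samples the service
    Returns a list of (cycle, collected_status, action) tuples.
    """
    healthy = initially_healthy
    pending = None          # exit status of the check started in the previous cycle
    restarts_done = 0
    log = []
    for cycle in range(1, cycles + 1):
        # 1. Collect the result of the check that was started one cycle ago.
        collected = pending
        restart_time = None
        # 2. If that (possibly stale) result says "failed", fire the restart action.
        if collected is False:
            index = min(restarts_done, len(restart_durations) - 1)
            restart_time = restart_durations[index]
            restarts_done += 1
        # 3. Start the check again at the end of the cycle.
        if restart_time is None:
            pending = healthy
        else:
            # The check races the restart: it only sees a healthy service
            # if the restart finished before the check sampled it.
            pending = restart_time <= check_duration
            healthy = True  # assumption: the restart completes before the next cycle
        status = {None: "none", True: "ok", False: "failed"}[collected]
        log.append((cycle, status, "restart" if restart_time is not None else "none"))
    return log


# A slow first restart (3.0 s) loses the race against a 1.0 s check.
for entry in simulate([3.0, 0.5], check_duration=1.0, cycles=5):
    print(entry)
```

    In this model the slow first restart loses the race, so a second restart is triggered in the following cycle even though the service has already recovered, which matches the behaviour reported in the issue.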

    We will refactor the test scheduler in the near future. This will allow all tests to run non-blocking, and "check program" will then behave as you expect.

    Workaround: If your "check program" is used to monitor a process, we recommend using "check process" instead. Otherwise, it is necessary to merge the "status" and "restart" parts into your "check program" script and perform the recovery immediately, inline, when the problem is detected. The script then returns an error only if the recovery attempt failed, and monit sends the error notification.
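
    The merged "status" plus "restart" script from the workaround could look roughly like the sketch below. It is only an illustration: the init script path `/etc/init.d/myservice`, the function names, and the timeout values are hypothetical and need to be adapted to the real service.

```python
#!/usr/bin/env python3
import subprocess
import time

# Hypothetical commands: replace them with the real status/restart calls.
STATUS_CMD = ["/etc/init.d/myservice", "status"]
RESTART_CMD = ["/etc/init.d/myservice", "restart"]


def run_quietly(cmd):
    """Run a command and return its exit code (127 if it cannot be started)."""
    try:
        return subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        ).returncode
    except OSError:
        return 127


def check_and_recover(status_cmd, restart_cmd, wait_seconds=30.0, poll_interval=1.0):
    """Return 0 if the service is (or becomes) healthy, 1 if recovery failed."""
    if run_quietly(status_cmd) == 0:
        return 0
    # Problem detected: recover inline instead of leaving it to a Monit action.
    run_quietly(restart_cmd)
    # Poll until the service reports healthy or the deadline passes,
    # so that a slow restart is not misread as a failure.
    deadline = time.monotonic() + wait_seconds
    while True:
        if run_quietly(status_cmd) == 0:
            return 0
        if time.monotonic() >= deadline:
            return 1
        time.sleep(poll_interval)
```

    To use this as the check program, end the script with `raise SystemExit(check_and_recover(STATUS_CMD, RESTART_CMD))` and leave only an alert in the Monit configuration (for example `if status != 0 then alert`), since the restart already happens inside the script. The wait time should stay below whatever timeout Monit applies to the program.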
