race condition when using "check program"

Comments (5)

Tildeslash repo owner

assigned issue to

Tildeslash

2014-03-27T18:07:40+00:00

Tildeslash repo owner

changed status to on hold

The scenario you describe is real: the behaviour is a feature of the current "check program" implementation (this test doesn't behave the same way as the other check types).

The problem is that "check program" is always one cycle behind reality. This is due to a design limitation of the current Monit test scheduler: to avoid blocking the validation engine with check program execution (its runtime can vary), we start the program in one cycle, let it finish in the background, and then collect and evaluate its exit status in the next cycle. If the status indicates a failure, the action is performed AND at the end of that cycle the check program is started again, so that its exit status can be collected in the following cycle.

In your case the script in the "exec" action fixes the problem slowly. Although the "check program" is started after the "exec" action, it finishes faster than the service recovers, and so it captures the state in which the service was still down.
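To make the timing concrete, here is a toy model of the cycle described above. It is only a sketch and not Monit's scheduler code: the function name `simulate` and the parameter `recovery_cycles` (how many cycles the "exec" script needs before the service is really back) are made up for illustration.

```python
def simulate(cycles, recovery_cycles):
    """Toy model of the one-cycle lag; NOT Monit's actual scheduler.

    recovery_cycles is a hypothetical knob: the number of cycles that pass
    between the "exec" action being fired and the service being up again.
    Returns the list of cycle numbers in which the action was fired.
    """
    service_up = False      # the service starts out broken
    recover_at = None       # cycle in which the running fix completes
    pending_status = None   # exit status of the program started last cycle
    fired = []

    for cycle in range(cycles):
        # 1. Collect and evaluate the status of the run started last cycle.
        if pending_status == 1:
            fired.append(cycle)
            if recover_at is None:
                # The "exec" action starts the (possibly slow) fix.
                recover_at = cycle + recovery_cycles

        # The fix takes effect once enough cycles have passed.
        if recover_at is not None and cycle >= recover_at:
            service_up = True
            recover_at = None

        # 2. At the end of the cycle the check program is started again.
        #    It finishes quickly, so it snapshots the state as it is now.
        pending_status = 0 if service_up else 1

    return fired


print(simulate(6, 0))  # fix is instant
print(simulate(6, 1))  # fix needs one more cycle
```

In this model an instant fix fires the action once, while a fix that needs one extra cycle fires it twice: the second firing comes from a status that was captured while the first fix was still in progress.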

We will refactor the test scheduler in the near future - it will allow all tests to run non-blocking, and "check program" will then behave as you expect.

Workaround: If your "check program" is used to monitor a process, we recommend using "check process" instead. Otherwise it is necessary to merge the "status" and "restart" parts into your "check program" script and do the recovery immediately/inline when the problem is detected. The script will then return an error only if the recovery attempt failed, and Monit will do the error notification.
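A minimal sketch of such a merged script follows. Everything named here is an assumption for illustration: the `/usr/local/bin/myservice` commands, the helper names, and the retry count and delay should be replaced with whatever fits your service.

```python
import subprocess
import sys
import time

# Hypothetical commands; replace with your own status probe and restart.
STATUS_CMD = ["/usr/local/bin/myservice", "status"]
RESTART_CMD = ["/usr/local/bin/myservice", "restart"]


def run_ok(cmd):
    """Return True if the command could be run and exited with status 0."""
    try:
        return subprocess.run(cmd).returncode == 0
    except OSError:
        return False


def check_and_recover(is_healthy, recover, attempts=5, delay=1.0):
    """Return 0 if the service is healthy or was recovered inline, else 1."""
    if is_healthy():
        return 0
    # Problem detected: recover right away instead of leaving it to Monit.
    recover()
    # Wait for the (possibly slow) recovery before reporting a result.
    for _ in range(attempts):
        if is_healthy():
            return 0
        time.sleep(delay)
    return 1


def main():
    """Entry point: use the return value as the script's exit status."""
    return check_and_recover(
        lambda: run_ok(STATUS_CMD),
        lambda: run_ok(RESTART_CMD),
    )
```

To use it as a script, end the file with `sys.exit(main())`. On the Monit side the test then only needs to alert, for example `check program myservice with path "/usr/local/bin/check_myservice.py"` followed by `if status != 0 then alert` (the path is again hypothetical). Because the script itself waits for the recovery, a non-zero status means the recovery attempt really failed.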

2014-03-27T23:00:08+00:00

Tildeslash repo owner

removed version

Removing version: 5.7 (automated comment)

2014-05-19T13:57:58+00:00

Tildeslash repo owner

changed component to 1. Monit

2014-07-25T11:23:55+00:00

Tildeslash repo owner

changed component to Monit

2014-07-25T22:49:52+00:00
