Fail counter does not reset after program restart
Issue #787
Hi, I'm using monit to monitor some Tomcat instances. To do that, I wrote a check script and configured monit to use it like this (an example):
check program fakeApp with path "/etc/monit/checks/fakeApp.bash check"
    and with timeout 60 seconds
    if status != 0 for 3 cycles then restart
    start program = "/etc/monit/checks/fakeApp.bash start" with timeout 90 seconds
    stop program = "/etc/monit/checks/fakeApp.bash stop"
    if 2 restarts within 2 cycles then stop
    group alix
With this configuration, if the check fails 3 times in a row, monit restarts the program. The problem is that right after the restart the program is not fully up yet, so the first check obviously returns a non-zero exit code. Monit, however, counts this as the 4th consecutive failure and restarts the program again, as you can see below:
[CET Nov 1 22:38:56] error : 'fakeApp' status failed (1) -- Down
[CET Nov 1 22:39:57] error : 'fakeApp' status failed (1) -- Down
[CET Nov 1 22:40:57] error : 'fakeApp' status failed (1) -- Down
[CET Nov 1 22:40:57] info : 'fakeApp' trying to restart
[CET Nov 1 22:40:57] info : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
[CET Nov 1 22:40:57] info : 'fakeApp' start: '/etc/monit/checks/fakeApp.bash start'
[CET Nov 1 22:41:57] error : 'fakeApp' status failed (1) -- Down
[CET Nov 1 22:41:57] info : 'fakeApp' trying to restart
[CET Nov 1 22:41:57] info : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
[CET Nov 1 22:41:57] info : 'fakeApp' start: '/etc/monit/checks/fakeApp.bash start'
[CET Nov 1 22:42:57] error : 'fakeApp' service restarted 2 times within 2 cycles(s) - stop
[CET Nov 1 22:42:57] info : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
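The sequence in the log can be reproduced with a small model of the "for N cycles" failure counter. This is a hypothetical sketch, not monit's actual implementation: `FailCounter` and `restart_cycles` are illustrative names, and the model assumes a successful check clears the counter. With `reset_on_restart=True` it shows the expected behaviour (a single restart after 3 failures); with `False` it reproduces what the log shows (a second restart on the very next failed cycle).

```python
from dataclasses import dataclass


@dataclass
class FailCounter:
    """Hypothetical model of an 'if status != 0 for N cycles then restart' rule."""
    threshold: int
    reset_on_restart: bool
    failures: int = 0

    def observe(self, ok: bool) -> bool:
        """Record one monitoring cycle; return True if a restart is triggered."""
        if ok:
            # A successful check clears the consecutive-failure count.
            self.failures = 0
            return False
        self.failures += 1
        if self.failures >= self.threshold:
            if self.reset_on_restart:
                # Expected behaviour: the count starts over after a restart.
                self.failures = 0
            return True
        return False


def restart_cycles(results, threshold, reset_on_restart):
    """Return the 1-based cycle numbers at which a restart would be triggered."""
    counter = FailCounter(threshold, reset_on_restart)
    return [cycle for cycle, ok in enumerate(results, 1) if counter.observe(ok)]


# Four failed checks, the 4th one right after the restart (app still starting).
print(restart_cycles([False] * 4, threshold=3, reset_on_restart=True))   # [3]
print(restart_cycles([False] * 4, threshold=3, reset_on_restart=False))  # [3, 4]
```

In the second case the counter is already at 3 when the restart happens, so one failed check during startup is enough to trigger another restart, which then trips the "2 restarts within 2 cycles" rule.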
I read in an old thread that this bug was fixed in version 5.18, but I'm using version 5.25.1.
Is my configuration wrong, or is the bug still there?
Thanks.
I observe the same behavior on 5.26.0 (latest) and 5.20.0.
Here is a simple test monitrc to reproduce the issue:
Here is what I see in the logs when running with the -v argument:
As can be seen, monit does not respect the "for 10 cycles" clause after the first (and every subsequent) restart of a service. It seems the failure counter is not reset after the service restart.
I also found issue #64, which seems to be the same problem and which was fixed in 5.9, so this looks like a regression.
Since this issue was opened more than a year ago, I would like to know if there is any recommended workaround until it is (hopefully) fixed.
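One possible workaround, offered here only as a hypothetical sketch and not as a confirmed fix, is to give the check a warm-up window: the start script records when it ran, and the check reports success until a grace period has passed. This assumes that a successful check clears monit's accumulated failures. The stamp file location, `GRACE_SECONDS`, and the function names below are all illustrative assumptions; in the setup from this issue the same logic would live in the bash check script.

```python
import os
import tempfile
import time

# Hypothetical values: choose a grace period longer than the app's startup time.
STAMP_FILE = os.path.join(tempfile.gettempdir(), "fakeApp.started")
GRACE_SECONDS = 180


def mark_started(stamp_file=STAMP_FILE):
    """Call from the start action: record when the application was started."""
    with open(stamp_file, "w") as f:
        f.write(str(time.time()))


def in_grace_period(stamp_file=STAMP_FILE, grace=GRACE_SECONDS, now=None):
    """True if the app was started less than `grace` seconds ago."""
    if now is None:
        now = time.time()
    try:
        with open(stamp_file) as f:
            started = float(f.read().strip())
    except (OSError, ValueError):
        # No (or unreadable) stamp: behave like a normal check.
        return False
    return now - started < grace


def check_exit_code(real_check, stamp_file=STAMP_FILE, grace=GRACE_SECONDS, now=None):
    """Exit code for the check action: 0 while warming up, else the real result."""
    if in_grace_period(stamp_file, grace, now):
        return 0
    return 0 if real_check() else 1
```

The trade-off is that a genuine failure during the grace period goes unnoticed, so the window should be only as long as the application actually needs to start.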