Fail counter does not reset after program restart

Issue #787 new
Pinux created an issue

Hi, I'm using Monit to monitor some Tomcat instances. To do that, I wrote a check script and configured Monit to use it like this (an example):

check program fakeApp with path "/etc/monit/checks/fakeApp.bash check"
  and with timeout 60 seconds
  if status != 0 for 3 cycles then restart
  start program = "/etc/monit/checks/fakeApp.bash start" with timeout 90 seconds
  stop program = "/etc/monit/checks/fakeApp.bash stop"
  if 2 restarts within 2 cycles then stop
  group alix

With this configuration, if the check fails 3 times in a row, Monit restarts the program. The problem is that right after the restart the program is not fully up yet, so the first check returns, as expected, a non-zero exit code. Monit then treats this as the 4th consecutive failure and restarts the program again, as you can see below:

[CET Nov  1 22:38:56] error    : 'fakeApp' status failed (1) -- Down
[CET Nov  1 22:39:57] error    : 'fakeApp' status failed (1) -- Down
[CET Nov  1 22:40:57] error    : 'fakeApp' status failed (1) -- Down
[CET Nov  1 22:40:57] info     : 'fakeApp' trying to restart
[CET Nov  1 22:40:57] info     : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
[CET Nov  1 22:40:57] info     : 'fakeApp' start: '/etc/monit/checks/fakeApp.bash start'
[CET Nov  1 22:41:57] error    : 'fakeApp' status failed (1) -- Down
[CET Nov  1 22:41:57] info     : 'fakeApp' trying to restart
[CET Nov  1 22:41:57] info     : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
[CET Nov  1 22:41:57] info     : 'fakeApp' start: '/etc/monit/checks/fakeApp.bash start'
[CET Nov  1 22:42:57] error    : 'fakeApp' service restarted 2 times within 2 cycles(s) - stop
[CET Nov  1 22:42:57] info     : 'fakeApp' stop: '/etc/monit/checks/fakeApp.bash stop'
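
To make the suspected behavior concrete, here is a toy model of a "for N cycles" rule written in bash. It is only a sketch of my assumption about what happens, not Monit's actual implementation; the `simulate` function and its arguments are made up for illustration. With the failure streak cleared on restart, six failing cycles and a threshold of 3 give restarts at cycles 3 and 6. Without clearing, every cycle after the third triggers a restart, which matches the timing in the log above.

```bash
#!/usr/bin/env bash
# Toy model of an "if status != 0 for N cycles then restart" rule.
# NOT Monit's code: simulate() and its arguments are hypothetical.
#
# Usage: simulate <reset_on_restart: yes|no> <threshold> <result>...
# Each result is 1 (check failed) or 0 (check succeeded).
# Prints the 1-based cycle numbers at which a restart would be triggered.
simulate() {
  local reset_on_restart=$1 threshold=$2
  shift 2
  local failures=0 cycle=0 restarts=""
  local result
  for result in "$@"; do
    cycle=$(( cycle + 1 ))
    if [ "$result" -eq 0 ]; then
      failures=0                      # a successful check clears the streak
      continue
    fi
    failures=$(( failures + 1 ))
    if [ "$failures" -ge "$threshold" ]; then
      restarts="$restarts $cycle"
      if [ "$reset_on_restart" = yes ]; then
        failures=0                    # expected: the streak starts over after a restart
      fi
    fi
  done
  echo "${restarts# }"
}

# Six failing cycles in a row with a threshold of 3:
simulate yes 3 1 1 1 1 1 1   # prints "3 6"     (what I expected)
simulate no  3 1 1 1 1 1 1   # prints "3 4 5 6" (what the log looks like)
```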

I read in an old thread that this bug was fixed in version 5.18, but I'm using version 5.25.1.

Is my configuration wrong, or is the bug still there?
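
In the meantime, one workaround I can think of is to let the check script itself report success for a while after a start, so that Monit does not see a failure while Tomcat is still coming up. The following is an untested sketch under that assumption, not something recommended by the Monit developers. `STAMP_FILE`, `GRACE_SECONDS` and the three function names are hypothetical and are not Monit settings. It also assumes that a successful check clears the failure streak, which I have not verified.

```bash
#!/usr/bin/env bash
# Hypothetical grace-period helpers for a Monit check script.
# STAMP_FILE and GRACE_SECONDS are illustrative names, not Monit settings.
STAMP_FILE="${STAMP_FILE:-/tmp/fakeApp.started}"
GRACE_SECONDS="${GRACE_SECONDS:-120}"

# Call from the "start" action: remember when the application was started.
record_start() {
  date +%s > "$STAMP_FILE"
}

# Return 0 while the application is still within its startup grace period.
in_grace_period() {
  [ -f "$STAMP_FILE" ] || return 1
  local started now
  started=$(cat "$STAMP_FILE")
  now=$(date +%s)
  [ $(( now - started )) -lt "$GRACE_SECONDS" ]
}

# Call from the "check" action: run the real health check (given as a
# command with arguments) unless the grace period is still active.
check_with_grace() {
  if in_grace_period; then
    return 0   # report "up" to Monit while the application is starting
  fi
  "$@"
}
```

In `fakeApp.bash`, the `start` branch would call `record_start` before launching the application, and the `check` branch would call `check_with_grace` followed by the real check command. The grace period would need to be longer than the time the application takes to come up.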

Thanks.

Comments (2)

  1. Alex Korotkin

    I observe the same behavior on 5.26.0 (latest) and 5.20.0.

    Here is a simple test monitrc that reproduces the issue:

    set idfile /home/user/monit-test/monit.id
    set logfile syslog
    set pidfile /home/user/monit-test/monit.pid
    set statefile /home/user/monit-test/monit.state
    
    set daemon 1
    
    check process sshd pidfile /var/run/sshd.pid
       start program = "/usr/sbin/service ssh start" with timeout 5 seconds
       stop program = "/usr/sbin/service ssh stop" with timeout 5 seconds
       if failed port 80 protocol http request "/not-existing" for 10 cycles then restart
    

    Here is what I see in the logs when running with the -v argument:

    Jan 17 12:58:48 ubuntu monit[2850]: pidfile '/home/user/monit-test/monit.pid' does not exist
    Jan 17 12:58:48 ubuntu monit[2850]: Starting Monit 5.26.0 daemon
    Jan 17 12:58:48 ubuntu monit[2852]: 'ubuntu' Monit 5.26.0 started
    Jan 17 12:58:48 ubuntu monit[2852]: Cannot open proc file '/proc/2849/stat' -- No such file or directory
    Jan 17 12:58:48 ubuntu monit[2852]: system statistic error -- cannot read /proc/2849/stat
    Jan 17 12:58:48 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:48 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:48 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:48 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:49 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:49 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:49 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:49 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:50 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:50 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:50 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:50 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:51 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:51 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:51 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:51 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:52 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:52 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:52 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:52 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:53 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:53 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:53 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:53 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:54 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:54 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:54 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:54 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:55 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:55 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:55 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:55 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:56 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:56 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:56 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:56 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' process is running with pid 2676
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:57 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' trying to restart
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' stop: '/usr/sbin/service ssh stop'
    Jan 17 12:58:57 ubuntu systemd[1]: Stopping OpenBSD Secure Shell server...
    Jan 17 12:58:57 ubuntu systemd[1]: Stopped OpenBSD Secure Shell server.
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' stopped
    Jan 17 12:58:57 ubuntu monit[2852]: pidfile '/var/run/sshd.pid' does not exist
    Jan 17 12:58:57 ubuntu monit[2852]: 'sshd' start: '/usr/sbin/service ssh start'
    Jan 17 12:58:58 ubuntu systemd[1]: Starting OpenBSD Secure Shell server...
    Jan 17 12:58:58 ubuntu systemd[1]: Started OpenBSD Secure Shell server.
    Jan 17 12:58:58 ubuntu monit[2852]: 'sshd' started
    Jan 17 12:58:59 ubuntu monit[2852]: 'sshd' process is running with pid 2920
    Jan 17 12:58:59 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:58:59 ubuntu monit[2852]: 'sshd' connection test paused for 3 s while the process is starting
    Jan 17 12:59:00 ubuntu monit[2852]: 'sshd' process is running with pid 2920
    Jan 17 12:59:00 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:59:00 ubuntu monit[2852]: 'sshd' connection test paused for 2 s while the process is starting
    Jan 17 12:59:01 ubuntu monit[2852]: 'sshd' process is running with pid 2920
    Jan 17 12:59:01 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:59:01 ubuntu monit[2852]: 'sshd' connection test paused for 1 s while the process is starting
    Jan 17 12:59:02 ubuntu monit[2852]: 'sshd' process is running with pid 2920
    Jan 17 12:59:02 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:59:02 ubuntu monit[2852]: 'sshd' connection test paused for 0 ms while the process is starting
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' process is running with pid 2920
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' zombie check succeeded
    Jan 17 12:59:03 ubuntu monit[2852]: Socket test failed for [127.0.0.1]:80 -- Connection refused
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' failed protocol test [HTTP] at [localhost]:80/not-existing [TCP/IP] -- Connection refused
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' trying to restart
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' stop: '/usr/sbin/service ssh stop'
    Jan 17 12:59:03 ubuntu systemd[1]: Stopping OpenBSD Secure Shell server...
    Jan 17 12:59:03 ubuntu systemd[1]: Stopped OpenBSD Secure Shell server.
    Jan 17 12:59:03 ubuntu monit[2852]: 'sshd' stopped
    

    As can be seen, Monit does not respect the "for 10 cycles" clause after the first (and every subsequent) restart of the service: a single failed check is enough to trigger the next restart. It seems the failure counter is not reset after the service restart.

    I also found issue #64, which seems to describe the same problem and was fixed in 5.9, so this looks like a regression.

    Since this issue was opened more than a year ago, I would like to know whether there is a recommended workaround until it is (hopefully) fixed.
