Monit doesn't wait x cycles after restarting service.

Issue #711 new
Christian Hack created an issue

I'm pretty sure I'm doing this right. This is my current config, but I have tried a number of variants of the times/cycles settings.

check process mytask with pidfile /var/run/mytask.pid
    start program = "/etc/init.d/mytask start" with timeout 120 seconds
    stop program = "/etc/init.d/mytask stop"
    if failed host localhost
        port 8080
        with timeout 30 seconds
        3 times in 3 cycles
        then restart

Process monitoring and restarting work fine.

The port monitoring, though, doesn't appear to try three times after restarting. Basically, I want it to try the port 3 times before restarting, and that should apply every time, so even after a restart it should reset its failed counter. Once it has failed, it only tries once before restarting again, even if the service comes good in the meantime. Am I missing something here?

Feb  1 05:48:37 testsys monit[5320]: Starting Monit 5.25.1 daemon with http interface at [0.0.0.0]:2812
Feb  1 05:48:37 testsys monit[5322]: 'sm01' Monit 5.25.1 started
Feb  1 05:50:07 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:50:38 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:51:08 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:51:08 testsys monit[5322]: 'mytask' trying to restart
Feb  1 05:51:08 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 05:51:08 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 05:53:09 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:53:09 testsys monit[5322]: 'mytask' trying to restart
Feb  1 05:53:09 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 05:53:09 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 05:55:09 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:55:09 testsys monit[5322]: 'mytask' trying to restart
Feb  1 05:55:09 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 05:55:10 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 05:57:10 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:57:10 testsys monit[5322]: 'mytask' trying to restart
Feb  1 05:57:10 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 05:57:11 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 05:59:11 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 05:59:11 testsys monit[5322]: 'mytask' trying to restart
Feb  1 05:59:11 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 05:59:11 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 06:01:12 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 06:01:12 testsys monit[5322]: 'mytask' trying to restart
Feb  1 06:01:12 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 06:01:12 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'
Feb  1 06:03:13 testsys monit[5322]: 'mytask' connection succeeded to [localhost]:8080 [TCP/IP]
Feb  1 06:08:44 testsys monit[5322]: 'mytask' failed protocol test [DEFAULT] at [localhost]:8080 [TCP/IP] -- Connection refused
Feb  1 06:08:44 testsys monit[5322]: 'mytask' trying to restart
Feb  1 06:08:44 testsys monit[5322]: 'mytask' stop: '/etc/init.d/mytask.sh stop'
Feb  1 06:08:45 testsys monit[5322]: 'mytask' start: '/etc/init.d/mytask.sh start'

I'm using 5.25.1 of monit (but can't select that in the version dropdown) on Yocto Linux (morty). I don't think there's any issue with the general setup though.
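
For reference, the behaviour asked for above (three fresh failures before every restart, including the restarts that follow an earlier one) can be sketched in C roughly as follows. This is only an illustration of the expected semantics: `rule_tracker` and `tracker_record` are hypothetical names, not Monit's actual data structures or functions.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical tracker for an "X times in Y cycles" rule (not Monit's code). */
typedef struct {
    uint64_t history; /* bit i is 1 if the check failed i cycles ago */
    int times;        /* X: failures needed to fire the action */
    int cycles;       /* Y: size of the window in cycles, 1..64 */
} rule_tracker;

/* Record the result of one cycle; return true if the action should fire. */
static bool tracker_record(rule_tracker *t, bool failed) {
    assert(t->cycles >= 1 && t->cycles <= 64);

    /* Shift the history by one cycle and store the newest result in bit 0. */
    t->history = (t->history << 1) | (failed ? 1u : 0u);

    /* Count the failures among the most recent Y cycles. */
    int count = 0;
    for (int i = 0; i < t->cycles; i++)
        count += (int)((t->history >> i) & 1u);

    if (count >= t->times) {
        /* Forget old failures so the next action needs X fresh ones. */
        t->history = 0;
        return true;
    }
    return false;
}
```

With `times = 3` and `cycles = 3`, three consecutive failed cycles fire the action on the third one, and a continuing failure fires it again only after three more failed cycles rather than on the very next one.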

Comments (6)

  1. sebastian

    Exactly the same problem here with a plain monit 5.16 on Ubuntu 16:

    check process process_x
      with pidfile /path/to/pidfile.pid
      alert me@example but not on { pid, ppid, resource }
      group processgroup_1
      start "/bin/true"
      stop "/bin/bash -c 'MYPID=`/bin/cat /path/to/pidfile.pid` && /bin/kill -QUIT $MYPID && ((sleep 1 && test -e /proc/$MYPID && kill -TERM $MYPID) || /bin/true)'"
      if mem is greater than 300.0 MB for 2 cycles then restart
      if does not exist then exec "/bin/true"
    

    It works the first time: it logs two lines before attempting to restart, and then logs one line saying that the process is within the limit again:

    [CEST Apr 12 12:19:30] error    : 'process_x' mem amount of 472.3 MB matches resource limit [mem amount>300 MB]
    [CEST Apr 12 12:20:30] error    : 'process_x' mem amount of 1.2 GB matches resource limit [mem amount>300 MB]
    [CEST Apr 12 12:20:30] info     : 'process_x' trying to restart
    [CEST Apr 12 12:20:30] info     : 'process_x' stop: /bin/bash
    [CEST Apr 12 12:21:32] info     : 'process_x' mem amount check succeeded [current mem amount=120.4 MB]
    

    But after the first time, it logs only one line and restarts immediately, on the first cycle that matches the memory limit:

    [CEST Apr 12 14:34:54] error    : 'process_x' mem amount of 306.9 MB matches resource limit [mem amount>300 MB]
    [CEST Apr 12 14:34:54] info     : 'process_x' trying to restart
    [CEST Apr 12 14:34:54] info     : 'process_x' stop: /bin/bash
    [CEST Apr 12 14:35:55] info     : 'process_x' mem amount check succeeded [current mem amount=140.5 MB]
    

    Any suggestions?

  2. Christian Hack reporter

    I wanted to use monit to restart some services based on CPU and/or memory but really need the x cycles parameter to work properly.

    Any chance of taking a look at this one or at least pointing where to look in the source? If you can give me a nudge in the right direction I may be able to fix it myself.

  3. Christian Hack reporter

    If a trigger occurs but doesn’t clear, it will keep triggering on every cycle until it clears. This is a bit of a problem for us when restarting a process, but it can be partly overcome by using the “timeout” option on the start program. Alerts don’t seem to retrigger for some reason, whereas service restarts do.

    The logging is a bit confusing, particularly when testing. The first time, the lead-up logs are warning level and the final trigger log is error level. Thereafter, the lead-up messages are debug level, so they don’t appear unless you have debug enabled (-v). There aren’t really enough logs to see if that’s your issue, @Sebastian.

    I’m not quite sure how to resolve this, since in some cases you don’t want it to keep triggering and in some you do (although it should only be after y cycles or x times in y cycles).

    I’m pretty certain there is a bug here that will affect any “x times in y cycles” monitoring (where x is 1 for y = 1-8, x is 1-2 for y = 9-16, and so on): on a state change it fills the state_map with one bit set in each byte when it should be zeroing it.

    https://bitbucket.org/tildeslash/monit/src/c8e3ee8b5b7fa27b8bdc888e1a343b937cb58336/src/event.c#lines-163 should read:

                    memset(&(E->state_map), 0, sizeof(E->state_map)); // Restart state map on state change, so we'll not flicker on multiple-failures condition (next state change requires full number of cycles to pass)
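
The effect described in the comment above can be reproduced with a small standalone C sketch. It assumes a 64-bit state map with one bit per cycle (bit 0 being the most recent cycle) and a fill byte of 1, as the comment implies. `fill_state_map` and `failures_in_window` are hypothetical helpers written for this illustration, not functions from Monit's source.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fill every byte of a 64-bit map with fill_byte, as memset() does. */
static uint64_t fill_state_map(int fill_byte) {
    uint64_t map;
    memset(&map, fill_byte, sizeof(map));
    return map;
}

/* Count the failure bits among the most recent `cycles` cycles (bit 0 = newest). */
static int failures_in_window(uint64_t state_map, int cycles) {
    assert(cycles >= 0 && cycles <= 64);
    int count = 0;
    for (int i = 0; i < cycles; i++)
        count += (int)((state_map >> i) & 1u);
    return count;
}
```

Under these assumptions, `fill_state_map(1)` yields `0x0101010101010101`, so `failures_in_window` reports 1 failure for any window of 1 to 8 cycles and 2 failures for 9 to 16 cycles, even though no failure has been recorded since the reset. `fill_state_map(0)` reports 0 for every window size, which is the behaviour the proposed change aims for.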
    