monit does not reset counter once application has successfully restarted

Issue #64 resolved
Gordon Cook
created an issue

Configuration

check process meryl with pidfile /var/run/meryl.pid
  start program = "/bin/bash -c '/bin/sleep 5 && /etc/init.d/meryl start'" with timeout 60 seconds
  stop program = "/etc/init.d/meryl stop"
  restart program = "/etc/init.d/meryl restart"
  # 5% for meryl, meryl has a fixed 50MB + size of data centre so use percentage
  #if totalmem > 5% for 5 times within 5 cycles then restart   # eating up memory?
  if totalmem > 35 MB for 5 cycles then restart   # eating up memory?
  if totalcpu > 12% for 5 times within 5 cycles then restart # hung process?
  group server

This is what monit does

Jun 23 15:19:44 command monit[16201]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=28852kB] 
Jun 23 15:22:01 command monit[20793]: 'meryl' total mem amount of 94020kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:31:22 command monit[20793]: 'meryl' total mem amount of 44584kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:32:26 command monit[20793]: 'meryl' total mem amount of 804332kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:33:30 command monit[20793]: 'meryl' total mem amount of 1070312kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:34:34 command monit[20793]: 'meryl' total mem amount of 1097624kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:35:19 command monit[20793]: 'meryl' total mem amount of 46404kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:35:19 command monit[20793]: 'meryl' trying to restart 
Jun 23 15:35:19 command monit[20793]: 'meryl' stop: /etc/init.d/meryl 
Jun 23 15:35:19 command monit[20793]: 'meryl' start: /bin/bash 
Jun 23 15:36:30 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=29488kB] 
Jun 23 15:47:30 command monit[20793]: 'meryl' total mem amount of 51128kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:47:30 command monit[20793]: 'meryl' trying to restart 
Jun 23 15:47:30 command monit[20793]: 'meryl' stop: /etc/init.d/meryl 
Jun 23 15:47:30 command monit[20793]: 'meryl' start: /bin/bash 
Jun 23 15:48:40 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=24720kB] 
Jun 23 15:58:37 command monit[20793]: 'meryl' total mem amount of 923192kB matches resource limit [total mem amount>35840kB] 
Jun 23 15:58:37 command monit[20793]: 'meryl' trying to restart 
Jun 23 15:58:37 command monit[20793]: 'meryl' stop: /etc/init.d/meryl 
Jun 23 15:58:37 command monit[20793]: 'meryl' start: /bin/bash 
Jun 23 15:59:47 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=29560kB] 
Jun 23 16:40:32 command monit[20793]: 'meryl' total mem amount of 51572kB matches resource limit [total mem amount>35840kB] 
Jun 23 16:40:32 command monit[20793]: 'meryl' trying to restart 
Jun 23 16:40:32 command monit[20793]: 'meryl' stop: /etc/init.d/meryl 
Jun 23 16:40:32 command monit[20793]: 'meryl' start: /bin/bash 
Jun 23 16:41:44 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=24752kB] 
Jun 23 17:07:45 command monit[20793]: 'meryl' total mem amount of 920836kB matches resource limit [total mem amount>35840kB] 
Jun 23 17:07:45 command monit[20793]: 'meryl' trying to restart 
Jun 23 17:07:45 command monit[20793]: 'meryl' stop: /etc/init.d/meryl 
Jun 23 17:07:45 command monit[20793]: 'meryl' start: /bin/bash 
Jun 23 17:08:56 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=22284kB] 
command[CT02]:> 

never resets counter on the successful check

Now it is broken it restarts every fail, rather than every 5 fails in a row.

Lets check it is the same for old configuration.

Tried with

  • if totalmem > 35 MB for 3 times within 5 cycles then restart
  • if totalmem > 35 MB for 5 times within 5 cycles then restart
  • if totalmem > 35 MB for 5 cycles then restart

Tried with 5.5 5.6 * 5.8.1 All Ubuntu binaries on a Debian system, which installs and runs without any other issues.

Comments (19)

  1. Gordon Cook reporter

    Hi tildeslash,

    Just thought I would download and test 5.9, does not seem to be fixed in my case......still restarts it after 1 failure after the success. I assumed a success test was resetting the count, so that it took 5 more before it restarted but seems to just restart it again.....

      if totalmem > 2% for 5 times within 5 cycles then restart   # eating up memory?
    
    
    Sep 26 15:07:49 command monit[29724]: 'meryl' total mem amount of 8.1% matches resource limit [total mem amount>2.0%]
    Sep 26 15:08:51 command monit[29724]: 'meryl' total mem amount of 10.6% matches resource limit [total mem amount>2.0%]
    Sep 26 15:09:54 command monit[29724]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%]
    Sep 26 15:10:56 command monit[29724]: 'meryl' total mem amount of 14.4% matches resource limit [total mem amount>2.0%]
    Sep 26 15:11:59 command monit[29724]: 'meryl' total mem amount of 14.4% matches resource limit [total mem amount>2.0%]
    Sep 26 15:11:59 command monit[29724]: 'meryl' trying to restart
    Sep 26 15:11:59 command monit[29724]: 'meryl' stop: /etc/init.d/meryl
    Sep 26 15:11:59 command monit[29724]: 'meryl' start: /bin/bash
    Sep 26 15:13:07 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.4%]
    
    7-8 successful checks must have took place here, then a failure , then restart - No, it should be 5 failures then a restart as counter was reset
    
    Sep 26 15:25:39 command monit[29724]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%]
    Sep 26 15:25:39 command monit[29724]: 'meryl' trying to restart
    Sep 26 15:25:39 command monit[29724]: 'meryl' stop: /etc/init.d/meryl
    Sep 26 15:25:39 command monit[29724]: 'meryl' start: /bin/bash
    Sep 26 15:26:47 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.4%]
    
    11-12 successful checks must have took place here, then a failure , then restart - - No, it should be 5 failures and then a restart  as counter was reset 
    
    Sep 26 15:40:20 command monit[29724]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%]
    Sep 26 15:40:20 command monit[29724]: 'meryl' trying to restart
    Sep 26 15:40:20 command monit[29724]: 'meryl' stop: /etc/init.d/meryl
    Sep 26 15:40:21 command monit[29724]: 'meryl' start: /bin/bash
    Sep 26 15:41:29 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.3%]
    

    It looks to me like it is still broken

    cheers Gordon

  2. Tildeslash repo owner

    Hello,

    the fix in 5.9 works as expected, but it doesn't log the partial errors in "for-X-cycles" condition in debug mode after first error recovery. We have fixed the debug mode and used the following configuration with test program, which increases its memory allocation every second:

    set daemon 5
    set httpd port 2812 allow localhost
    
    check process test matching "memorytest"
            start program = "/bin/bash -c '/home/test/memorytest &'"
            stop program = "/usr/bin/pkill memorytest"
            if total memory usage > 1 MB for 5 cycles then restart
    

    Debug output (excerpt related to memory usage and triggered restart action):

    'test' process is not running
    'test' trying to restart
    'test' started
    'test' process is running with pid 18009
    'test' total mem amount check succeeded [current total mem amount=720.0 kB]
    'test' total mem amount check succeeded [current total mem amount=720.0 kB]
    'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB]
    'test' trying to restart
    'test' stopped
    'test' started
    'test' process is running with pid 18013
    'test' total mem amount check succeeded [current total mem amount=716.0 kB]
    'test' total mem amount check succeeded [current total mem amount=788.0 kB]
    'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB]
    'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB]
    'test' trying to restart
    'test' stopped
    'test' started
    'test' process is running with pid 18016
    

    The 5.9 works the same, but the debug mode is silent as mentioned.

    You can get snapshot of the development version (with debug fix) here: https://bitbucket.org/tildeslash/monit/get/master.tar.gz

    To compile:

    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    

    Please can you test it and confirm if it works for you?

  3. Gordon Cook reporter

    Hi, I have tested the 5.10_devel version and added the "-v" flag to get full verbose option and it now looks like it is working as I expected.

    Oct 17 04:14:53 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] 
    Oct 17 04:15:56 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] 
    Oct 17 04:16:59 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:18:01 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:19:04 command monit[25828]: 'meryl' total mem amount of 9.0% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:20:07 command monit[25828]: 'meryl' total mem amount of 8.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:21:09 command monit[25828]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:22:12 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:23:14 command monit[25828]: 'meryl' total mem amount of 13.9% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:23:14 command monit[25828]: 'meryl' trying to restart 
    Oct 17 04:24:23 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.3%] 
    Oct 17 04:25:25 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 
    ...
    Oct 17 04:31:41 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:32:43 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:33:46 command monit[25828]: 'meryl' total mem amount of 9.5% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:34:49 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:35:52 command monit[25828]: 'meryl' total mem amount of 15.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:36:54 command monit[25828]: 'meryl' total mem amount of 11.4% matches resource limit [total mem amount>2.0%] 
    Oct 17 04:37:57 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 04:38:59 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    ...
    Oct 17 05:26:57 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 05:28:00 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] 
    Oct 17 05:29:03 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.7%] 
    Oct 17 05:30:05 command monit[25828]: 'meryl' total mem amount of 9.3% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:31:08 command monit[25828]: 'meryl' total mem amount of 14.3% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:32:10 command monit[25828]: 'meryl' total mem amount of 14.0% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:33:13 command monit[25828]: 'meryl' total mem amount of 14.2% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:34:16 command monit[25828]: 'meryl' total mem amount of 14.2% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:34:16 command monit[25828]: 'meryl' trying to restart 
    Oct 17 05:35:24 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 
    Oct 17 05:36:27 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 
    Oct 17 05:37:29 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 
    ...
    Oct 17 05:42:42 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 05:43:46 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:44:49 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:45:51 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:46:54 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:47:56 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] 
    Oct 17 05:48:59 command monit[25828]: 'meryl' total mem amount of 8.8% matches resource limit [total mem amount>2.0%] 
    Oct 17 05:50:01 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=1.3%] 
    Oct 17 05:51:04 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.7%] 
    Oct 17 05:52:07 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 
    

    We only saw the need for a few restarts due to excessive load issues, while the others were just expected peak loads.

    Thanks for the fix,

    When can we expect it to be in an official release and which version will that be ?

    I also noticed strange values for memory in the output when not using "-v", sometimes it displayed huge numbers which I knew were incorrect and in debug mode it was at least consistent but seemed to display memory figures ~10MB less than I was seeing using pmap -x PID | tail -1.

    Normal mode (not using -v)

    Oct 16 15:36:34 command monit[16867]: 'meryl' total mem amount of 1.1 GB matches resource limit [total mem amount>47.0 MB] 
    Oct 16 15:37:37 command monit[16867]: 'meryl' total mem amount of 272.0 MB matches resource limit [total mem amount>47.0 MB] 
    Oct 16 15:38:39 command monit[16867]: 'meryl' total mem amount of 916.6 MB matches resource limit [total mem amount>47.0 MB] 
    Oct 16 15:39:42 command monit[16867]: 'meryl' total mem amount of 54.3 MB matches resource limit [total mem amount>47.0 MB] 
    Oct 16 15:40:44 command monit[16867]: 'meryl' total mem amount of 108.1 MB matches resource limit [total mem amount>47.0 MB] 
    

    Debug mode (using -v)

    Oct 16 16:25:37 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] 
    Oct 16 16:26:40 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] 
    Oct 16 16:27:42 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] 
    Oct 16 16:28:45 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] 
    Oct 16 16:29:48 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] 
    

    Normal operation using pmap was between 47MB and 200 MB....so not sure how you get your memory figures in this 5.10_devel version. Do you want me to raise an issue ?

    cheers

    PS I did try it again with my 5.9 version with the -v option and that still failed to reset counter.

  4. Tildeslash repo owner

    Thanks for update. As mentioned the problem is really fixed in Monit 5.9 already - just the messages were confusing (not logging the partial errors). The fix for partial error logging will be part of 5.10.

    Regarding the mentioned memory values - you use total memory test ... in this case Monit sums the memory usage (RSS) by the process and all its children - currently it doesn't deal with shared memory though, so the shared RSS part is counted for each child and the total memory may exceed system memory. This problem will be fixed in the future. If you want to check just the memory used by the process itself (will correspond to pmap output), then use just memory test (not "total memory").

  5. Aniruddha. A

    Hello,

    I think I am hitting a similar issue with 5.6,

    I have a check like:

    check process myproc with pidfile /var/run/myproc.pid
       start program = "/etc/init.d/myservice restart"
       stop program  = "/etc/init.d/myservice stop"
       if 4 restarts within 10 cycles then exec "/handle/restart.sh"  <-- this also stops and starts monit
    

    Now, if there are 4 restarts, the script is triggered, but then if I manually restart the service, still the tracking seems to be on, and monit immediately invokes the script again

    error    : 'myproc' service restarted 5 times within 8 cycles(s) - exec
    info     : 'myproc' exec: /handle/restart.sh
    info     : 'myproc' process is running after previous restart timeout (manually recovered?)
    ...
    error    : 'myproc' service restarted 5 times within 9 cycles(s) - exec
    info     : 'myproc' exec: /handle/restart.sh
    info     : 'myproc' process is running after previous restart timeout (manually recovered?)
    

    Is there any way to reset the restart-count ?

    $ monit -V
    This is Monit version 5.6
    
  6. Derek Wuelfrath

    Is it possible that I'm facing this issue with version 5.19 ? I have the following check

    if failed host 127.0.0.1 port 18120 type udp protocol radius secret testing123 for 3 cycles then alert
    

    It is working just fine but as soon as it is being triggered, the "for 3 cycles" is no longer respected and it is alerting every failed connection.

    set daemon 2
    
    Nov  7 12:12:28 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:28 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:31 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:31 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:44 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:44 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:46 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:46 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:49 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:49 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:12:51 poulicheFencer monit: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP]
    Nov  7 12:12:51 poulicheFencer monit[22663]: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP]
    Nov  7 12:13:09 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:13:09 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable
    Nov  7 12:13:11 poulicheFencer monit: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP]
    Nov  7 12:13:11 poulicheFencer monit[22663]: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP]
    
  7. Log in to comment