- changed component to 1. Monit
monit does not reset counter once application has successfully restarted
Configuration
check process meryl with pidfile /var/run/meryl.pid
start program = "/bin/bash -c '/bin/sleep 5 && /etc/init.d/meryl start'" with timeout 60 seconds
stop program = "/etc/init.d/meryl stop"
restart program = "/etc/init.d/meryl restart"
# 5% for meryl, meryl has a fixed 50MB + size of data centre so use percentage
#if totalmem > 5% for 5 times within 5 cycles then restart # eating up memory?
if totalmem > 35 MB for 5 cycles then restart # eating up memory?
if totalcpu > 12% for 5 times within 5 cycles then restart # hung process?
group server
This is what monit does
Jun 23 15:19:44 command monit[16201]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=28852kB]
Jun 23 15:22:01 command monit[20793]: 'meryl' total mem amount of 94020kB matches resource limit [total mem amount>35840kB]
Jun 23 15:31:22 command monit[20793]: 'meryl' total mem amount of 44584kB matches resource limit [total mem amount>35840kB]
Jun 23 15:32:26 command monit[20793]: 'meryl' total mem amount of 804332kB matches resource limit [total mem amount>35840kB]
Jun 23 15:33:30 command monit[20793]: 'meryl' total mem amount of 1070312kB matches resource limit [total mem amount>35840kB]
Jun 23 15:34:34 command monit[20793]: 'meryl' total mem amount of 1097624kB matches resource limit [total mem amount>35840kB]
Jun 23 15:35:19 command monit[20793]: 'meryl' total mem amount of 46404kB matches resource limit [total mem amount>35840kB]
Jun 23 15:35:19 command monit[20793]: 'meryl' trying to restart
Jun 23 15:35:19 command monit[20793]: 'meryl' stop: /etc/init.d/meryl
Jun 23 15:35:19 command monit[20793]: 'meryl' start: /bin/bash
Jun 23 15:36:30 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=29488kB]
Jun 23 15:47:30 command monit[20793]: 'meryl' total mem amount of 51128kB matches resource limit [total mem amount>35840kB]
Jun 23 15:47:30 command monit[20793]: 'meryl' trying to restart
Jun 23 15:47:30 command monit[20793]: 'meryl' stop: /etc/init.d/meryl
Jun 23 15:47:30 command monit[20793]: 'meryl' start: /bin/bash
Jun 23 15:48:40 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=24720kB]
Jun 23 15:58:37 command monit[20793]: 'meryl' total mem amount of 923192kB matches resource limit [total mem amount>35840kB]
Jun 23 15:58:37 command monit[20793]: 'meryl' trying to restart
Jun 23 15:58:37 command monit[20793]: 'meryl' stop: /etc/init.d/meryl
Jun 23 15:58:37 command monit[20793]: 'meryl' start: /bin/bash
Jun 23 15:59:47 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=29560kB]
Jun 23 16:40:32 command monit[20793]: 'meryl' total mem amount of 51572kB matches resource limit [total mem amount>35840kB]
Jun 23 16:40:32 command monit[20793]: 'meryl' trying to restart
Jun 23 16:40:32 command monit[20793]: 'meryl' stop: /etc/init.d/meryl
Jun 23 16:40:32 command monit[20793]: 'meryl' start: /bin/bash
Jun 23 16:41:44 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=24752kB]
Jun 23 17:07:45 command monit[20793]: 'meryl' total mem amount of 920836kB matches resource limit [total mem amount>35840kB]
Jun 23 17:07:45 command monit[20793]: 'meryl' trying to restart
Jun 23 17:07:45 command monit[20793]: 'meryl' stop: /etc/init.d/meryl
Jun 23 17:07:45 command monit[20793]: 'meryl' start: /bin/bash
Jun 23 17:08:56 command monit[20793]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=22284kB]
command[CT02]:>
never resets counter on the successful check
Now it is broken it restarts every fail, rather than every 5 fails in a row.
Lets check it is the same for old configuration.
Tried with
- if totalmem > 35 MB for 3 times within 5 cycles then restart
- if totalmem > 35 MB for 5 times within 5 cycles then restart
- if totalmem > 35 MB for 5 cycles then restart
Tried with * 5.5 * 5.6 * 5.8.1 All Ubuntu binaries on a Debian system, which installs and runs without any other issues.
Comments (20)
-
repo owner -
repo owner - changed component to Monit
-
reporter Been a few months now and wondered if we were going to see a fix in 5.8.1 to try out for you....confirm its working as expected.
-
repo owner - changed status to resolved
Hello,
the problem is fixed in the development version.
You can test the development branch if you want to, you can get snapshot here: https://bitbucket.org/tildeslash/monit/get/master.tar.gz
To compile:
tar -xzf master.tar.gz cd tildeslash* ./bootstrap ./configure make
-
reporter Thanks I will do that, so its in 5.8.1 dev, when will you be officially releasing 5.8.1 ?
-
repo owner 5.8.1 is the current production release ... the next (development) release is 5.9, which will be available ca. within 1 week.
-
reporter Hi tildeslash,
Just thought I would download and test 5.9, does not seem to be fixed in my case......still restarts it after 1 failure after the success. I assumed a success test was resetting the count, so that it took 5 more before it restarted but seems to just restart it again.....
if totalmem > 2% for 5 times within 5 cycles then restart # eating up memory? Sep 26 15:07:49 command monit[29724]: 'meryl' total mem amount of 8.1% matches resource limit [total mem amount>2.0%] Sep 26 15:08:51 command monit[29724]: 'meryl' total mem amount of 10.6% matches resource limit [total mem amount>2.0%] Sep 26 15:09:54 command monit[29724]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Sep 26 15:10:56 command monit[29724]: 'meryl' total mem amount of 14.4% matches resource limit [total mem amount>2.0%] Sep 26 15:11:59 command monit[29724]: 'meryl' total mem amount of 14.4% matches resource limit [total mem amount>2.0%] Sep 26 15:11:59 command monit[29724]: 'meryl' trying to restart Sep 26 15:11:59 command monit[29724]: 'meryl' stop: /etc/init.d/meryl Sep 26 15:11:59 command monit[29724]: 'meryl' start: /bin/bash Sep 26 15:13:07 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 7-8 successful checks must have took place here, then a failure , then restart - No, it should be 5 failures then a restart as counter was reset Sep 26 15:25:39 command monit[29724]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%] Sep 26 15:25:39 command monit[29724]: 'meryl' trying to restart Sep 26 15:25:39 command monit[29724]: 'meryl' stop: /etc/init.d/meryl Sep 26 15:25:39 command monit[29724]: 'meryl' start: /bin/bash Sep 26 15:26:47 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.4%] 11-12 successful checks must have took place here, then a failure , then restart - - No, it should be 5 failures and then a restart as counter was reset Sep 26 15:40:20 command monit[29724]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%] Sep 26 15:40:20 command monit[29724]: 'meryl' trying to restart Sep 26 15:40:20 command monit[29724]: 'meryl' stop: /etc/init.d/meryl Sep 26 15:40:21 command monit[29724]: 'meryl' start: /bin/bash Sep 26 15:41:29 command monit[29724]: 'meryl' 'meryl' total mem amount check succeeded [current total mem amount=0.3%]
It looks to me like it is still broken
cheers Gordon
-
repo owner - changed status to open
-
repo owner - changed version to 5.9
-
reporter Hi
Have you a new build with this fix we can test ?
cheers
-
repo owner Hello,
the fix in 5.9 works as expected, but it doesn't log the partial errors in "for-X-cycles" condition in debug mode after first error recovery. We have fixed the debug mode and used the following configuration with test program, which increases its memory allocation every second:
set daemon 5 set httpd port 2812 allow localhost check process test matching "memorytest" start program = "/bin/bash -c '/home/test/memorytest &'" stop program = "/usr/bin/pkill memorytest" if total memory usage > 1 MB for 5 cycles then restart
Debug output (excerpt related to memory usage and triggered restart action):
'test' process is not running 'test' trying to restart 'test' started 'test' process is running with pid 18009 'test' total mem amount check succeeded [current total mem amount=720.0 kB] 'test' total mem amount check succeeded [current total mem amount=720.0 kB] 'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB] 'test' trying to restart 'test' stopped 'test' started 'test' process is running with pid 18013 'test' total mem amount check succeeded [current total mem amount=716.0 kB] 'test' total mem amount check succeeded [current total mem amount=788.0 kB] 'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.1 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.2 MB matches resource limit [total mem amount>1024.0 kB] 'test' total mem amount of 1.3 MB matches resource limit [total mem amount>1024.0 kB] 'test' trying to restart 'test' stopped 'test' started 'test' process is running with pid 18016
The 5.9 works the same, but the debug mode is silent as mentioned.
You can get snapshot of the development version (with debug fix) here: https://bitbucket.org/tildeslash/monit/get/master.tar.gz
To compile:
tar -xzf master.tar.gz cd tildeslash* ./bootstrap ./configure make
Please can you test it and confirm if it works for you?
-
repo owner - changed version to 5.8.1
-
reporter Hi, I have tested the 5.10_devel version and added the "-v" flag to get full verbose option and it now looks like it is working as I expected.
Oct 17 04:14:53 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] Oct 17 04:15:56 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] Oct 17 04:16:59 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:18:01 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:19:04 command monit[25828]: 'meryl' total mem amount of 9.0% matches resource limit [total mem amount>2.0%] Oct 17 04:20:07 command monit[25828]: 'meryl' total mem amount of 8.1% matches resource limit [total mem amount>2.0%] Oct 17 04:21:09 command monit[25828]: 'meryl' total mem amount of 14.5% matches resource limit [total mem amount>2.0%] Oct 17 04:22:12 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Oct 17 04:23:14 command monit[25828]: 'meryl' total mem amount of 13.9% matches resource limit [total mem amount>2.0%] Oct 17 04:23:14 command monit[25828]: 'meryl' trying to restart Oct 17 04:24:23 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.3%] Oct 17 04:25:25 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] ... Oct 17 04:31:41 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:32:43 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:33:46 command monit[25828]: 'meryl' total mem amount of 9.5% matches resource limit [total mem amount>2.0%] Oct 17 04:34:49 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:35:52 command monit[25828]: 'meryl' total mem amount of 15.1% matches resource limit [total mem amount>2.0%] Oct 17 04:36:54 command monit[25828]: 'meryl' total mem amount of 11.4% matches resource limit [total mem amount>2.0%] Oct 17 04:37:57 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 04:38:59 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] ... Oct 17 05:26:57 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 05:28:00 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.5%] Oct 17 05:29:03 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.7%] Oct 17 05:30:05 command monit[25828]: 'meryl' total mem amount of 9.3% matches resource limit [total mem amount>2.0%] Oct 17 05:31:08 command monit[25828]: 'meryl' total mem amount of 14.3% matches resource limit [total mem amount>2.0%] Oct 17 05:32:10 command monit[25828]: 'meryl' total mem amount of 14.0% matches resource limit [total mem amount>2.0%] Oct 17 05:33:13 command monit[25828]: 'meryl' total mem amount of 14.2% matches resource limit [total mem amount>2.0%] Oct 17 05:34:16 command monit[25828]: 'meryl' total mem amount of 14.2% matches resource limit [total mem amount>2.0%] Oct 17 05:34:16 command monit[25828]: 'meryl' trying to restart Oct 17 05:35:24 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] Oct 17 05:36:27 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] Oct 17 05:37:29 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%] ... Oct 17 05:42:42 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 05:43:46 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Oct 17 05:44:49 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Oct 17 05:45:51 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Oct 17 05:46:54 command monit[25828]: 'meryl' total mem amount of 14.1% matches resource limit [total mem amount>2.0%] Oct 17 05:47:56 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.6%] Oct 17 05:48:59 command monit[25828]: 'meryl' total mem amount of 8.8% matches resource limit [total mem amount>2.0%] Oct 17 05:50:01 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=1.3%] Oct 17 05:51:04 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.7%] Oct 17 05:52:07 command monit[25828]: 'meryl' total mem amount check succeeded [current total mem amount=0.4%]
We only saw the need for a few restarts due to excessive load issues, while the others were just expected peak loads.
Thanks for the fix,
When can we expect it to be in an official release and which version will that be ?
I also noticed strange values for memory in the output when not using "-v", sometimes it displayed huge numbers which I knew were incorrect and in debug mode it was at least consistent but seemed to display memory figures ~10MB less than I was seeing using
pmap -x PID | tail -1
.Normal mode (not using -v)
Oct 16 15:36:34 command monit[16867]: 'meryl' total mem amount of 1.1 GB matches resource limit [total mem amount>47.0 MB] Oct 16 15:37:37 command monit[16867]: 'meryl' total mem amount of 272.0 MB matches resource limit [total mem amount>47.0 MB] Oct 16 15:38:39 command monit[16867]: 'meryl' total mem amount of 916.6 MB matches resource limit [total mem amount>47.0 MB] Oct 16 15:39:42 command monit[16867]: 'meryl' total mem amount of 54.3 MB matches resource limit [total mem amount>47.0 MB] Oct 16 15:40:44 command monit[16867]: 'meryl' total mem amount of 108.1 MB matches resource limit [total mem amount>47.0 MB]
Debug mode (using -v)
Oct 16 16:25:37 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] Oct 16 16:26:40 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] Oct 16 16:27:42 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] Oct 16 16:28:45 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB] Oct 16 16:29:48 command monit[13428]: 'meryl' total mem amount of 39.1 MB matches resource limit [total mem amount>37.0 MB]
Normal operation using pmap was between 47MB and 200 MB....so not sure how you get your memory figures in this 5.10_devel version. Do you want me to raise an issue ?
cheers
PS I did try it again with my 5.9 version with the -v option and that still failed to reset counter.
-
repo owner Thanks for update. As mentioned the problem is really fixed in Monit 5.9 already - just the messages were confusing (not logging the partial errors). The fix for partial error logging will be part of 5.10.
Regarding the mentioned memory values - you use total memory test ... in this case Monit sums the memory usage (RSS) by the process and all its children - currently it doesn't deal with shared memory though, so the shared RSS part is counted for each child and the total memory may exceed system memory. This problem will be fixed in the future. If you want to check just the memory used by the process itself (will correspond to pmap output), then use just memory test (not "total memory").
-
repo owner - changed status to resolved
fixed in monit 5.9, error messages for partial errors will be fixed in monit 5.10
-
Hello,
I think I am hitting a similar issue with 5.6,
I have a check like:
check process myproc with pidfile /var/run/myproc.pid start program = "/etc/init.d/myservice restart" stop program = "/etc/init.d/myservice stop" if 4 restarts within 10 cycles then exec "/handle/restart.sh" <-- this also stops and starts monit
Now, if there are 4 restarts, the script is triggered, but then if I manually restart the service, still the tracking seems to be on, and monit immediately invokes the script again
error : 'myproc' service restarted 5 times within 8 cycles(s) - exec info : 'myproc' exec: /handle/restart.sh info : 'myproc' process is running after previous restart timeout (manually recovered?) ... error : 'myproc' service restarted 5 times within 9 cycles(s) - exec info : 'myproc' exec: /handle/restart.sh info : 'myproc' process is running after previous restart timeout (manually recovered?)
Is there any way to reset the restart-count ?
$ monit -V This is Monit version 5.6
-
repo owner @Ani_A please update to a recent Monit version
-
repo owner - removed version
Removing version: 5.8.1 (automated comment)
-
Is it possible that I'm facing this issue with version 5.19 ? I have the following check
#! if failed host 127.0.0.1 port 18120 type udp protocol radius secret testing123 for 3 cycles then alert
It is working just fine but as soon as it is being triggered, the "for 3 cycles" is no longer respected and it is alerting every failed connection.
#! set daemon 2
#! Nov 7 12:12:28 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:28 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:31 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:31 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:44 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:44 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:46 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:46 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:49 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:49 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:12:51 poulicheFencer monit: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP] Nov 7 12:12:51 poulicheFencer monit[22663]: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP] Nov 7 12:13:09 poulicheFencer monit: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:13:09 poulicheFencer monit[22663]: 'packetfence-radiusd' failed protocol test [RADIUS] at [127.0.0.1]:18120 [UDP/IP] -- RADIUS: error receiving response -- Resource temporarily unavailable Nov 7 12:13:11 poulicheFencer monit: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP] Nov 7 12:13:11 poulicheFencer monit[22663]: 'packetfence-radiusd' connection succeeded to [127.0.0.1]:18120 [UDP/IP]
-
It doesn’t seem to work also on 5.25.2 .
When a check executes the action it then doesn’t reset the counter and each next check iteration that fails then again executes the action; not desirable when managing an app that takes some time to initialize.
My workaround is to do a manual restart, then unmonitor to prevent false restarts while it's initializing, then wait for it to initialize and then monitor it. This is run with nohup as otherwise the script gets killed after the stop command:
execute "/usr/bin/nohup /bin/bash -c '(date ; /usr/bin/monit stop eventstore ; sleep 60 ; /usr/bin/monit start eventstore ; sleep 120 ; echo unmonitor eventstore to disable checks while it is initializing ; /usr/bin/monit unmonitor eventstore ; echo waiting for eventstore to initialize and then monitor it again ; while ! curl http://127.0.0.1:2113/info | grep -P master\|slave ; do echo sleeping ; sleep 60 ; done ; /usr/bin/monit monitor eventstore) >> /var/log/monit.eventstore_restart.log' "
- Log in to comment