"if cpu usage (wait) > 80% for 2 cycles then alert" sent "limit succeeded" message only
Evidently, when the condition "cpu usage (wait) > 80%" held for only one cycle, no "limit matched" message was sent, but a "limit succeeded" message was still sent once the wait time had dropped. This is confusing for the recipient of the message.
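To make the expectation concrete, here is a minimal sketch of the behaviour I would expect from a "for N cycles" rule. It is a hypothetical model, not Monit's actual implementation: the names `CycleRule` and `simulate` are invented for illustration, and the only assumption is that a recovery message should be sent only after a failure message went out.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CycleRule:
    """Hypothetical model of 'if <value> > limit for N cycles then alert'."""
    limit: float
    cycles: int
    failing_streak: int = 0   # consecutive cycles above the limit
    alerted: bool = False     # True once a "matched" message went out
    messages: List[str] = field(default_factory=list)

    def observe(self, value: float) -> None:
        if value > self.limit:
            self.failing_streak += 1
            # Only notify once the limit was exceeded N cycles in a row.
            if self.failing_streak >= self.cycles and not self.alerted:
                self.alerted = True
                self.messages.append("matched")
        else:
            # Assumption: recovery is reported only if a failure was reported.
            if self.alerted:
                self.messages.append("succeeded")
            self.failing_streak = 0
            self.alerted = False


def simulate(samples, limit=80.0, cycles=2):
    """Feed one sample per cycle and return the messages that were sent."""
    rule = CycleRule(limit=limit, cycles=cycles)
    for value in samples:
        rule.observe(value)
    return rule.messages
```

With one cycle above the limit, such as `simulate([50.0, 81.2, 68.9])`, this model sends nothing at all. With two consecutive cycles above the limit it sends "matched" followed by "succeeded".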
Comments (6)
-
repo owner -
reporter I have these lines extracted from syslog around the problem:
Jun 3 23:20:39 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]
[...]
Jun 3 23:24:40 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.4 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:26:41 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:28:42 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:30:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.6 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:36:45 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:38:46 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]
During this interval I received two messages:
#1 at 23:20
#2 at 23:34
There were no other messages from monit in syslog. The local mail server logged connections at 23:20:39 and 23:34:44.
-
repo owner There are only two "cpu wait" related messages in the snippet, at 23:32 and 23:34:
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
There are no "cpu wait" error messages at 23:20, only "loadavg" related errors.
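For anyone who wants to repeat this check on a longer log, a small filter along these lines may help. It is only a sketch: the regular expression assumes the syslog line layout seen in the excerpt in this thread, and `events_for` is a made-up helper name, not part of Monit.

```python
import re

# Assumed layout: "<Mon> <day> <HH:MM:SS> <host> monit[<pid>]: '<service>' <message>"
LINE_RE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+monit\[\d+\]:\s+"
    r"'(?P<service>[^']+)'\s+(?P<message>.*)$"
)


def events_for(lines, keyword):
    """Return (timestamp, message) pairs for monit lines mentioning keyword."""
    events = []
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match and keyword in match.group("message"):
            events.append((match.group("stamp"), match.group("message")))
    return events
```

Calling `events_for(log_lines, "cpu wait")` on the excerpt above would list only the 23:32:43 and 23:34:44 entries, which makes it easy to compare them against the timestamps of the delivered mails.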
Please can you send your monit configuration for "check system" and the content of the error messages from 23:20 and 23:34?
-
reporter monitrc:
###############################################################################
## Monit control file
###############################################################################
## Start Monit in the background (run as a daemon):
#
set daemon 120   # check services at 2-minute intervals
set logfile syslog
set idfile /var/lib/monit/monit.id
set statefile /var/lib/monit/monit.state
#...
###############################################################################
## Services
###############################################################################
##
## Check general system resources such as load average, cpu and memory
## usage. Each test specifies a resource, conditions and the action to be
## performed should a test fail.
#
check system v05.local
    if loadavg (1min) > 8 then alert
    if loadavg (5min) > 4 then alert
    if loadavg (15min) > 2 then alert
    if memory usage > 90% for 2 cycles then alert
    if swap usage > 25% for 2 cycles then alert
    if swap usage > 50% then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then alert
    if cpu usage (wait) > 80% for 2 cycles then alert
    group server

-- mails: --

Resource limit matched Service v05.local
    Date: Wed, 03 Jun 2015 23:20:39
    Action: alert
    Host: v05.local
    Description: loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]
Your faithful employee,
Monit

--

Resource limit succeeded Service v05.local
    Date: Wed, 03 Jun 2015 23:34:44
    Action: alert
    Host: v05.local
    Description: cpu wait usage check succeeded [current cpu wait usage=68.9%]
Your faithful employee,
Monit

--
-
repo owner - removed version
Removing version: 5.13 (automated comment)
-
I'm unable to reproduce the issue. Using the following configuration (with alert target and mailserver set - not part of the configuration snip):
I started monit on an idle machine, then raised the CPU usage above the limit and stopped the activity again:
No "succeeded" alert was delivered when the limit was exceeded in only one cycle.
Could you please check the monit logs? It seems the limit may have been exceeded, but the alert message was not delivered (rejected?).