tildeslash / Monit / issues / #210 - "if cpu usage (wait) > 80% for 2 cycles then alert" sent "limit succeeded" message only


Ulrich Windl created an issue 2015-06-05

Apparently, when the condition "cpu usage (wait) > 80%" holds for only one cycle, no "limit matched" message is sent, but a "limit succeeded" message is sent once the wait time drops again. For the recipient of the message this is somewhat confusing.
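To make the expected semantics concrete, here is a minimal Python sketch of how a "for N cycles" rule would behave if a recovery were only reported after a reported failure. The class `CycleCondition` and the helper `events` are hypothetical names; this models the behaviour the report expects and is not taken from Monit's own C implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CycleCondition:
    """Hypothetical model of an "if <test> for N cycles then alert" rule.

    Assumption: this encodes the behaviour the report expects; it is not
    Monit's actual event handling.
    """
    cycles: int            # N in "for N cycles"
    consecutive: int = 0   # length of the current run of failing cycles
    alerted: bool = False  # True once a "matched" event has been sent

    def check(self, failed: bool) -> Optional[str]:
        """Feed the result of one monitoring cycle; return the event to send."""
        if failed:
            self.consecutive += 1
            if self.consecutive >= self.cycles and not self.alerted:
                self.alerted = True
                return "matched"
            return None
        self.consecutive = 0
        if self.alerted:
            # Only report recovery for a failure that was actually reported.
            self.alerted = False
            return "succeeded"
        return None


def events(values: List[float], limit: float, cycles: int) -> List[Optional[str]]:
    """Run a series of readings through one rule of the form 'value > limit'."""
    rule = CycleCondition(cycles)
    return [rule.check(v > limit) for v in values]


# One cycle above 80% followed by recovery (values from the reporter's syslog)
print(events([81.2, 68.9], limit=80.0, cycles=2))  # [None, None]
```

Under this model a single cycle above the limit produces neither event; the report is that Monit still sends the "succeeded" half.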

Comments (6)

Tildeslash repo owner
I'm unable to reproduce the issue. I used the following configuration (with alert target and mailserver set, not part of the configuration snippet):
```
check system $HOST
    if cpu usage > 2% for 2 cycles then alert
```
I started Monit on an idle machine, then raised the CPU usage above the limit and stopped the activity again:
```
'trilobite' cpu usage check succeeded [current cpu usage=0.4%]
'trilobite' cpu usage check succeeded [current cpu usage=1.6%]
'trilobite' cpu usage of 2.2% matches resource limit [cpu usage<2.0%]
'trilobite.local' cpu usage check succeeded [current cpu usage=0.6%]
'trilobite.local' cpu usage check succeeded [current cpu usage=0.5%]
'trilobite.local' cpu usage check succeeded [current cpu usage=0.4%]
```
No "succeeded" alert was delivered when the limit was exceeded for only one cycle.

Could you please check the Monit logs? It seems the limit may have been exceeded, but the alert message was not delivered (rejected?).
- 2015-06-06T11:41:12+00:00
Ulrich Windl reporter
I extracted these lines from syslog around the time of the problem:

```
Jun 3 23:20:39 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]
[...]
Jun 3 23:24:40 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.4 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:26:41 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:28:42 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:30:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.6 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:36:45 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:38:46 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]
```

During this interval I received two messages, #1 at 23:20 and #2 at 23:34. There were no other messages from Monit in syslog. The local mail server logged connections at 23:20:39 and 23:34:44.
- 2015-06-08T09:00:12+00:00
Tildeslash repo owner
There are only two "cpu wait" related messages in the snippet, at 23:32 and 23:34:
```
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%] 
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
```
There are no "cpu wait" error messages at 23:20, only "loadavg" related errors.
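The point can also be checked mechanically: with `for 2 cycles`, a "matched" alert needs at least two consecutive cycles above 80%. A minimal Python sketch, assuming the syslog line format quoted in this thread; the regular expression and the helpers `cpu_wait_readings` and `longest_run_above` are hypothetical and written only for these lines. It also assumes consecutive cpu wait log lines correspond to consecutive cycles, which holds here because 23:32:43 and 23:34:44 are one 120-second daemon interval apart.

```python
import re

# Hypothetical pattern for the Monit syslog lines quoted in this issue; it
# captures the reading from both the "matches resource limit" and the
# "check succeeded" variants of the cpu wait message.
CPU_WAIT_RE = re.compile(
    r"^(?P<time>\w{3} +\d+ [\d:]+) .*? cpu wait usage"
    r"(?: of (?P<bad>[\d.]+)%"
    r"| check succeeded \[current cpu wait usage=(?P<ok>[\d.]+)%\])"
)

# Lines copied from the reporter's syslog excerpt
LOG = [
    "Jun 3 23:30:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.6 matches resource limit [loadavg(15min)>2.0]",
    "Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]",
    "Jun 3 23:32:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]",
    "Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]",
    "Jun 3 23:34:44 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]",
]


def cpu_wait_readings(lines):
    """Return (timestamp, percent) for every cpu wait line in the log."""
    readings = []
    for line in lines:
        m = CPU_WAIT_RE.match(line)
        if m:
            value = m.group("bad") or m.group("ok")
            readings.append((m.group("time"), float(value)))
    return readings


def longest_run_above(readings, limit):
    """Length of the longest run of consecutive readings above limit."""
    best = run = 0
    for _, value in readings:
        run = run + 1 if value > limit else 0
        best = max(best, run)
    return best


readings = cpu_wait_readings(LOG)
print(readings)  # [('Jun 3 23:32:43', 81.2), ('Jun 3 23:34:44', 68.9)]
print(longest_run_above(readings, 80.0))  # 1
```

A longest run of 1 means the `for 2 cycles` condition was never satisfied, so no "matched" alert was due for cpu wait.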

Could you please send your Monit configuration for "check system" and the content of the alert messages from 23:20 and 23:34?
- 2015-06-10T09:50:43+00:00
Ulrich Windl reporter
monitrc:

```
###############################################################################
## Monit control file
###############################################################################
## Start Monit in the background (run as a daemon):
#
set daemon 120            # check services at 2-minute intervals
set logfile syslog
set idfile /var/lib/monit/monit.id
set statefile /var/lib/monit/monit.state
#...
###############################################################################
## Services
###############################################################################
##
## Check general system resources such as load average, cpu and memory
## usage. Each test specifies a resource, conditions and the action to be
## performed should a test fail.
#
check system v05.local
    if loadavg (1min) > 8 then alert
    if loadavg (5min) > 4 then alert
    if loadavg (15min) > 2 then alert
    if memory usage > 90% for 2 cycles then alert
    if swap usage > 25% for 2 cycles then alert
    if swap usage > 50% then alert
    if cpu usage (user) > 90% for 30 cycles then alert
    if cpu usage (system) > 20% for 2 cycles then alert
    if cpu usage (wait) > 80% for 2 cycles then alert
    group server
```

Mails:

```
Resource limit matched Service v05.local

Date:        Wed, 03 Jun 2015 23:20:39
Action:      alert
Host:        v05.local
Description: loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]

Your faithful employee,
Monit
```

```
Resource limit succeeded Service v05.local

Date:        Wed, 03 Jun 2015 23:34:44
Action:      alert
Host:        v05.local
Description: cpu wait usage check succeeded [current cpu wait usage=68.9%]

Your faithful employee,
Monit
```
- 2015-06-10T10:50:11+00:00
Tildeslash repo owner
- assigned issue to Tildeslash
- 2015-10-22T11:12:32+00:00
Tildeslash repo owner
- removed version
Removing version: 5.13 (automated comment)
- 2016-06-19T18:47:46+00:00

Assignee: Tildeslash

Type: bug

Priority: minor

Status: new

Component: Monit

Version: –

Votes: 0

Watchers: 2
