"if cpu usage (wait) > 80% for 2 cycles then alert" sent "limit succeeded" message only

Issue #210 new
Ulrich Windl created an issue

Apparently, when the condition "cpu usage (wait) > 80%" holds for only one cycle, no "limit matched" message is sent (as expected), but a "limit succeeded" message is still sent once the wait time has dropped again. This is a little confusing for the recipient of the message.
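To make the expectation concrete, here is a minimal sketch of the behavior I would expect from a "for N cycles" rule: a "succeeded" notification is only sent if a "matched" alert went out before. This is a hypothetical model, not Monit's implementation; the names `CycleRule`, `observe`, and `notifications` are made up for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class CycleRule:
    """Hypothetical model of 'if X > limit for N cycles then alert'."""
    limit: float
    cycles: int
    consecutive: int = 0   # cycles in a row above the limit
    alerted: bool = False  # True once a "matched" alert was sent

    def observe(self, value: float) -> Optional[str]:
        """Return the notification to send for this cycle, or None."""
        if value > self.limit:
            self.consecutive += 1
            if self.consecutive >= self.cycles and not self.alerted:
                self.alerted = True
                return "matched"
            return None
        # Value is back under the limit: reset the streak.
        self.consecutive = 0
        if self.alerted:
            self.alerted = False
            return "succeeded"  # recovery is reported only after a real alert
        return None


def notifications(rule: CycleRule, samples: List[float]) -> List[Optional[str]]:
    """Feed one sample per cycle and collect the notifications."""
    return [rule.observe(v) for v in samples]
```

With a limit of 80 and 2 cycles, a single spike such as 81.2 followed by 68.9 would produce no notification at all in this model, while two consecutive cycles above the limit would produce "matched" and, after the drop, "succeeded".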

Comments (6)

  1. Tildeslash repo owner

    I'm unable to reproduce the issue. Using the following configuration (with alert target and mailserver set - not part of the configuration snip):

    check system $HOST
        if cpu usage > 2% for 2 cycles then alert
    

    I started monit on an idle machine, then raised the cpu usage above the limit and stopped the activity again:

    'trilobite' cpu usage check succeeded [current cpu usage=0.4%]
    'trilobite' cpu usage check succeeded [current cpu usage=1.6%]
    'trilobite' cpu usage of 2.2% matches resource limit [cpu usage<2.0%]
    'trilobite.local' cpu usage check succeeded [current cpu usage=0.6%]
    'trilobite.local' cpu usage check succeeded [current cpu usage=0.5%]
    'trilobite.local' cpu usage check succeeded [current cpu usage=0.4%]
    

    No "succeeded" alert was delivered if the limit was exceeded only in one cycle.

    Can you please check the monit logs? It seems the limit may have been exceeded, but the alert message was not delivered (rejected?).

  2. Ulrich Windl reporter

    I extracted these lines from syslog around the time of the problem:

    Jun 3 23:20:39 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]
    [...]
    Jun 3 23:24:40 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.4 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:26:41 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:28:42 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.5 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:30:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.6 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]
    Jun 3 23:32:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
    Jun 3 23:34:44 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:36:45 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]
    Jun 3 23:38:46 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.8 matches resource limit [loadavg(15min)>2.0]

    During this interval I received two messages: #1 at 23:20 and #2 at 23:34. There were no other messages from monit in syslog. The local mail server logged connections at 23:20:39 and 23:34:44.

  3. Tildeslash repo owner

    There are only two "cpu wait" related messages in the snippet, at 23:32 and 23:34:

    Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%] 
    Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
    

    There are no "cpu wait" error messages at 23:20, just "loadavg" related errors.

    Can you please send your monit configuration for "check system" and the content of the alert messages from 23:20 and 23:34?
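For reference, the filtering above can be reproduced with a short script. This is only a sketch: the sample lines are copied from the syslog excerpt in this thread, and the regular expression `LINE` and the helper `events_for` are hypothetical names that assume the syslog line layout shown there.

```python
import re

# Sample lines copied from the syslog excerpt posted in this thread.
LOG = """\
Jun 3 23:30:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.6 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' cpu wait usage of 81.2% matches resource limit [cpu wait usage>80.0%]
Jun 3 23:32:43 v05 monit[20210]: 'v05.local' loadavg(15min) of 2.7 matches resource limit [loadavg(15min)>2.0]
Jun 3 23:34:44 v05 monit[20210]: 'v05.local' cpu wait usage check succeeded [current cpu wait usage=68.9%]
"""

# Assumed layout: "<Mon> <day> <hh:mm:ss> <host> monit[<pid>]: <message>"
LINE = re.compile(
    r"^(?P<stamp>\w{3}\s+\d+\s+[\d:]{8})\s+\S+\s+monit\[\d+\]:\s+(?P<msg>.*)$"
)


def events_for(log_text, keyword):
    """Return (timestamp, message) pairs for monit lines mentioning keyword."""
    events = []
    for line in log_text.splitlines():
        match = LINE.match(line.strip())
        if match and keyword in match.group("msg"):
            events.append((match.group("stamp"), match.group("msg")))
    return events


if __name__ == "__main__":
    for stamp, msg in events_for(LOG, "cpu wait"):
        print(stamp, msg)
```

Running it against the full excerpt should list only the 23:32:43 and 23:34:44 entries for "cpu wait", which is what makes the single "succeeded" mail stand out.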

  4. Ulrich Windl reporter

    monitrc:

    ###############################################################################
    ## Monit control file
    ###############################################################################
    ## Start Monit in the background (run as a daemon):
    #
    set daemon 120   # check services at 2-minute intervals
    set logfile syslog
    set idfile /var/lib/monit/monit.id
    set statefile /var/lib/monit/monit.state
    #...
    ###############################################################################
    ## Services
    ###############################################################################
    ##
    ## Check general system resources such as load average, cpu and memory
    ## usage. Each test specifies a resource, conditions and the action to be
    ## performed should a test fail.
    #
    check system v05.local
        if loadavg (1min) > 8 then alert
        if loadavg (5min) > 4 then alert
        if loadavg (15min) > 2 then alert
        if memory usage > 90% for 2 cycles then alert
        if swap usage > 25% for 2 cycles then alert
        if swap usage > 50% then alert
        if cpu usage (user) > 90% for 30 cycles then alert
        if cpu usage (system) > 20% for 2 cycles then alert
        if cpu usage (wait) > 80% for 2 cycles then alert
        group server

    --

    mails:

    --

    Resource limit matched Service v05.local

    Date:        Wed, 03 Jun 2015 23:20:39
    Action:      alert
    Host:        v05.local
    Description: loadavg(15min) of 2.2 matches resource limit [loadavg(15min)>2.0]
    

    Your faithful employee,
    Monit

    --

    Resource limit succeeded Service v05.local

    Date:        Wed, 03 Jun 2015 23:34:44
    Action:      alert
    Host:        v05.local
    Description: cpu wait usage check succeeded [current cpu wait usage=68.9%]
    

    Your faithful employee,
    Monit

    --
