Monit resilience testing failed

Issue #788 new
Prem Kumar created an issue

I am running a resilience test between Monit and M/Monit by disconnecting the network between the two processes.

I have configured an event queue directory and am validating both my resource metrics (CPU, memory, etc.) and alarm events.

During this testing, I observed that no other events were stored in my queue directory.

I did not receive the resource metrics or the event details on the M/Monit side.

How should these cases be handled? Please advise.

Below are my configuration details,

set daemon 30
set log syslog
set mmonit http://monit:monit@X.X.X.X:9096/collector
set eventqueue basedir /home/cogniz/monit-5.25.2/queue/
set httpd port 2812 and
    use address X.X.X.X
    allow X.X.X.X
    allow admin:monit

check system X.X.X.X
    if loadavg (1min) > 40 then alert
    if loadavg (5min) > 30 then alert
    if cpu usage > 50% for 1 cycles then alert
    if memory usage > 75% then alert
    if swap usage > 25% then alert

check filesystem root with path /
    if space usage > 70% then alert

check filesystem home with path /home
    if space usage > 70% then alert

check network public with interface eth0
    if failed link then alert
    if changed link then alert
    if saturation > 90% then alert
    if download > 10 MB/s then alert
    if upload > 10000 packets/s then alert
    if total uploaded > 9999990 GB in last hour then alert

check process rsyslog with pidfile /home/cogniz/rsyslog/syslog.pid
    if cpu > 10% for 1 cycles then alert
    if cpu > 10% for 5 cycles then alert
    if totalmem > 200.0 MB for 5 cycles then alert
    if children > 250 then alert
    if loadavg(5min) greater than 10 for 8 cycles then alert
    if disk read > 500 kb/s for 10 cycles then alert
    if disk write > 500 kb/s for 10 cycles then alert

Logs:

[IST Nov  2 13:04:39] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:04:39] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:04:40] error    : 'rsyslog' process is not running
[IST Nov  2 13:04:40] info     : 'rsyslog' trying to restart
[IST Nov  2 13:05:09] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:05:09] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:05:10] error    : 'rsyslog' process is not running
[IST Nov  2 13:05:10] info     : 'rsyslog' trying to restart
[IST Nov  2 13:05:39] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:05:39] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:05:40] error    : 'rsyslog' process is not running
[IST Nov  2 13:05:40] info     : 'rsyslog' trying to restart
[IST Nov  2 13:06:09] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:06:09] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:06:10] error    : 'rsyslog' process is not running
[IST Nov  2 13:06:10] info     : 'rsyslog' trying to restart
[IST Nov  2 13:06:39] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:06:39] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:06:40] error    : 'rsyslog' process is not running
[IST Nov  2 13:06:40] info     : 'rsyslog' trying to restart
[IST Nov  2 13:07:09] error    : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov  2 13:07:09] error    : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov  2 13:07:10] error    : 'rsyslog' process is not running
[IST Nov  2 13:07:10] info     : 'rsyslog' trying to restart

Comments (12)

  1. Prem Kumar reporter

    Please find the attached debug log for your reference.

    I can see the failure event information on the M/Monit side, but I cannot see my system metrics (CPU, memory, disk, etc.) there.

  2. Tildeslash repo owner

    Hello Kumar, thank you for the data.

    This looks like a misunderstanding of the event queue: it spools and retries events only, and it does not buffer the general service statistics while M/Monit is unreachable. When M/Monit becomes available again, Monit sends all queued events, so the error state transitions are not lost, but there will be a gap in the M/Monit charts because statistics for that period are not available.

    The log shows that the queued event was delivered correctly (works as expected):

    [IST Nov 17 23:52:20] debug    : Processing postponed events queue
    [IST Nov 17 23:52:20] debug    : Processing queued event '/home/cogniz/monit-5.25.2/queue//1542478549_1e5d660'
    [IST Nov 17 23:52:20] debug    : M/Monit: event message sent to http://[172.16.23.14]:9096/collector
    [IST Nov 17 23:52:20] debug    : Removing queued event /home/cogniz/monit-5.25.2/queue//1542478549_1e5d660
    

    A statistics queue is not currently available, so I will switch this issue's type to feature request (it may be implemented in the future).
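
    For reference, a minimal sketch of the event queue setting discussed above, written in the multi-line form. The base directory is the one from the reporter's configuration; the `slots` value of 100 is an assumed illustration and does not come from this thread. The setting only controls where events are spooled and how many may be queued; it does not make Monit buffer statistics.

    ```
    set eventqueue
        basedir /home/cogniz/monit-5.25.2/queue/   # path taken from the reporter's configuration
        slots 100                                  # assumed value: optional cap on the number of queued events
    ```

    Without `slots`, the queue size is not limited by this setting, so a long outage can fill the directory with event files.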

  3. Prem Kumar reporter

    Thanks for the update, Tildeslash.

    This feature should be available by default; otherwise we will lose the system metric trends.

  4. Prem Kumar reporter

    Hi Tildeslash,

    Is there a timeline for this feature? I ask because this system is failing resilience testing.

  5. Tildeslash repo owner

    Storing statistics data on the Monit host until the connection to M/Monit succeeds is a good suggestion for avoiding gaps in M/Monit charts. Still, this is a nice-to-have feature and not really critical. It would handle the situation where the network connection is down for longer than a minute (M/Monit's chart granularity is 1 minute). If the host is down or Monit is down, there will still be data gaps. This feature is not going to be prioritised, but we will definitely put it on our TODO list.
