Monit resilience testing failed
I am running a resilience test between Monit and M/Monit by disconnecting the network between the two processes.
I have configured an event queue directory and am validating my resource metrics (CPU, memory, etc.) and alarm events.
During this test, I observed that no other events were stored in my queue directory.
I did not receive the resource metrics or event details on the M/Monit side.
How should these cases be handled? Please advise.
My configuration details are below:
set daemon 30
set log syslog
set mmonit http://monit:monit@X.X.X.X:9096/collector
set eventqueue basedir /home/cogniz/monit-5.25.2/queue/
set httpd port 2812 and
    use address X.X.X.X
    allow X.X.X.X
    allow admin:monit

check system X.X.X.X
    if loadavg (1min) > 40 then alert
    if loadavg (5min) > 30 then alert
    if cpu usage > 50% for 1 cycles then alert
    if memory usage > 75% then alert
    if swap usage > 25% then alert

check filesystem root with path /
    if space usage > 70% then alert

check filesystem home with path /home
    if space usage > 70% then alert

check network public with interface eth0
    if failed link then alert
    if changed link then alert
    if saturation > 90% then alert
    if download > 10 MB/s then alert
    if upload > 10000 packets/s then alert
    if total uploaded > 9999990 GB in last hour then alert

check process rsyslog with pidfile /home/cogniz/rsyslog/syslog.pid
    if cpu > 10% for 1 cycles then alert
    if cpu > 10% for 5 cycles then alert
    if totalmem > 200.0 MB for 5 cycles then alert
    if children > 250 then alert
    if loadavg(5min) greater than 10 for 8 cycles then alert
    if disk read > 500 kb/s for 10 cycles then alert
    if disk write > 500 kb/s for 10 cycles then alert

Logs:
[IST Nov 2 13:04:39] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:04:39] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:04:40] error : 'rsyslog' process is not running
[IST Nov 2 13:04:40] info : 'rsyslog' trying to restart
[IST Nov 2 13:05:09] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:05:09] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:05:10] error : 'rsyslog' process is not running
[IST Nov 2 13:05:10] info : 'rsyslog' trying to restart
[IST Nov 2 13:05:39] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:05:39] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:05:40] error : 'rsyslog' process is not running
[IST Nov 2 13:05:40] info : 'rsyslog' trying to restart
[IST Nov 2 13:06:09] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:06:09] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:06:10] error : 'rsyslog' process is not running
[IST Nov 2 13:06:10] info : 'rsyslog' trying to restart
[IST Nov 2 13:06:39] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:06:39] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:06:40] error : 'rsyslog' process is not running
[IST Nov 2 13:06:40] info : 'rsyslog' trying to restart
[IST Nov 2 13:07:09] error : Cannot create socket to [X.X.X.X]:9096 -- Connection refused
[IST Nov 2 13:07:09] error : M/Monit: cannot open a connection to http://[X.X.X.X]:9096/collector
[IST Nov 2 13:07:10] error : 'rsyslog' process is not running
[IST Nov 2 13:07:10] info : 'rsyslog' trying to restart
Comments (12)
-
repo owner Could you please run Monit in debug mode during the test and send the output?
monit -vI
-
reporter - attached monit_log.zip
Please find the debug log attached for your reference.
I can see the failure event information on the M/Monit side, but I cannot see my system metrics (CPU, memory, disk, etc.) there.
-
reporter Hi Tildeslash,
Could you please check this and provide an update?
-
repo owner Hello Kumar, thank you for the data.
It seems there is a misunderstanding of the event queue: it only spools and retries events, and it does not buffer the general service statistics while M/Monit is unreachable. When M/Monit becomes available again, Monit sends all queued events, so the error state transitions are not lost, but there will be a gap in the M/Monit charts because the statistics for that period are not available.
The log shows that the queued event was delivered correctly (works as expected):
[IST Nov 17 23:52:20] debug : Processing postponed events queue
[IST Nov 17 23:52:20] debug : Processing queued event '/home/cogniz/monit-5.25.2/queue//1542478549_1e5d660'
[IST Nov 17 23:52:20] debug : M/Monit: event message sent to http://[172.16.23.14]:9096/collector
[IST Nov 17 23:52:20] debug : Removing queued event /home/cogniz/monit-5.25.2/queue//1542478549_1e5d660
A statistics queue is not currently available, so I will switch this issue's type to feature request (it may be implemented in the future).
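To make the distinction between queued events and unbuffered statistics concrete, here is a minimal Python sketch of the behaviour described in this comment. It is a toy model only, not Monit's actual implementation: the `Agent` class, the `send` callable, the `simulate` helper and the JSON file naming are all hypothetical names invented for illustration.

```python
import json
import tempfile
from pathlib import Path


class Agent:
    """Toy model of the described behaviour; NOT Monit's real code."""

    def __init__(self, queue_dir, send):
        self.queue_dir = Path(queue_dir)
        self.send = send  # hypothetical callable, returns True on delivery
        self.counter = 0

    def report_event(self, event):
        # An event that cannot be delivered is spooled to the queue directory.
        if not self.send("event", event):
            self.counter += 1
            path = self.queue_dir / f"{self.counter:06d}.json"
            path.write_text(json.dumps(event))

    def report_statistics(self, stats):
        # Statistics are sent once and never spooled: a failed send is a gap.
        return self.send("stats", stats)

    def flush_queue(self):
        # Retry queued events in order; delete each file once delivered.
        for path in sorted(self.queue_dir.glob("*.json")):
            if not self.send("event", json.loads(path.read_text())):
                break
            path.unlink()


def simulate():
    received = []
    link = {"up": False}  # start with the collector unreachable

    def send(kind, payload):
        if link["up"]:
            received.append((kind, payload))
        return link["up"]

    with tempfile.TemporaryDirectory() as tmp:
        agent = Agent(tmp, send)
        agent.report_statistics({"cpu": 12.5})  # lost during the outage
        agent.report_event({"service": "rsyslog", "state": "not running"})
        link["up"] = True  # connection restored
        agent.flush_queue()  # the queued event is delivered now
        agent.report_statistics({"cpu": 14.0})
        leftover = len(list(Path(tmp).glob("*.json")))
    return received, leftover
```

Calling `simulate()` in this model delivers the queued event and only the statistics sample taken after the connection is restored; the sample taken during the outage never arrives, which corresponds to the chart gap mentioned above.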
-
repo owner - marked as enhancement
- marked as major
-
reporter Thanks for the update, Tildeslash.
This feature should be available by default; otherwise we will lose the system metric trends.
-
reporter Hi Tildeslash,
Is there a timeline for this feature? Our system is failing resilience testing because of it.
-
repo owner Storing statistics data on the Monit host until the connection to M/Monit succeeds is a good suggestion for avoiding gaps in M/Monit charts. Still, this is a nice-to-have feature and not really critical. It handles the situation where the network connection is down for longer than a minute (M/Monit's chart granularity is 1 minute). If the host is down or Monit is down, there will still be data gaps. This feature is not going to be prioritised, but we will definitely put it on our TODO list.