- changed status to closed
Option to delay check failures due to long process spinup
Right now we are testing a setup in which Monit monitors Logstash and triggers Keepalived to fail over should Logstash crash or lock up. Crashes are easy to catch with Monit. For lockups, however, we use an HTTP check whose requests Logstash is configured to filter out and drop, since they exist only for health-check purposes. The problem is that Logstash (due to Java and Ruby) takes a long time to spin up. The PID is online immediately, but our HTTP check fails for upwards of 30 seconds.
What I propose is a keyword: "SPINUP DELAY FOR x"
The reason is so that Monit can handle starting Logstash (using Monit to invoke it at boot, as advertised), but then wait out the spinup delay before running the "IF FAILED" checks. Example:
    check process logstash with pidfile /var/run/logstash.pid
        start program = "/etc/init.d/logstash start"
        stop program = "/etc/init.d/logstash stop"
        SPINUP DELAY FOR 60
        if failed
            host 127.0.0.1 port 58888 protocol http
            request "/"
            status = 200
        then restart
        if 3 restarts within 10 cycles then exec "/opt/keepalived/force_fault_state.sh"
        if 4 restarts within 10 cycles then timeout
This runs the START program when the PID check fails, but waits 60 seconds before executing the HTTP test (or any other test within this CHECK block).
Thanks!
Comments (1)
repo owner
This feature is implemented already: Monit delays the connection tests for the duration of the start program's timeout. For example, giving the start program a 60-second timeout postpones the connection test by 60 seconds after a process restart.
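A minimal sketch of such a configuration, assuming the same Logstash init script, pidfile, and health-check port as in the report above (the 60-second value is illustrative and should be tuned to the actual startup time):

```
check process logstash with pidfile /var/run/logstash.pid
    # "with timeout" sets how long Monit allows the start program to complete;
    # per the reply above, connection tests are held off for this period.
    start program = "/etc/init.d/logstash start" with timeout 60 seconds
    stop program = "/etc/init.d/logstash stop"
    if failed
        host 127.0.0.1 port 58888 protocol http
        request "/"
        status = 200
    then restart
```

This takes the place of the proposed SPINUP DELAY keyword: the delay is attached to the start program itself, so it applies whenever Monit starts or restarts the process.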
If it doesn't work for you, please check the monit version (monit -V) and upgrade monit.