Option to delay check failures due to long process spinup

Right now we are testing a setup in which Monit monitors Logstash and triggers Keepalived to fail over should Logstash crash or lock up. Crashes are easy to catch with Monit. For lockups, however, we use an HTTP check; Logstash filters out and drops those requests, since they exist only for health-check purposes. The problem is that Logstash (due to Java and Ruby) takes a long time to spin up. The PID is online immediately, but our HTTP check fails for upwards of 30 seconds.

What I propose is a keyword: "SPINUP DELAY FOR x"

The reason for this is so that Monit can handle starting Logstash (using Monit to invoke it at boot, as advertised), but then wait out the spinup delay before running any "IF FAILED" checks. Example:

check process logstash with pidfile /var/run/logstash.pid
  start program = "/etc/init.d/logstash start"
  stop program = "/etc/init.d/logstash stop"
  SPINUP DELAY FOR 60
  if failed
    host 127.0.0.1 port 58888 protocol http
    request "/"
    status = 200
  then restart
  if 3 restarts within 10 cycles then exec "/opt/keepalived/force_fault_state.sh"
  if 4 restarts within 10 cycles then timeout

What this does is run the start program on a PID failure, but wait 60 seconds before executing the HTTP test (or any other test within this CHECK block).
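
To make the intended behaviour concrete, here is a rough Python sketch of the per-cycle logic I have in mind. It is only a model of the proposal, not Monit's actual code: the ServiceCheck class, the cycle() method and the returned action names are all made up for illustration, and I'm assuming a restart would reset the delay the same way a start does.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class ServiceCheck:
    """Hypothetical model of one CHECK block with a spin-up delay."""
    spinup_delay: float                 # seconds, the "x" in SPINUP DELAY FOR x
    started_at: Optional[float] = None  # when the start program was last run

    def cycle(self, now: float, pid_alive: bool, http_ok: Callable[[], bool]) -> str:
        # PID failure: run the start program and begin the grace period.
        if not pid_alive:
            self.started_at = now
            return "start"
        # Inside the grace period: skip every "if failed" test.
        if self.started_at is not None and now - self.started_at < self.spinup_delay:
            return "skip"
        # Grace period over (or never started by us): run the HTTP test as usual.
        if http_ok():
            return "ok"
        # Assumption: a restart resets the grace period too.
        self.started_at = now
        return "restart"
```

With spinup_delay=60, a cycle at 30 seconds after the start returns "skip" even if the HTTP check would fail, while a cycle at 60 seconds or later runs the test normally.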

Thanks!
