Handling of slow starting services

Issue #941 resolved
bablex created an issue

We monitor the redis pidfile with (m)monit. Unfortunately, redis may take quite a long time to restart, e.g. 10-15 minutes, depending on the amount of data and on whether RDB or AOF is used.

Unfortunately, monit kills the restarting process because it checks every minute, so we end up in a perpetual death cycle: monit kills redis while it is still initializing and then restarts it again.
Is there a way to solve this problem with existing config tools?

Comments (6)

  1. Lutz Mader

    Hello bablex,
    this is ugly, indeed.

    For the Apache/IHS webserver, I count the number of cycles before I do a restart.

    check process Ihs_0_server1 with pidfile "/opt/IBM/wlp/servers/appl1/logs/httpd.pid"
      start program "/usr/local/etc/monit/scripts/wlpihs.sh start" with timeout 120 seconds
      stop program "/usr/local/etc/monit/scripts/wlpihs.sh stop" with timeout 120 seconds
      if failed host hostname.local port 8901 for 10 cycles then restart
      if failed host hostname.local port 8901 then alert
      if not exist for 5 cycles then start
      if 5 restarts within 50 cycles then unmonitor
    

    To be alerted immediately, I also send an alert on the first failure. This gives a good overview if something goes wrong.

    The number of cycles depends on your monitor interval (I use 60s, see the "set daemon" value) and on how long the application takes to listen on the port after startup.

    A suggestion only,
    Lutz

    p.s.

    Some more or less useful samples are available at
    https://mmonit.com/wiki/Monit/HowTo and
    https://mmonit.com/wiki/Monit/ConfigurationExamples
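
    Applied to the redis case from the original report, the same pattern could look roughly like the sketch below. This is only an illustration under assumptions that are not in the thread: the pidfile path, the start/stop commands, the port 6379 and the 15-cycle threshold are placeholders, chosen so that 15 cycles at a 60 second interval cover the reported 10-15 minute startup.

    ```
    # Assumed poll interval: one cycle = 60 seconds.
    set daemon 60

    # Hypothetical pidfile and start/stop commands; adjust to the local install.
    check process redis with pidfile /var/run/redis/redis-server.pid
      # Allow the start up to 15 minutes before monit treats it as failed.
      start program = "/usr/bin/systemctl start redis-server" with timeout 900 seconds
      stop program  = "/usr/bin/systemctl stop redis-server" with timeout 120 seconds
      # Restart only after 15 consecutive failed cycles (15 x 60 s = 15 minutes).
      if failed host 127.0.0.1 port 6379 for 15 cycles then restart
      # Still alert on the first failed check.
      if failed host 127.0.0.1 port 6379 then alert
      if not exist for 5 cycles then start
      if 3 restarts within 60 cycles then unmonitor
    ```

    The cycle count and the start timeout should both be at least as long as the slowest expected startup; otherwise the restart loop can come back.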

  2. bablex reporter

    Thanks for the suggestion 🙂 I implemented it in a similar way now.

    Unfortunately, when using "depends on" (to trigger a restart), the original issue persists.

  3. Lutz Mader

    Hello bablex,
    are you using something like “timeout 120 seconds” with the “start program”?

    Are you also adding a similar “cycle” check to the services that depend on your redis server?

    With regards,
    Lutz

    p.s.

    Please add a snippet of your configuration to the post.
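
    For reference, a dependent service with its own “cycle” check might be sketched as follows. The service name myapp, its script path, pidfile and port 8080 are hypothetical, and the thread does not confirm whether this alone resolves the “depends on” case.

    ```
    # Hypothetical dependent service; names, paths and port are placeholders.
    check process myapp with pidfile /var/run/myapp.pid
      depends on redis
      start program = "/usr/local/bin/myapp.sh start" with timeout 120 seconds
      stop program  = "/usr/local/bin/myapp.sh stop" with timeout 120 seconds
      # Tolerate a slow redis start before restarting the dependent service.
      if failed host 127.0.0.1 port 8080 for 10 cycles then restart
      if failed host 127.0.0.1 port 8080 then alert
    ```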

  4. Lutz Mader

    Hello bablex,
    nice to know. Be aware that error recovery is delayed, because monit waits some cycles before it reacts to a port problem (10 cycles in the sample). But if the process has died, recovery starts without that port-check delay (in the sample, after 5 cycles).

    With regards,
    Lutz
