Connection testing: RETRY has no effect in case of failure

Issue #211 resolved
Vovodroid created an issue

Documentation says:
retry: RETRY number. Optionally specifies the number of consecutive retries within the same testing cycle in the case that the connection failed. The default is fail on first error.

But it seems that the RETRY option takes effect only on a connection timeout. If the connection test fails for any other reason (a TCP reject or a bad status response), the test is considered failed immediately and the ACTION is performed.

This can interfere with auto-restarting services. Suppose a service (daemon) is being restarted by the user or the system: monit sees the connection test fail and issues its own stop/start command.

Steps to reproduce: assume the following monit.rc:

.........................................
set daemon  5 
.........................................
check process nginx with pidfile /var/run/nginx.pid
  start program = "/usr/bin/systemctl start nginx"
  stop  program = "/usr/bin/systemctl stop nginx"
  every 12 cycles
  if failed port 80 protocol http retry 10  then alert
..........................................

So nginx is checked once per minute (a 5-second daemon cycle times 12 cycles).

nginx configuration:

server {
    listen 80;
    return 200;
}

If you change the listen port, or change the return code to 500, and run nginx -s reload, monit reports the failure at the very next checking cycle, without giving nginx any chance to recover. I expected the test to be retried RETRY times with some delay between attempts.

I suggest applying the TIMEOUT option to failed tests as well: when a test fails, wait until TIMEOUT has elapsed, counted from the very beginning of the test, and then make the next of the RETRY attempts. Only after TIMEOUT * RETRY has passed should the test be considered failed.
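The proposed behaviour can be sketched in a few lines of Python. This is only an illustration of the suggestion, not Monit's implementation: the function name `probe_with_retry` and its parameters are hypothetical, and `check` stands for any callable that returns True when the connection test passes.

```python
import time


def probe_with_retry(check, retries, timeout, clock=time.monotonic, sleep=time.sleep):
    """Hypothetical retry loop illustrating the suggestion (not Monit code).

    check   -- callable returning True when the connection test passes
    retries -- maximum number of attempts (the RETRY value)
    timeout -- seconds allotted to each attempt (the TIMEOUT value)
    """
    for attempt in range(1, retries + 1):
        started = clock()
        if check():
            return True
        # A fast failure (for example "connection refused") still uses up
        # the whole TIMEOUT slot before the next attempt is made.
        remaining = timeout - (clock() - started)
        if attempt < retries and remaining > 0:
            sleep(remaining)
    return False
```

This sketch skips the wait after the final attempt, so a service that stays down is reported after roughly TIMEOUT * (RETRY - 1) seconds rather than the full TIMEOUT * RETRY. The `clock` and `sleep` parameters exist only so the loop can be exercised without real delays.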

Regards.

Comments (6)

  1. Tildeslash repo owner

    Hello,

    If you want to retry with a delay, use "for X cycles", for example:

    if failed port 80 protocol http for 10 cycles then alert
    

    The RETRY option only retries the connection within the same cycle, with no delay between attempts.

  2. Vovodroid reporter

    > The RETRY option allows to retry the connection only in the same cycle with no delay between the attempts.

    That's exactly what I am complaining about )))

    > use "for X cycles"

    Well, it's not the same. For example, one might want to test a service once per hour and restart it if it doesn't respond within five minutes (i.e. within the same cycle, but still with some fault tolerance), not after two hours.
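The "two hours" figure can be checked with a little arithmetic. The sketch below assumes that "for X cycles" counts the service's own checks (daemon cycle length times the "every" value), which this thread suggests but does not confirm. The function names are made up for illustration.

```python
def check_interval(daemon_seconds, every_cycles=1):
    """Seconds between two checks of the same service."""
    return daemon_seconds * every_cycles


def action_delay_bounds(daemon_seconds, every_cycles, for_cycles):
    """Approximate (shortest, longest) time from a fault to the action.

    Assumption: the action fires on the for_cycles-th consecutive failed
    check, and the fault may occur anywhere between two checks.
    """
    interval = check_interval(daemon_seconds, every_cycles)
    return (for_cycles - 1) * interval, for_cycles * interval


# Hourly check: 60-second daemon cycle, every 60 cycles, "for 2 cycles".
shortest, longest = action_delay_bounds(60, 60, 2)
print(shortest / 3600, longest / 3600)  # 1.0 2.0 (hours)
```

Under that assumption, an hourly check with "for 2 cycles" acts between one and two hours after the fault, whereas the request here is for a reaction within minutes.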

  3. Tildeslash repo owner

    Currently the pause between checks is determined only by the cycle length and the "every" statement. The RETRY option was designed to retry within the same cycle and has no retry scheduler of its own.

    We're about to start work on a new test scheduler, which will be more flexible than the current cycle+every scheduling.

  4. Vovodroid reporter

    It seems that "for X cycles" counts not the global daemon cycle but the service's own check cycle (global period * "every" value).
    Config:

    set daemon  60 
    
    check host example.com address 127.0.0.1
      every 10 cycles
      start  = "/usr/bin/docker start example.com"
      stop   = "/usr/bin/docker kill   example.com"
      if failed host example.com port 443 protocol https for 1 cycles then restart
    

    Result:

    [UTC Jul 16 11:08:32] error    : 'example.com' failed protocol test [HTTP] at [example.com]:443 [TCP/IP SSL] -- Connection refused
    [UTC Jul 16 11:08:32] info     : 'example.com' trying to restart
    [UTC Jul 16 11:08:32] info     : 'example.com' stop: /usr/bin/docker
    [UTC Jul 16 11:08:32] info     : 'example.com' start: /usr/bin/docker
    [UTC Jul 16 11:18:34] info     : 'example.com' connection succeeded to [example.com]:443 [TCP/IP SSL]
    

    So it took ten minutes to discover that the service was alive after it had been started successfully. Would it be worth using the global cycle for "for X cycles"?
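The ten-minute gap can be read straight off the two log lines and compared with the service's check interval from the config above (60-second daemon cycle, every 10 cycles). This is a small sketch; the helper name `log_time` is made up, and the year 2000 is a placeholder because the log lines do not include one.

```python
from datetime import datetime


def log_time(line):
    """Parse the timestamp between "[UTC " and the first "]"."""
    stamp = line[len("[UTC "):line.index("]")]
    # The log has no year; 2000 is an arbitrary placeholder.
    return datetime.strptime("2000 " + stamp, "%Y %b %d %H:%M:%S")


restart = "[UTC Jul 16 11:08:32] info     : 'example.com' start: /usr/bin/docker"
success = ("[UTC Jul 16 11:18:34] info     : 'example.com' connection "
           "succeeded to [example.com]:443 [TCP/IP SSL]")

gap_seconds = (log_time(success) - log_time(restart)).total_seconds()
expected_interval = 60 * 10  # "set daemon 60" times "every 10 cycles"

print(gap_seconds, expected_interval)  # 602.0 600
```

The measured gap of 602 seconds matches one service check interval of 600 seconds, which is consistent with the recovery being noticed only at the service's next scheduled check.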
