- changed status to resolved
Connection testing: RETRY has no effect in case of failure
Documentation says:
retry: RETRY number. Optionally specifies the number of consecutive retries within the same testing cycle in the case that the connection failed. The default is fail on first error.
But it seems that the RETRY option takes effect only on a connection timeout. If the connection test fails for another reason (a TCP reject or an unexpected status response), the test is considered failed immediately and ACTION is performed.
This can interfere with services that restart themselves. Suppose a service (daemon) is being restarted, by the user or by the system: monit sees the connection test fail and issues its own stop/start command.
Steps to reproduce: assume the following monit.rc:
.........................................
set daemon 5
.........................................
check process nginx with pidfile /var/run/nginx.pid
start program = "/usr/bin/systemctl start nginx"
stop program = "/usr/bin/systemctl stop nginx"
every 12 cycles
if failed port 80 protocol http retry 10 then alert
..........................................
So nginx is monitored once per minute.
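The once-per-minute figure follows from multiplying the daemon period by the "every" value in the monit.rc above. A minimal sketch of that arithmetic (the function name effective_interval is hypothetical, not part of monit):

```python
def effective_interval(daemon_seconds: int, every_cycles: int = 1) -> int:
    """Seconds between two consecutive tests of one service.

    Assumes the service is tested once every `every_cycles` daemon cycles,
    each cycle lasting `daemon_seconds`.
    """
    return daemon_seconds * every_cycles


# Values from the monit.rc above: "set daemon 5" and "every 12 cycles".
print(effective_interval(5, 12))  # 60
```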
nginx configuration:
server {
listen 80;
return 200;
}
If I change the listen port, or change the return code to 500, and run nginx -s reload, monit recognizes the failure instantly at the next checking cycle, without giving nginx any chance. I expected the test to be retried RETRY times with some delay between attempts.
I suggest applying the TIMEOUT option to failed tests as well: in that case wait for the TIMEOUT value, counted from the very beginning of the test, and perform RETRY attempts. Only after TIMEOUT * RETRY has elapsed should the test be considered failed.
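To make the suggestion concrete, here is a rough sketch of the proposed behaviour. It is not monit's implementation; probe_with_retry and its parameters are hypothetical names, and the probe is any callable that returns True on success.

```python
import time
from typing import Callable


def probe_with_retry(
    probe: Callable[[], bool],
    retries: int,
    timeout: float,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Return True on the first successful attempt, False after `retries` failures.

    Each attempt owns a window of `timeout` seconds measured from the start
    of that attempt. An attempt that fails quickly (TCP reject, bad status)
    waits out the rest of its window, so a complete failure takes about
    retries * timeout seconds.
    """
    for _ in range(retries):
        started = clock()
        if probe():
            return True
        remaining = timeout - (clock() - started)
        if remaining > 0:
            sleep(remaining)
    return False


if __name__ == "__main__":
    # Hypothetical probe that fails twice and then succeeds.
    outcomes = iter([False, False, True])
    print(probe_with_retry(lambda: next(outcomes), retries=10, timeout=0.1))
```

The clock and sleep arguments are only there so the timing can be checked without real waiting. Skipping the wait after the final failed attempt would be a reasonable optimization, at the cost of no longer matching the TIMEOUT * RETRY total exactly.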
Regards.
Comments (6)
-
repo owner -
reporter
> The RETRY option allows to retry the connection only in the same cycle with no delay between the attempts.
That's exactly what I'm complaining about )))
> use "for X cycles"
Well, it's not the same. For example, one might want to test some service once per hour and restart it if it doesn't respond within five minutes (i.e. within the cycle, but still with some fault tolerance), not within two hours.
-
repo owner Currently the pause between checks is determined only by the cycle length and the "every" statement. The RETRY option was designed to retry within the same cycle and doesn't have its own retry scheduler.
We're starting work on a new test scheduler, which will be more flexible than the current cycle+every scheduling.
-
repo owner - removed version
Removing version: 5.13 (automated comment)
-
reporter It seems that "for X cycles" counts not the global daemon cycle but the effective service cycle (global period * the service's "every" value).
Config:
set daemon 60
check host example.com address 127.0.0.1
every 10 cycles
start = "/usr/bin/docker start example.com"
stop = "/usr/bin/docker kill example.com"
if failed host example.com port 443 protocol https for 1 cycles then restart
Result:
[UTC Jul 16 11:08:32] error : 'example.com' failed protocol test [HTTP] at [example.com]:443 [TCP/IP SSL] -- Connection refused
[UTC Jul 16 11:08:32] info : 'example.com' trying to restart
[UTC Jul 16 11:08:32] info : 'example.com' stop: /usr/bin/docker
[UTC Jul 16 11:08:32] info : 'example.com' start: /usr/bin/docker
[UTC Jul 16 11:18:34] info : 'example.com' connection succeeded to [example.com]:443 [TCP/IP SSL]
So it took ten minutes to discover that the service was alive after it had been started successfully. Would it be worth using the global cycle for "for X cycles"?
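As a quick check of the ten-minute figure, the gap between the restart and the success message in the log above can be compared with the service interval from the config (60 s * 10 cycles). This is only a sketch; seconds_between is a hypothetical helper, and the year is an arbitrary placeholder because the log omits it.

```python
from datetime import datetime


def seconds_between(first: str, second: str) -> int:
    """Seconds from `first` to `second`, both given as 'Mon DD HH:MM:SS'."""
    fmt = "%Y %b %d %H:%M:%S"
    # The log lines carry no year, so a placeholder year is prepended.
    start = datetime.strptime(f"2000 {first}", fmt)
    end = datetime.strptime(f"2000 {second}", fmt)
    return int((end - start).total_seconds())


service_interval = 60 * 10  # "set daemon 60" times "every 10 cycles"
observed_gap = seconds_between("Jul 16 11:08:32", "Jul 16 11:18:34")

print(service_interval)  # 600
print(observed_gap)      # 602
```

The observed gap matches one full service interval rather than one 60-second daemon cycle, which is consistent with the behaviour described above.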
-
repo owner @to_vova yes, there is a standalone task to clarify "for X cycles" in combination with the "every" statement: https://bitbucket.org/tildeslash/monit/issues/174/the-for-x-cycles-is-confusing-if-the-test
Hello,
if you want to use retry with delay, then use "for X cycles", for example:
The RETRY option allows to retry the connection only in the same cycle with no delay between the attempts.