Process is restarted indefinitely
Issue #546
new
For some reason, Monit isn't resetting the failure counter for HTTP checks when a restart is triggered. The problem was observed when restarting Elasticsearch after 10 failed HTTP queries.
Here's the repro-case with a simple systemd service:
[GMT Feb 3 01:56:28] info : 'test-es' monitor action done
[GMT Feb 3 01:56:28] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:56:43] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:56:58] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:57:13] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:57:28] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:57:43] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:57:58] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:58:13] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:58:28] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:58:43] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:58:43] info : 'test-es' trying to restart
[GMT Feb 3 01:58:43] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 01:58:43] info : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb 3 01:59:15] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:59:15] info : 'test-es' trying to restart
[GMT Feb 3 01:59:15] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 01:59:15] info : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb 3 01:59:46] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 01:59:46] info : 'test-es' trying to restart
[GMT Feb 3 01:59:46] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 01:59:46] info : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb 3 02:00:16] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 02:00:16] info : 'test-es' trying to restart
[GMT Feb 3 02:00:16] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 02:00:16] info : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb 3 02:00:48] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 02:00:48] info : 'test-es' trying to restart
[GMT Feb 3 02:00:48] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 02:00:48] info : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb 3 02:01:19] error : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb 3 02:01:19] info : 'test-es' trying to restart
[GMT Feb 3 02:01:19] info : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb 3 02:01:19] info : 'test-es' start: '/bin/systemctl start test-es.service'
Monit configuration:
check process test-es with pidfile /run/test-es.pid
  start program = "/bin/systemctl start test-es.service"
  stop program = "/bin/systemctl stop test-es.service"
  if failed
    host XXX.XXX.XXX.XXX
    port YYYY
    protocol http
    request /_cat/health
    status 200
    timeout 10 seconds
    for 10 cycles
  then restart
  if 2 restarts within 2 cycles then unmonitor
Instead of retrying the HTTP test for 10 cycles after a restart, Monit runs only one test and then restarts again. Also, the service isn't unmonitored after two restarts. From what I've seen, this behavior occurs if the service writes its PID file immediately after the restart. If the PID file isn't written, there's still only one HTTP test, but the service is unmonitored after two failures.
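To make the difference concrete, here is a toy model of a consecutive-failure counter. It is not Monit's code: the function name `simulate` and the `reset_on_restart` flag are hypothetical, and it only assumes that "for 10 cycles" means 10 consecutive failed checks. With the counter reset on restart (the behavior this report expects), a permanently failing check restarts once every 10 cycles. Without the reset, every cycle after the first restart triggers another restart, which matches the log above.

```python
def simulate(failures, threshold, reset_on_restart):
    """Return the cycle numbers at which a restart would fire.

    failures: one boolean per cycle (True = the HTTP check failed).
    threshold: consecutive failures required before restarting.
    reset_on_restart: hypothetical switch between expected and observed behavior.
    """
    restarts = []
    consecutive = 0
    for cycle, failed in enumerate(failures, start=1):
        if not failed:
            consecutive = 0  # a passing check always clears the counter
            continue
        consecutive += 1
        if consecutive >= threshold:
            restarts.append(cycle)
            if reset_on_restart:
                consecutive = 0  # expected: start counting again after a restart
    return restarts


always_failing = [True] * 30
expected = simulate(always_failing, threshold=10, reset_on_restart=True)
observed = simulate(always_failing, threshold=10, reset_on_restart=False)
print(expected)        # [10, 20, 30]
print(len(observed))   # 21 (a restart on every cycle from 10 through 30)
```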
Seems like this is a regression of #64. Also, #711 and #787 seem to be duplicates of this bug.