Process is restarted indefinitely

For some reason, Monit isn't resetting the counter for HTTP checks when a restart is triggered. The problem was witnessed when restarting Elasticsearch after 10 failed HTTP queries.

Here's the repro case with a simple systemd service. The Monit log comes first, followed by the service configuration:

[GMT Feb  3 01:56:28] info     : 'test-es' monitor action done
[GMT Feb  3 01:56:28] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:56:43] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:56:58] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:57:13] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:57:28] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:57:43] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:57:58] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:58:13] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:58:28] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:58:43] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:58:43] info     : 'test-es' trying to restart
[GMT Feb  3 01:58:43] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 01:58:43] info     : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb  3 01:59:15] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:59:15] info     : 'test-es' trying to restart
[GMT Feb  3 01:59:15] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 01:59:15] info     : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb  3 01:59:46] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 01:59:46] info     : 'test-es' trying to restart
[GMT Feb  3 01:59:46] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 01:59:46] info     : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb  3 02:00:16] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 02:00:16] info     : 'test-es' trying to restart
[GMT Feb  3 02:00:16] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 02:00:16] info     : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb  3 02:00:48] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 02:00:48] info     : 'test-es' trying to restart
[GMT Feb  3 02:00:48] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 02:00:48] info     : 'test-es' start: '/bin/systemctl start test-es.service'
[GMT Feb  3 02:01:19] error    : 'test-es' failed protocol test [HTTP] at [XXX.XXX.XXX]:YYYY/_cat/health [TCP/IP] -- Connection refused
[GMT Feb  3 02:01:19] info     : 'test-es' trying to restart
[GMT Feb  3 02:01:19] info     : 'test-es' stop: '/bin/systemctl stop test-es.service'
[GMT Feb  3 02:01:19] info     : 'test-es' start: '/bin/systemctl start test-es.service'

check process test-es with pidfile /run/test-es.pid
  start program = "/bin/systemctl start test-es.service"
  stop program = "/bin/systemctl stop test-es.service"
  if failed
    host XXX.XXX.XXX.XXX
    port YYYY
    protocol http
    request /_cat/health
    status 200
    timeout 10 seconds
  for 10 cycles
  then restart
  if 2 restarts within 2 cycles then unmonitor

Instead of retrying the HTTP test for 10 cycles after a restart, Monit runs only one test before restarting again. Also, the service isn't unmonitored after two restarts. From what I've seen, this behavior occurs if the service writes its PID file immediately after the restart. If the PID file isn't written, there's still only one HTTP test, but the service is unmonitored after two failures.
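
To make the pattern easier to check on other logs, here is a minimal Python sketch that counts how many failed protocol tests were logged before each restart. It is not part of Monit; the function name `failures_per_restart` is made up for illustration, and it assumes the log lines contain the same "failed protocol test" and "trying to restart" messages as the excerpt above.

```python
def failures_per_restart(log_text, service="test-es"):
    """Count failed protocol tests logged before each restart attempt.

    Assumes Monit log lines shaped like the excerpt in this report.
    """
    fail_marker = f"'{service}' failed protocol test"
    restart_marker = f"'{service}' trying to restart"

    counts = []
    failures = 0
    for line in log_text.splitlines():
        if fail_marker in line:
            failures += 1
        elif restart_marker in line:
            counts.append(failures)
            failures = 0  # start counting again for the next restart
    return counts
```

Run against the log above, this gives 10 for the first restart and 1 for each of the five restarts that follow, whereas I would expect 10 every time.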
