Issues with timeouts in Monit 5.26.0

Issue #911 new
Narendra Patel created an issue

Hi,

I’m trying to automate restarts for our applications but am running into a peculiar issue.

I have created a sample Python web app and started it. I have also configured it in Monit with appropriate stop and start commands.

Monit Config:
CHECK PROCESS SAMPLE MATCHING "/vagrant/sample-app.py"
start program = "/bin/sh -c 'cd /vagrant && ./start.sh'" with timeout 30 seconds
stop program = "/bin/sh -c 'cd /vagrant && ./stop.sh'" with timeout 30 seconds
if failed port 80 for 3 cycles then restart

My stop.sh script:

sleep 60
kill $(ps -ef | grep -i sample-app | grep -v grep | grep -i bin | awk '{print $2}')

The problem occurs when we issue a stop command to Monit for our service and the stop script takes longer than the configured timeout.

stop command: monit stop SAMPLE

Monit times out the script correctly:

Jun 25 21:28:15 localhost monit[3063]: 'SAMPLE' stop on user request
Jun 25 21:28:15 localhost monit[3063]: Monit daemon with PID 3063 awakened
Jun 25 21:28:15 localhost monit[3063]: Awakened by User defined signal 1
Jun 25 21:28:15 localhost monit[3063]: 'SAMPLE' stop: '/bin/sh -c cd /vagrant && ./stop.sh'
Jun 25 21:28:45 localhost monit[3063]: 'SAMPLE' failed to stop (exit status -1) -- Program '/bin/sh -c cd /vagrant && ./stop.sh' timed out after 30 s
Jun 25 21:28:45 localhost monit[3063]: 'SAMPLE' stop action failed

However, the process keeps running for another 30 seconds after the timeout because of the 60-second sleep that I’ve added to my stop script.

Yet when we query the status of the service, it reports Not monitored with exit code zero.

[root@localhost vagrant]# monit status SAMPLE && echo $? && date
Monit 5.26.0 uptime: 4h 6m

Process 'SAMPLE'
status Not monitored
monitoring status Not monitored
monitoring mode active
on reboot start
data collected Thu, 25 Jun 2020 21:27:53

0

Thu Jun 25 21:28:46 UTC 2020

This is the same output we get when the process is stopped legitimately, without a timeout.

Is there a way to tell whether the process stopped correctly or a timeout occurred?

Maybe I’m doing something wrong here; I have only just started with Monit.

Comments (7)

  1. Tildeslash repo owner

    Hello,

    if you perform a service stop, Monit will call the stop program, but it doesn’t try to kill the process itself if the stop program times out. The stop sequence is the responsibility of the stop program (including a fallback to 'kill -9' in case the program is stuck).

    Doing an implicit kill on stop program timeout is dangerous: each process is different, and killing the process unconditionally may not be the best thing (the process may lose some data). Hence Monit doesn’t pull the trigger and lets the admin remedy the situation (it does at least stop the monitoring, though).
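As an illustration of the stop-program fallback described above, here is a minimal sketch of a helper that a stop script could use. The function name `stop_with_fallback`, its arguments, and the 20-second default grace period are assumptions made for this example and are not part of Monit. It sends SIGTERM first and resorts to SIGKILL only if the process is still alive after the grace period.

```shell
#!/bin/sh
# Hypothetical helper (not part of Monit): graceful TERM first,
# KILL only as a last resort.
# Usage: stop_with_fallback PID [GRACE_SECONDS]
stop_with_fallback() {
    pid=$1
    grace=${2:-20}   # assumed default; keep it below Monit's stop timeout

    # Nothing to do if the process is already gone.
    kill -0 "$pid" 2>/dev/null || return 0

    # Ask the process to shut down cleanly.
    kill -TERM "$pid" 2>/dev/null

    # Check once per second until the grace period is used up.
    waited=0
    while [ "$waited" -lt "$grace" ]; do
        kill -0 "$pid" 2>/dev/null || return 0
        sleep 1
        waited=$((waited + 1))
    done

    # Still alive: force it. Only appropriate if the application
    # tolerates an abrupt kill (it may lose data).
    kill -KILL "$pid" 2>/dev/null
    sleep 1

    # Succeed only if the process is really gone now.
    ! kill -0 "$pid" 2>/dev/null
}
```

A stop script would look up the PID in whatever way suits the application and then call, for example, `stop_with_fallback "$pid" 20`. With the 30-second timeout from the configuration in this issue, a grace period of about 20 seconds leaves room for the forced kill before Monit gives up on the stop program.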

    There is probably some room for enhancements:

    1. we may add a new option for “stop program” or a new “stoptimeout program” to allow Monit to kill the process (the kill option can be set explicitly in the configuration if it’s safe to kill the process).
    2. maybe Monit shouldn’t unmonitor the program if the stop failed, and the service should instead enter some “stop failure” state to highlight that the stop didn’t work as expected

  2. Narendra Patel reporter

    Thank you so much for your quick reply and thoughts 🙂

    I guess basic exit codes could help for the status command.

    As in 0 for success, 1 in case it timed out and -1 in case the command execution fails.

    I’m still trying out Monit, so I don’t know the impact or side effects of the above. You’re probably best placed to analyze it, and I understand it might take some time.

    Keeping monitoring on could affect the restart cycles. Setting the status message and exit codes should help identify the exact state, so that the user or automation can decide what to do.

  3. Tildeslash repo owner

    An exit code may be a useful indication, but the stop program’s exit code has never been significant for Monit (Monit verifies that the process has really stopped by checking the process table), so relying on it may break backward compatibility: there are millions of Monit instances, and the stop program may not always follow the exit code expectation.

  4. Narendra Patel reporter

    I was suggesting that we implement it in the status command: 0 for all okay, 1 for stop timed out, 2 for start timed out and -1 for command failure.

    That would avoid backward compatibility issues as we don’t change anything in the stop command.

    After running the stop command one can poll the status command and check exit code to determine if the process has started / stopped correctly. This should help avoid timeout issues as well.
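According to this report, `monit status` in 5.26.0 returns 0 both after a clean stop and after a stop timeout, so for now any polling has to look at the process table itself. The sketch below is one way to do that from automation. The function name `poll_until_stopped` and its return codes are conventions of this example, not Monit behaviour.

```shell
#!/bin/sh
# Hypothetical helper (not part of Monit).
# Usage: poll_until_stopped TIMEOUT_SECONDS CHECK_COMMAND [ARGS...]
# CHECK_COMMAND must succeed while the process is still alive.
# Returns 0 once CHECK_COMMAND fails (process gone), 1 on timeout.
poll_until_stopped() {
    timeout=$1
    shift
    elapsed=0
    # Re-run the check once per second while it keeps succeeding.
    while "$@" >/dev/null 2>&1; do
        [ "$elapsed" -ge "$timeout" ] && return 1
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 0
}
```

With the configuration from this issue, automation could run `monit stop SAMPLE` followed by `poll_until_stopped 60 pgrep -f /vagrant/sample-app.py`, assuming `pgrep` is installed on the host. A return value of 1 would then mean the process outlived the stop attempt.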

  5. Philip Foulkes

    I’ve been caught by this a few times as well. I expect a program to have stopped, but because the program is buggy, it keeps running after Monit has issued the stop and timed out. I think both suggestions proposed above would be very welcome in Monit. At the very least, suggestion 2 would let me know something went wrong. But the ability to hard kill a process would be useful as well, as I’ve had to script this in stop scripts before.

  6. Lutz Mader

    Hello Tildeslash,
    this is a problem in general. The idea of using "Not monitored" for stopped resources is the problem, from my point of view.

    But as long as a script cannot stop a process, and the PID is still present or a matching process is still found in the process table, the process is still running. After a timeout, the status should be something like "stop failure".

    With regards,
    Lutz
