Status and state handling improvement

Hello Tildeslash,
following your manual, I use a rule like

if 2 restarts within 3 cycles then unmonitor

to prevent endless, useless recovery attempts, as the manual suggests. This works well because monit stops trying to recover broken applications or resources.

After a system restart, however, monit does not restart these broken applications, because I use "reboot laststate" to keep deliberately stopped applications from being started at boot.

As a workaround on systems where I use "reboot start", I use an additional "nostart" flag file together with some scripts to find broken applications, but this does not work well and depends on the scripts in use.

My question/suggestion:
monit should distinguish between stopped and broken services. Using "Not monitored" for both does not always fit. Monit cannot handle "reboot laststate" well, because the last state is sometimes not the desired state.

To stop recovery for an application, the service should be set to "Broken" (or a similar name), and monit should stop monitoring it.

if 2 restarts within 3 cycles then broken

It is important to establish that the application has failed in a non-recoverable way, and the state should show this.
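
As a rough illustration of why the split matters at boot time, here is a minimal sketch. It is not monit code: the `State` enum, the function `start_after_reboot` and the option strings are hypothetical, and the rule that a "Broken" service gets a fresh start attempt under "laststate" is my assumption about the desired behaviour.

```python
from enum import Enum


class State(Enum):
    OK = "OK"
    STOPPED = "Stopped"              # proposed: stopped on purpose
    BROKEN = "Broken"                # proposed: recovery gave up
    NOT_MONITORED = "Not monitored"


def start_after_reboot(last_state: State, reboot_option: str) -> bool:
    """Hypothetical helper: should the service be started after a system restart?"""
    if reboot_option == "start":
        return True
    if reboot_option == "nostart":
        return False
    if reboot_option == "laststate":
        # Assumption: a Broken service gets a fresh start attempt, while
        # Stopped and Not monitored services are left alone.
        return last_state in (State.OK, State.BROKEN)
    raise ValueError(f"unknown reboot option: {reboot_option}")
```

Under this reading, a "Broken" service is started again after a reboot with "laststate", while a "Stopped" one is not. A single "Not monitored" state cannot express that difference.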

With the "stop" command, a service should be set to "Stopped", and monit can either stop or continue monitoring these applications or resources. If monit continues monitoring them, it gets the right state when someone starts the application outside monit or when a resource becomes available for some other reason. If monit stops monitoring them, it saves some resources and works the same way as it does now.

From my point of view, it is important that monit does not stop monitoring the application until the "stop command" has finished, so that it can check the application again. Today monit stops monitoring before the application has actually stopped, so the state becomes "Not monitored" although the application is sometimes still available.
I think more useful handling of start/stop command failures and timeouts is necessary; something like "start failed" and "stop failed" would help with command timeouts. Today monit already uses similar status information, such as "start pending", to show additional transient status information.
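
The stop handling I have in mind could be sketched like this. It is only an illustration, not monit code: the helper `finish_stop` and its parameters are hypothetical, and keeping the previous state when the stop fails is my assumption.

```python
from typing import Optional, Tuple


def finish_stop(previous_state: str, command_ok: bool,
                still_running: bool) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: return (state, transient status) once the
    stop command has returned or timed out."""
    # Only report "Stopped" if the command succeeded AND a final check
    # confirms that the application is really gone.
    if command_ok and not still_running:
        return ("Stopped", None)
    # Otherwise keep the previous state and flag the failure.
    return (previous_state, "stop failed")
```

The point of the design is that the final check happens after the stop command, so the service never shows as stopped while the application is still available.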

You could use the state "Zombie" or "Problem" (or "OK") for applications started outside monit, but this depends on whether monitoring is enabled or disabled after the "stop command" has finished.

The "start" command enables monitoring and starts the application if a "start command" is available. If the start command fails, the transient status information should change from "start pending" to "start failed" to help find the failure. A status like "Does not exist - start failed" or "Link down - start failed" is more helpful than "Does not exist" or "Link down" alone, especially if the recovery does not work. Today "Link down" is useful, but you cannot see whether monit tried to recover from the problem.
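
A minimal sketch of how such a combined status text could be built. The function `format_status` is hypothetical and not part of monit; the status strings are the ones from the examples above.

```python
from typing import Optional


def format_status(status: str, transient: Optional[str] = None) -> str:
    """Hypothetical helper: append the transient status, if any, to the status text."""
    if transient:
        return f"{status} - {transient}"
    return status
```

For example, `format_status("Link down", "start failed")` gives "Link down - start failed", while `format_status("Link down")` stays "Link down".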

With the "unmonitor" command, the service status should be set to "Not monitored" and monitoring should stop; this already fits well today. The "monitor" command enables monitoring again. After a system restart, the handling of these services should depend on the reboot option.

The unused action "ignore" should become available for debugging.
It does nothing, which makes it useful for enabling a test for testing purposes without sending an alert or executing a command.

Simplifying the previously used states "Accessible", "Running", "Online with all services", "Status ok" and "UP" to "OK" was a nice idea. The next step should be to split "Not monitored" into "Stopped", "Broken" and "Not monitored". A "Stopped" service is not the same as a "Not monitored" service, and vice versa.

A question/suggestion only, for monit 5.30.

With regards,
Lutz

p.s.
Sorry, I sometimes mixed up "state" and "status", but monit uses these terms synonymously too.
Unfortunately, in German the terms "status" and "state" ("Status" and "Zustand") are very often used synonymously, but this is wrong from the point of view of control theory.

As I understand monit, the states are "OK" and "Not monitored", and the intermediate states are "Initializing" and "Waiting". All other information (from Event_Table) is status information or, like "start pending", transient status information only.
But perhaps I'm wrong.

p.s.
A more useful description of the status model should be added to the manual to help readers understand the status and state changes inside monit.
