dependency: the child service may keep "exec failure" if the parent service recovered
If a service depends on another service and the child is being started, the child first makes sure that the parent service is running without any errors. If the parent has failed, the child's start action in _doStart() is aborted and the following event is posted:
Event_post(s, Event_Exec, State_Failed, s->action_EXEC, "failed to start -- could not start required services: '%s'", StringBuffer_toString(sb));
If the parent service later recovers, the child (dependent) service keeps the exec error flag. This affects all child service types except ‘check process’, which resets the exec error flag in check_process() if the process is running:
/* Reset the exec and timeout errors if active ... the process is running (most probably after manual intervention) */
if (IS_EVENT_SET(s->error, Event_Exec))
        Event_post(s, Event_Exec, State_Succeeded, s->action_EXEC, "process is running after previous exec error (slow starting or manually recovered?)");
Sample configuration that reproduces the problem:
set daemon 5
set httpd port 2812 allow localhost

check file test-child with path /tmp/test2
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"
    depends on test-parent

check file test-parent with path /tmp/test1
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"
- When /tmp/test1 and /tmp/test2 exist, all is green
- rm -f /tmp/test1 /tmp/test2
- Monit now detects that the test-child file doesn’t exist and calls the start action, which first tries to start test-parent. As the start script is a dummy, the test-parent startup fails, and test-child gets an exec error with “failed to start -- could not start required services: 'test-parent'”
- touch /tmp/test1 /tmp/test2
- Now all services are running fine, but test-child retains the exec error flag, because no start action needs to be called, and only a start action would reset the error flag
Solution:
It would be good to assign a new event type (e.g. Event_ParentFailure) to the situation where the child is in an error state because the parent service is down. Using Event_Exec is problematic because it is ambiguous: the child script itself may fail to exec, which would set Event_Exec too. When the child service test starts and the “Event_ParentFailure” error is active, it should rescan the state of its parents, and if all parents are OK, it can clear the Event_ParentFailure error.
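The rescan described above could look roughly like the following sketch. It does not use Monit's real structures: service_t, EVENT_EXEC, EVENT_PARENT_FAILURE, parents_ok() and clear_parent_failure_if_recovered() are all hypothetical names invented for illustration, and the error field is assumed to be a plain bitmask.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; these are NOT Monit's real types. */
enum {
    EVENT_EXEC           = 1u << 0, /* stand-in for Event_Exec */
    EVENT_PARENT_FAILURE = 1u << 1  /* proposed flag, does not exist in Monit */
};

typedef struct service {
    const char *name;
    unsigned error;            /* bitmask of currently active error events */
    struct service **parents;  /* services this one depends on */
    size_t parent_count;
} service_t;

/* Returns true if no parent has an active error. */
static bool parents_ok(const service_t *s) {
    for (size_t i = 0; i < s->parent_count; i++)
        if (s->parents[i]->error != 0)
            return false;
    return true;
}

/* Intended to run at the start of the child's check cycle.
 * Clears the parent-failure flag once all parents are healthy again;
 * returns true if the flag was cleared. */
static bool clear_parent_failure_if_recovered(service_t *s) {
    if ((s->error & EVENT_PARENT_FAILURE) && parents_ok(s)) {
        s->error &= ~(unsigned)EVENT_PARENT_FAILURE;
        return true;
    }
    return false;
}
```

Because the parent failure has its own flag here, a genuine exec failure of the child's own start program would stay set and would not be cleared by the rescan.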
Unfortunately, we’re out of event types (see Event_Type in event.h), so we need to refactor the event handler first to support more event types. This issue is blocked until then.
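One possible shape for that refactoring is to widen the event mask to 64 bits. This assumes the event types are single-bit flags in a 32-bit mask, which the IS_EVENT_SET(s->error, Event_Exec) test quoted earlier suggests but this issue does not state. The names event_mask_t, EVENT_BIT, the helper functions and the bit positions are hypothetical and not taken from event.h.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical wider mask: 64 single-bit event flags instead of 32. */
typedef uint64_t event_mask_t;

/* Build a single-bit flag; the cast keeps the shift in 64-bit arithmetic. */
#define EVENT_BIT(n) ((event_mask_t)1 << (n))

/* Arbitrary illustrative bit positions, not Monit's real values. */
enum {
    EVENT_EXEC_BIT           = 2,
    EVENT_PARENT_FAILURE_BIT = 32  /* first bit beyond a 32-bit mask */
};

static inline bool event_is_set(event_mask_t mask, unsigned bit) {
    return (mask & EVENT_BIT(bit)) != 0;
}

static inline event_mask_t event_set(event_mask_t mask, unsigned bit) {
    return mask | EVENT_BIT(bit);
}

static inline event_mask_t event_clear(event_mask_t mask, unsigned bit) {
    return mask & ~EVENT_BIT(bit);
}
```

Any code that stores, serializes or prints the mask as a 32-bit value would need to be changed in the same step, which is presumably why the refactoring has to come first.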
Comments (2)
Thanks Tildeslash,
nice to see someone taking a look at the problem. An additional event type is a nice idea to give a hint about the real problem; today this can be seen in the Monit log only. Unfortunately there is a lack of event types, you are right.
With regards,
Lutz