dependency: the child service may keep "exec failure" if the parent service recovered

If a service depends on another service and the child is being started, the child first makes sure that the parent service is running without errors. If the parent has a failure, the child's start action in _doStart() is aborted and the following event is posted:

Event_post(s, Event_Exec, State_Failed, s->action_EXEC, "failed to start -- could not start required services: '%s'", StringBuffer_toString(sb));

If the parent service later recovers, the child (dependent) service keeps the exec error flag. This affects all child service types except ‘check process’, which resets the exec error flag in check_process() if the process is running:

        /* Reset the exec and timeout errors if active ... the process is running (most probably after manual intervention) */
        if (IS_EVENT_SET(s->error, Event_Exec))
                Event_post(s, Event_Exec, State_Succeeded, s->action_EXEC, "process is running after previous exec error (slow starting or manually recovered?)");

Sample configuration that reproduces the problem:

set daemon 5

set httpd port 2812 allow localhost

check file test-child with path /tmp/test2
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"
    depends on test-parent

check file test-parent with path /tmp/test1
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"

When /tmp/test1 and /tmp/test2 exist, all is green
rm -f /tmp/test1 /tmp/test2
Monit now detects that the test-child file doesn’t exist and calls its start action, which first tries to start test-parent. As the start program is a dummy, the test-parent start fails, and test-child gets an exec error with “failed to start -- could not start required services: 'test-parent'”.
touch /tmp/test1 /tmp/test2
Now all services are running fine, but test-child retains the exec error flag, because no start action needs to be called anymore, and only a start action would reset the flag.

Solution:

It would be good to assign a new event type (e.g. Event_ParentFailure) to the situation where a child is in an error state because its parent service is down. Using Event_Exec is problematic because it is ambiguous: the child's own start program may fail to execute, which sets Event_Exec too. When the child service test starts and the Event_ParentFailure error is active, it should rescan the state of its parents, and if all parents are OK, clear the Event_ParentFailure error.
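
A minimal sketch of the proposed recheck, under stated assumptions: the Service struct, the two-bit error mask, and the functions parents_ok() and recheck_parent_failure() are hypothetical simplifications for illustration, not Monit's actual types or API. Event_ParentFailure is the event type proposed above and does not exist yet.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; NOT Monit's real Event_Type or Service_T. */
enum {
    Event_Exec          = 1 << 0, /* the service's own start program failed */
    Event_ParentFailure = 1 << 1  /* proposed: a required parent service is in error */
};

typedef struct Service {
    const char *name;
    unsigned error;            /* bitmask of active error events */
    struct Service **parents;  /* services this one depends on */
    size_t parentCount;
} Service;

/* Return true if every parent is currently free of errors. */
static bool parents_ok(const Service *s) {
    assert(s != NULL);
    for (size_t i = 0; i < s->parentCount; i++)
        if (s->parents[i]->error != 0)
            return false;
    return true;
}

/* Intended to run at the start of the child's test cycle: if the dependency
 * error is active, rescan the parents and clear it once all of them are OK.
 * Any Event_Exec raised by the child's own start program is left untouched.
 * Returns true if the flag was cleared. */
static bool recheck_parent_failure(Service *s) {
    assert(s != NULL);
    if ((s->error & Event_ParentFailure) && parents_ok(s)) {
        s->error &= ~(unsigned)Event_ParentFailure;
        return true;
    }
    return false;
}
```

With a separate flag, the recheck can clear the dependency error without guessing whether an active Event_Exec came from the child's own start program, which is the ambiguity described above.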

Unfortunately, we’re out of event types (see Event_Type in event.h), so we need to refactor the event handler first to support more event types. This issue is blocked until then.

