dependency: the child service may keep "exec failure" if the parent service recovered

Issue #996 new
Tildeslash repo owner created an issue

If a service depends on another service and the child is being started, the child first makes sure that the parent service is running without any errors. If the parent has a failure, the child's start action in _doStart() is aborted and the following error event is posted:

Event_post(s, Event_Exec, State_Failed, s->action_EXEC, "failed to start -- could not start required services: '%s'", StringBuffer_toString(sb));

If the parent service later recovers, the child (dependent) service keeps the exec error flag. This affects all child service types except ‘check process’, which resets the exec error flag in check_process() if the process is running:

        /* Reset the exec and timeout errors if active ... the process is running (most probably after manual intervention) */
        if (IS_EVENT_SET(s->error, Event_Exec))
                Event_post(s, Event_Exec, State_Succeeded, s->action_EXEC, "process is running after previous exec error (slow starting or manually recovered?)");

Sample configuration that reproduces the problem:

set daemon 5

set httpd port 2812 allow localhost

check file test-child with path /tmp/test2
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"
    depends on test-parent

check file test-parent with path /tmp/test1
    start program = "/usr/bin/true"
    stop program = "/usr/bin/true"

Steps to reproduce:

  1. When /tmp/test1 and /tmp/test2 exist, all is green
  2. rm -f /tmp/test1 /tmp/test2
  3. Monit now detects that the test-child file doesn’t exist and calls the start action, which tries to start test-parent. As the start script is a dummy, the test-parent startup fails, and test-child now gets an exec error with “failed to start -- could not start required services: 'test-parent'”
  4. touch /tmp/test1 /tmp/test2
  5. Now all services are running fine, but test-child retains the exec error flag, because no start action needs to be called, and only a start action would reset the error flag
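
The sequence above can be modelled with a small, self-contained C sketch. It is only an illustration: the Service struct, the EVENT_* constants and the function names are simplified stand-ins invented here, not Monit's real definitions, and the model assumes a single parent per child.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-ins for illustration; not Monit's real definitions. */
enum { EVENT_EXEC = 0x1, EVENT_NONEXIST = 0x2 };

typedef struct Service {
    const char *name;
    bool file_exists;        /* stands in for the "check file" path test */
    unsigned error;          /* bitmask of active error events */
    struct Service *parent;  /* single dependency, NULL if none */
} Service;

/* Start action. The dummy start program never creates the file, so a
   missing parent stays missing and the child start is refused. */
static void do_start(Service *s) {
    assert(s != NULL);
    if (s->parent && !s->parent->file_exists) {
        s->error |= EVENT_EXEC;   /* "could not start required services" */
        return;
    }
    s->error &= ~(unsigned)EVENT_EXEC;  /* only a start attempt clears the flag */
}

/* One validation cycle: the start action runs only if the file is missing. */
static void validate(Service *s) {
    assert(s != NULL);
    if (!s->file_exists) {
        s->error |= EVENT_NONEXIST;
        do_start(s);
    } else {
        s->error &= ~(unsigned)EVENT_NONEXIST;
        /* note: nothing on this path touches EVENT_EXEC */
    }
}

/* Replays steps 1-5 and reports whether the child still carries EVENT_EXEC. */
static bool reproduce_sticky_exec_error(void) {
    Service parent = { "test-parent", true, 0, NULL };
    Service child  = { "test-child",  true, 0, &parent };

    validate(&parent); validate(&child);             /* step 1: all green */
    assert(child.error == 0);

    parent.file_exists = child.file_exists = false;  /* step 2: rm -f */
    validate(&parent); validate(&child);             /* step 3: exec error set */

    parent.file_exists = child.file_exists = true;   /* step 4: touch */
    validate(&parent); validate(&child);             /* step 5: files are back */

    return (child.error & EVENT_EXEC) != 0;
}
```

In this model reproduce_sticky_exec_error() returns true: after step 5 both files exist, yet the child's exec flag survives because validate() never calls do_start() again.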

Solution:

It would be good to assign a new event type (e.g. Event_ParentFailure) to the situation where the child is in an error state because the parent service is down. Using Event_Exec is problematic, as it is ambiguous: the child's own script may fail to execute, which would set Event_Exec too. When the child service test starts and it has the “Event_ParentFailure” error active, it should rescan the state of its parents, and if all parents are OK, it can clear the Event_ParentFailure error.
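
A minimal sketch of how such a rescan might look, under stated assumptions: the Service struct, the EVENT_* constants, MAX_PARENTS and all function names below are hypothetical and simplified, and "parent is OK" is taken to mean that the parent has no active error bits. It is not Monit's actual code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins; not Monit's real types or API. */
enum {
    EVENT_EXEC           = 0x1,
    EVENT_PARENT_FAILURE = 0x2   /* the proposed new event type */
};

#define MAX_PARENTS 4

typedef struct Service {
    const char *name;
    unsigned error;                        /* bitmask of active error events */
    struct Service *parents[MAX_PARENTS];  /* services this one depends on */
    size_t parent_count;
} Service;

/* Assumption: a parent counts as OK when it has no active errors. */
static bool all_parents_ok(const Service *s) {
    assert(s != NULL && s->parent_count <= MAX_PARENTS);
    for (size_t i = 0; i < s->parent_count; i++)
        if (s->parents[i]->error != 0)
            return false;
    return true;
}

/* Start path: a failed parent sets the dedicated flag, not EVENT_EXEC. */
static bool start_child(Service *s) {
    if (!all_parents_ok(s)) {
        s->error |= EVENT_PARENT_FAILURE;
        return false;
    }
    s->error &= ~(unsigned)EVENT_PARENT_FAILURE;
    return true;
}

/* Called when the child's test starts: rescan the parents and clear the
   dedicated flag once all of them are OK. EVENT_EXEC is left untouched,
   so a genuine exec failure of the child's own script is not hidden. */
static void recheck_parent_failure(Service *s) {
    if ((s->error & EVENT_PARENT_FAILURE) && all_parents_ok(s))
        s->error &= ~(unsigned)EVENT_PARENT_FAILURE;
}
```

Keeping the two flags separate is the point of the proposal: the rescan can clear the parent-related error on its own, while an exec error raised by the child's own start program stays set until a start action succeeds.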

Unfortunately, we’re out of event types (see Event_Type in event.h), so we need to refactor the event handler first to support more event types => this issue is blocked until then.
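
One possible direction for that refactoring, sketched under an assumption that is not verified here against event.h: if each event type occupies one bit of a 32-bit signed mask, at most 31 types fit, and widening the mask to 64 bits would free additional bits. The event_mask_t type, the EVENT_BIT macro, the bit positions and the helper functions are all illustrative inventions, not Monit's API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: one bit per event type. A 32-bit signed mask then has 31
   usable bits; a 64-bit unsigned mask leaves room for more types. */
typedef uint64_t event_mask_t;
#define EVENT_BIT(n) ((event_mask_t)1 << (n))

enum { LEGACY_USABLE_BITS = 31 };

static_assert(sizeof(event_mask_t) * 8 > LEGACY_USABLE_BITS,
              "the wider mask must offer more bits than the legacy one");

/* Bit positions are illustrative only. */
static const event_mask_t EVENT_EXEC           = EVENT_BIT(2);
static const event_mask_t EVENT_PARENT_FAILURE = EVENT_BIT(LEGACY_USABLE_BITS);

static event_mask_t event_set(event_mask_t mask, event_mask_t bit) {
    return mask | bit;
}

static event_mask_t event_clear(event_mask_t mask, event_mask_t bit) {
    return mask & ~bit;
}

static bool event_is_set(event_mask_t mask, event_mask_t bit) {
    return (mask & bit) != 0;
}
```

A wider mask is only one option; any code that stores, prints or serializes the mask as a 32-bit value would need to change along with it.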

Comments (2)

  1. Lutz Mader

    Thanks Tildeslash,
    nice to see someone having a look at the problem.

    An additional event type is a nice idea to give a hint about the real problem; today this can be seen in the monit log only. Unfortunately, there is a lack of event types, you are right.

    With regards,
    Lutz
