monit exits after starting with exit code 1

Issue #454 closed
Aniket Bhat created an issue

I occasionally see the main monit process exit with exit code 1. I am attaching the monit log and monitrc files, along with the output of systemctl status monit.

[root@localhost ~]# systemctl status monit
● monit.service - SYSV: Monit is a utility for managing and monitoring processes,
   Loaded: loaded (/etc/rc.d/init.d/monit)
   Active: failed (Result: exit-code) since Tue 2016-08-23 17:14:52 UTC; 1 day 5h ago
     Docs: man:systemd-sysv-generator(8)
  Process: 3173 ExecStop=/etc/rc.d/init.d/monit stop (code=exited, status=0/SUCCESS)
  Process: 868 ExecStart=/etc/rc.d/init.d/monit start (code=exited, status=0/SUCCESS)
 Main PID: 1097 (code=exited, status=1/FAILURE)

Aug 23 16:28:51 localhost.localdomain systemd[1]: Starting SYSV: Monit is a utility for managing and monitoring processes,...
Aug 23 16:28:51 localhost.localdomain monit[868]: Starting monit: Starting Monit 5.17.1 daemon with http interface at [localhost]:2812
Aug 23 16:28:51 localhost.localdomain monit[868]: [  OK  ]
Aug 23 16:28:51 localhost.localdomain systemd[1]: PID file /var/run/monit.pid not readable (yet?) after start.
Aug 23 16:28:51 localhost.localdomain systemd[1]: Started SYSV: Monit is a utility for managing and monitoring processes,.
Aug 23 17:14:52 host-97-77.openstacklocal systemd[1]: monit.service: main process exited, code=exited, status=1/FAILURE
Aug 23 17:14:52 host-97-77.openstacklocal monit[3173]: Stopping monit: [FAILED]
Aug 23 17:14:52 host-97-77.openstacklocal systemd[1]: Unit monit.service entered failed state.
Aug 23 17:14:52 host-97-77.openstacklocal systemd[1]: monit.service failed.
[root@localhost ~]#

This is in a VM environment. The VM boots with localhost.localdomain as its FQDN, and monit is started as a service at boot. After boot, I change the hostname from the Linux CLI:

hostname host-97-77.openstacklocal

I then run monit reload, which leads to the crash shown in monit.log. Is this a known issue? Is there any way to turn on more debugging to find out what is causing the crash?

Thanks, Aniket.

Comments (11)

  1. Tildeslash repo owner

    There is a configuration error, so monit stopped on reload:

    [UTC Aug 23 17:14:52] error    : Depend service 'etcd-status' is not defined in the control file
    

    Please remove the "depends on etcd-status" statement from your monit configuration.

  2. Aniket Bhat reporter

    I am not sure I understand. The "depends on etcd-status" statement is a valid dependency, specified in one of the monit files I use to monitor a process. Why is that a configuration error? Is there a specific order in which monit reads the .monit files from the include directory?

    I have a monit file for etcd, called etcd.monit, which defines the etcd-status check.

    check process etcd matching "etcd --name"
      start program "/usr/local/bin/docker-compose -p nuage -f /opt/vsd/docker/etcd-ha.yml up -d"
      stop program "/usr/bin/docker stop nuage_etcd_1"
      group common
    
      check program etcd-status with path /opt/vsd/docker/test-scripts/etcd-status.sh
      if status != 0 then alert
        group common
        group check
    

    and I have another monit file in the same include directory, called zookeeper.monit, which declares the dependency:

    check process zookeeper matching "nuageZookeeper"
      start program "/usr/local/bin/docker-compose -p nuage -f /opt/vsd/docker/zookeeper.yml up -d"
      stop program "/usr/bin/docker stop nuage_zookeeper_1"
      depends on etcd-status
      group check
    

    Is it possible that zookeeper.monit is read before etcd.monit? Are they read in alphabetical order?

    Thanks, Aniket.

  3. Aniket Bhat reporter

    According to the monit documentation, the included monit files are read in unsorted order. I am thinking of including etcd.monit explicitly first and then including all other monit control files with a wildcard glob string. Would this be the recommended approach?

  4. Tildeslash repo owner

    The dependency reference is checked at the end of configuration file parsing, after all files have been included, so the include order is not significant.

    I tried to reproduce the problem, but it works fine in our lab with the latest monit release. Can you reproduce the issue?

  5. Tildeslash repo owner

    As mentioned in my previous update, the order is not significant. All "depends on <service>" statements are evaluated in the postparse() function, which is executed after all files have been included.

    For example, if the files are included in the following order:

    /etc/monit.d/include/00-A:

    check process A matching "A"
        depends on B
    

    /etc/monit.d/include/01-B:

    check process B matching "B"
    

    Then it'll work normally, even though B is included after A.
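To make the include-order point concrete, here is a small Python model of a two-pass parse: the first pass only records service names and "depends on" targets from every included file, and a final pass (named after the postparse() step mentioned above) checks that each target exists. This is a hypothetical sketch, not monit's actual parser (monit is written in C); the regular expressions, `parse`, `postparse` and `ConfigError` are assumptions made for illustration, and the example files are trimmed versions of the etcd.monit and zookeeper.monit files quoted earlier in this thread.

```python
import re

# Hypothetical, simplified model of the two-pass behaviour described above.
# It is NOT monit's parser: it only understands "check <type> <name> ..." and a
# single-target "depends on <name>" line (monit also accepts comma-separated lists).

CHECK_RE = re.compile(r"^\s*check\s+\w+\s+(\S+)")
DEPENDS_RE = re.compile(r"^\s*depends\s+on\s+(\S+)")


class ConfigError(Exception):
    pass


def parse(files):
    """Pass 1: read every included file, in include order.

    files is a list of (filename, text) pairs. Service names and dependency
    references are only recorded here; nothing is resolved yet.
    """
    services = []   # names from "check ..." statements, in the order seen
    pending = []    # (service, dependency) pairs to resolve later
    current = None
    for _filename, text in files:
        for line in text.splitlines():
            m = CHECK_RE.match(line)
            if m:
                current = m.group(1)
                services.append(current)
                continue
            m = DEPENDS_RE.match(line)
            if m and current is not None:
                pending.append((current, m.group(1)))
    postparse(services, pending)
    return services


def postparse(services, pending):
    """Pass 2: runs once, after all files are included, so order is irrelevant."""
    defined = set(services)
    for _service, dep in pending:
        if dep not in defined:
            # Message text copied from the monit.log line quoted in this thread.
            raise ConfigError(
                f"Depend service '{dep}' is not defined in the control file")


if __name__ == "__main__":
    etcd = ("etcd.monit",
            'check process etcd matching "etcd --name"\n'
            "check program etcd-status with path /opt/vsd/docker/test-scripts/etcd-status.sh\n")
    zookeeper = ("zookeeper.monit",
                 'check process zookeeper matching "nuageZookeeper"\n'
                 "  depends on etcd-status\n")
    # Dependency defined in a file read later: accepted.
    print(parse([zookeeper, etcd]))  # ['zookeeper', 'etcd', 'etcd-status']
    try:
        parse([zookeeper])           # etcd.monit missing at reload time
    except ConfigError as err:
        print("error:", err)
```

Because resolution happens only after every file has been read, reading zookeeper.monit first is accepted, while leaving etcd.monit out entirely produces the same message that appears in monit.log.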

  6. Aniket Bhat reporter

    Got it. If there is any additional debugging I can turn on to capture monit logs when the crash happens, I can do that. I am not sure what the root cause of the exit status 1 is. A backtrace or something else that indicates the root cause would have been helpful.

  7. Tildeslash repo owner

    As mentioned, the root cause is that the service "etcd-status", which "zookeeper" depends on, was not found during configuration parsing (triggered by "monit reload"). An invalid "depends on" reference is a configuration error, so monit exits.

    Is it possible that the file "etcd.monit", which contains the definition of "etcd-status", was temporarily removed? After all, "monit reload" was probably called as part of a configuration change.

    You can start monit with the "-v" option to get verbose debug output.

    You can also run "monit -t" periodically, for example from cron, and log its output. The "-t" option only validates the configuration; if "etcd-status" is missing, it will display an error.
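A minimal Python sketch of the periodic check suggested above, assuming monit is on the PATH, the script runs from root's crontab (so it can read the control file), and "monit -t" exits with a non-zero status when validation fails. The script name, log path and schedule are hypothetical. Each run records one timestamped line, so a failed validation can be matched against the time of a "monit reload".

```python
import datetime
import subprocess
import sys

# Minimal sketch, assuming monit is on PATH and that "monit -t" exits non-zero
# when the control file (including every included file) fails validation.
# "monit -t" only checks the configuration; it does not affect the running daemon.


def validate(cmd=("monit", "-t")):
    """Run the configuration check once and return (ok, log_line)."""
    stamp = datetime.datetime.now(datetime.timezone.utc).strftime(
        "%Y-%m-%d %H:%M:%S UTC")
    label = " ".join(cmd)
    try:
        result = subprocess.run(list(cmd), capture_output=True, text=True)
    except FileNotFoundError:
        return False, f"[{stamp}] {label}: command not found"
    output = (result.stdout + result.stderr).strip()
    if result.returncode == 0:
        return True, f"[{stamp}] {label} OK: {output}"
    return False, f"[{stamp}] {label} FAILED (exit {result.returncode}): {output}"


def main(log_path=None):
    ok, line = validate()
    if log_path is None:
        print(line)  # cron typically mails stdout to the crontab owner
    else:
        with open(log_path, "a") as log:
            log.write(line + "\n")
    return ok


if __name__ == "__main__":
    # Optional first argument: a log file to append to (hypothetical path);
    # without it the result is printed.
    main(sys.argv[1] if len(sys.argv) > 1 else None)
```

For example, a crontab entry such as `*/5 * * * * /usr/bin/python3 /root/check_monitrc.py /var/log/monit-validate.log` would run it every five minutes; both paths are placeholders.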
