Monit shows a process is running due to a stale pid file

Issue #484 open
Shrenik
created an issue

I'm facing a similar issue. I run my services in a Docker container that has monit as PID 1 and monitors mongodb, vault and nginx. When I start the container for the first time, everything comes up properly. But when I run docker stop "mycontainer" followed by docker start "mycontainer", a residual vault pid file from the previous startup makes monit behave oddly: it shows the status of vault as "Running" even though the process doesn't exist. This is the Monit status for vault:

Process 'vault'
  status                            Running
  monitoring status                 Monitored
  pid                               -
  parent pid                        -
  uid                               -
  effective uid                     -
  gid                               -
  uptime                            -
  threads                           -
  children                          -
  memory                            -
  memory total                      -
  memory percent                    -
  memory percent total              -
  cpu percent                       -
  cpu percent total                 -
  data collected                    Mon, 26 Sep 2016 01:06:29

This is a snippet from the monit log file. It seems monit didn't test at all whether the pid in the vault.pid file is the same as that of the running process.

[PDT Sep 26 01:15:12] debug    : 'mongodb' process test failed [pid=147] -- No such process
[PDT Sep 26 01:15:12] info     : 'mongodb' start: /bin/bash
[PDT Sep 26 01:15:12] debug    : 'mongodb' started
[PDT Sep 26 01:15:12] info     : 'mongodb' process is running with pid 15
[PDT Sep 26 01:15:12] debug    : 'mongodb' zombie check succeeded
[PDT Sep 26 01:15:19] debug    : 'nginx' process test failed [pid=252] -- No such process
[PDT Sep 26 01:15:19] info     : 'nginx' start: /sbin/start-stop-daemon
[PDT Sep 26 01:15:19] debug    : 'nginx' started
[PDT Sep 26 01:15:40] debug    : 'vault' process is running with pid 225
[PDT Sep 26 01:15:40] debug    : 'vault' zombie check succeeded

The issue sometimes occurs with vault and sometimes with another process, and I have been unable to resolve it.
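
The stale-pidfile condition described above can be detected before monit ever reads the file. The following is a minimal Python sketch, not Monit's own logic: the helper names (`read_pid`, `pid_is_alive`, `pidfile_is_stale`) are hypothetical, and it assumes a POSIX system where signal 0 probes a PID without delivering a signal.

```python
import os
from pathlib import Path


def read_pid(pidfile):
    """Return the integer PID stored in pidfile, or None if missing or unparsable."""
    try:
        text = Path(pidfile).read_text().strip()
        return int(text.split()[0])
    except (OSError, ValueError, IndexError):
        return None


def pid_is_alive(pid):
    """Probe a PID with signal 0, which checks existence without signalling."""
    if pid is None or pid <= 0:
        return False
    try:
        os.kill(pid, 0)
    except ProcessLookupError:
        return False  # no such process
    except PermissionError:
        return True   # process exists but belongs to another user
    return True


def pidfile_is_stale(pidfile):
    """A pidfile is stale when it exists but its PID is not a live process."""
    return Path(pidfile).exists() and not pid_is_alive(read_pid(pidfile))
```

A container entrypoint could call `pidfile_is_stale()` on each service's pidfile and delete the stale ones before handing control to monit. Note the limit of this check: after a container restart, low PIDs are quickly reused, so a leftover pidfile can point at a live but unrelated process, and an existence probe alone cannot tell the difference.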

Comments (13)

  1. Shrenik reporter

    @Tildeslash I ran into the same problem again today after updating monit to 5.19.0, but the logs are different this time.

    monit status
    Monit 5.19.0 uptime: 10m
    
    Process 'mongodb'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          125
      parent pid                   1
      uid                          101
      effective uid                101
      gid                          103
      uptime                       9m
      threads                      28
      children                     0
      cpu                          0.1%
      cpu total                    0.1%
      memory                       0.1% [23.8 MB]
      memory total                 0.1% [23.8 MB]
      data collected               Fri, 14 Oct 2016 15:42:55
    
    Process 'vault'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          31
      parent pid                   1
      uid                          1002
      effective uid                1002
      gid                          4
      uptime                       9m
      threads                      13
      children                     0
      cpu                          0.0%
      cpu total                    0.0%
      memory                       0.0% [7.3 MB]
      memory total                 0.0% [7.3 MB]
      data collected               Fri, 14 Oct 2016 15:42:55
    
    Process 'nginx'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          -
      parent pid                   -
      uid                          -
      effective uid                -
      gid                          -
      uptime                       -
      threads                      -
      children                     -
      cpu                          -
      cpu total                    -
      memory                       -
      memory total                 -
      data collected               Fri, 14 Oct 2016 15:42:55
    

    Here is a snippet of the monit logs. It errors out at nginx with "failed to get service data" and keeps giving that error throughout. Then why does the status show 'Running'? There is a stale pid file for nginx from the previous run. I suspect that may be the cause, but monit should compare the pid in the pid file with the actual process id and then update the status.

    [PDT Oct 14 15:32:52] info     : Starting Monit 5.19.0 daemon with http interface at [localhost]:9016
    [PDT Oct 14 15:32:52] info     : 'ise22d1' Monit 5.19.0 started
    [PDT Oct 14 15:32:52] error    : 'mongodb' process is not running
    [PDT Oct 14 15:32:52] info     : 'mongodb' trying to restart
    [PDT Oct 14 15:32:52] info     : 'mongodb' start: /bin/bash
    [PDT Oct 14 15:33:22] error    : 'mongodb' failed to start (exit status 0) -- /bin/bash:  * start-stop-daemon: /usr/bin/mongod is already running
    
    [PDT Oct 14 15:33:22] error    : 'vault' process is not running
    [PDT Oct 14 15:33:22] info     : 'vault' trying to restart
    [PDT Oct 14 15:33:22] info     : 'vault' start: /bin/bash
    [PDT Oct 14 15:33:30] info     : 'mongodb' start: /bin/bash
    [PDT Oct 14 15:33:30] info     : 'mongodb' started
    [PDT Oct 14 15:33:30] info     : 'mongodb' process is running with pid 125
    [PDT Oct 14 15:33:30] info     : 'vault' process is running with pid 31
    [PDT Oct 14 15:33:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:34:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:34:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:35:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:35:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:36:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:36:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:37:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:37:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:38:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:38:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:39:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:39:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:40:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:40:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:41:25] error    : 'nginx' failed to get service data
    

    I think this is similar to Issue #151. I believe it's quite critical for monit's orchestration mechanism.
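
The inconsistent state reported above (status 'Running' while pid and every other statistic is '-') can be spotted mechanically from the text of `monit status`. The sketch below is a hypothetical helper, written against the output layout pasted in this thread only; other Monit versions may format the output differently, and the function names are not part of Monit.

```python
import re


def parse_monit_status(text):
    """Parse `monit status` text into {service_name: {field: value}}."""
    services = {}
    current = None
    for line in text.splitlines():
        header = re.match(r"\s*Process '([^']+)'\s*$", line)
        if header:
            current = services.setdefault(header.group(1), {})
            continue
        # Field lines look like "  monitoring status            Monitored":
        # a name, a run of two or more spaces, then the value.
        field = re.match(r"\s+(\S(?:.*?\S)?)\s{2,}(\S.*?)\s*$", line)
        if field and current is not None:
            current[field.group(1)] = field.group(2)
    return services


def running_without_pid(services):
    """Names of services reported as Running/OK although no pid was collected."""
    return sorted(
        name
        for name, fields in services.items()
        if fields.get("status") in ("Running", "OK") and fields.get("pid") == "-"
    )
```

Feeding it the captured output of `monit status` and alerting on a non-empty result from `running_without_pid()` would flag services such as nginx in the listing above, where the reported status and the collected data disagree.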

  2. Tildeslash repo owner

    Monit reads the PID from the pidfile and checks whether the process is running by searching the process tree for a matching PID. It seems the matching PID is present, but with no statistics data.

    Can you please run monit in debug mode (monit -vI), send the output, and attach the output of "ps -ef"?

    Can you replicate the problem (i.e. provide steps that trigger the situation)?

    P.S. Note that monit also supports monitoring by process pattern (using "check process <name> matching <pattern>") ... this method doesn't need a pidfile.
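
Matching by pattern, as suggested above, sidesteps the pidfile entirely: the process is found by searching command lines rather than by trusting a number stored on disk. A rough Python sketch of the idea follows. It is not Monit's implementation; `match_processes` and the process table are made up for illustration, and Python's `re` syntax may differ in places from the regex dialect Monit uses.

```python
import re


def match_processes(cmdlines, pattern):
    """Return the sorted PIDs whose command line contains a regex match."""
    regex = re.compile(pattern)
    return sorted(pid for pid, cmd in cmdlines.items() if regex.search(cmd))


# Hypothetical process table (PID -> full command line), for illustration only.
example_table = {
    1: "/usr/bin/monit -I",
    31: "/usr/local/bin/vault server -config=/etc/vault.hcl",
    125: "/usr/bin/mongod --config /etc/mongod.conf",
}
```

With this table, `match_processes(example_table, "vault")` selects only PID 31. A pattern that is too loose (for example one that also matches a shell wrapper or a `grep` for the same name) would select more than one process, so the pattern should be as specific as the command line allows.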

  3. Scott Halstead

    I see this periodically too. It appears to happen when the machine is undergoing maintenance and is bounced multiple times within a short window. In the case below, aggrocrag is not restarted at 05:42.

    [EST Nov 20 03:08:08] info : Monit HTTP server started
    [EST Nov 20 03:08:08] info : 'intlnobkcte05' Monit 5.17.1 started
    [EST Nov 20 03:08:08] error : 'logchipper' process is not running
    [EST Nov 20 03:08:08] info : 'logchipper' trying to restart
    [EST Nov 20 03:08:08] info : 'logchipper' restart: /usr/bin/sudo
    [EST Nov 20 03:08:11] error : 'collectd' process is not running
    [EST Nov 20 03:08:11] info : 'collectd' trying to restart
    [EST Nov 20 03:08:11] info : 'collectd' restart: /usr/bin/sudo
    [EST Nov 20 03:08:13] error : 'aggrocrag' process is not running
    [EST Nov 20 03:08:13] info : 'aggrocrag' trying to restart
    [EST Nov 20 03:08:13] info : 'aggrocrag' restart: /usr/bin/sudo
    [EST Nov 20 03:08:30] info : 'logchipper' process is running with pid 1407
    [EST Nov 20 03:08:30] info : 'collectd' process is running with pid 1966
    [EST Nov 20 03:08:30] info : 'aggrocrag' process is running with pid 2014
    [EST Nov 20 04:28:16] info : Shutting down Monit HTTP server
    [EST Nov 20 04:28:16] info : Monit HTTP server stopped
    [EST Nov 20 04:28:16] info : Monit daemon with pid [1319] stopped
    [EST Nov 20 04:28:16] info : 'intlnobkcte05' Monit 5.17.1 stopped
    [EST Nov 20 05:42:14] info : Starting Monit 5.17.1 daemon with http interface at [*]:2812
    [EST Nov 20 05:42:14] info : Starting Monit HTTP server at [*]:2812
    [EST Nov 20 05:42:14] info : Monit HTTP server started
    [EST Nov 20 05:42:14] info : 'intlnobkcte05' Monit 5.17.1 started
    [EST Nov 20 05:42:14] error : 'logchipper' process is not running
    [EST Nov 20 05:42:14] info : 'logchipper' trying to restart
    [EST Nov 20 05:42:14] info : 'logchipper' restart: /usr/bin/sudo
    [EST Nov 20 05:42:16] error : 'collectd' process is not running
    [EST Nov 20 05:42:16] info : 'collectd' trying to restart
    [EST Nov 20 05:42:16] info : 'collectd' restart: /usr/bin/sudo
    [EST Nov 20 05:42:33] info : 'logchipper' process is running with pid 1416
    [EST Nov 20 05:42:33] info : 'collectd' process is running with pid 2003
    
  4. gshejwalkar

    We are also facing the same issue. The monit version is 5.16.

    [UTC Jan 19 17:40:27] info : Starting Monit 5.16 daemon with http interface at [127.0.0.1]:2812
    [UTC Jan 19 17:40:27] info : Starting Monit HTTP server at [127.0.0.1]:2812

  5. Scott Halstead

    A simple conf example.

    check process logchipper matching "logchipper.*/opt/inf/etc/logchipper.json"
      start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
      restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      if 3 restarts within 6 cycles then unmonitor
    
  6. Jan Semmelink

    So am I right in saying the problem still exists and will not be solved, and users should move away from using simple PID files?

    Cause I face the same issue right now, with 5.22.0 using PID files, and no process is running with the PID in the file:

    $ sudo monit --version
    This is Monit version 5.22.0
    Built with ssl, with ipv6, with compression, with pam and with large files
    Copyright (C) 2001-2017 Tildeslash Ltd. All Rights Reserved.

    Log:

    [SAST Oct 9 09:15:29] error : 'etl-ocs-air-etl-file-decoder-exec-00' failed to get service data
    [SAST Oct 9 09:15:29] error : 'archive-occ-etl-file-watcher-00' failed to get service data

    $ sudo monit status etl-ocs-air-etl-file-decoder-exec-00
    Monit 5.22.0 uptime: 2h 13m

    Process 'etl-ocs-air-etl-file-decoder-exec-00'
      status                       OK
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          -
      parent pid                   -
      uid                          -
      effective uid                -
      gid                          -
      uptime                       -
      threads                      -
      children                     -
      cpu                          -
      cpu total                    -
      memory                       -
      memory total                 -
      data collected               Tue, 09 Oct 2018 09:19:36

    $ cat stream/occ/pid/etl-file-watcher.00.pid
    2084
    $ ps -ef | grep 2084
    archive 15512 14860 0 09:20 pts/20 00:00:00 grep 2084

    My monit conf for this process is:

    CHECK PROCESS archive-occ-etl-file-watcher-00 WITH PIDFILE /home/archive/stream/occ/pid/etl-file-watcher.00.pid
      GROUP archive
      GROUP archive-occ
      START PROGRAM = "/bin/bash -c 'source /home/archive/conf/env.sh; /home/archive/etl/libexec/etl-file-watcher -d -i 0 -s occ --out.format="asn.1" 2>&1 | /sbin/cronolog --symlink=/home/archive/stream/occ/log/etl-file-watcher.00.log /home/archive/stream/occ/log/%Y-%m-%d-etl-file-watcher.00.log &'" as uid "archive" and gid "archive"
      STOP PROGRAM = "/bin/bash -c 'kill -s SIGTERM $(cat /home/archive/stream/occ/pid/etl-file-watcher.00.pid)'"

    Everything had been working for at least a month, and I got this just today.
