Monit shows a process is running due to a stale pid file

Issue #484 open
Shrenik created an issue

I'm facing a similar issue. I'm running my services in a Docker container that has monit as PID 1 and monitors mongodb, vault and nginx. When I run the container for the first time, everything comes up properly. But when I do a docker stop "my container" followed by docker start "mycontainer", a residual vault pid file from the previous startup makes monit behave oddly: it showed the status of vault as "Running" even though the process didn't exist. This is the Monit status for vault:

Process 'vault'
  status                            Running
  monitoring status                 Monitored
  pid                               -
  parent pid                        -
  uid                               -
  effective uid                     -
  gid                               -
  uptime                            -
  threads                           -
  children                          -
  memory                            -
  memory total                      -
  memory percent                    -
  memory percent total              -
  cpu percent                       -
  cpu percent total                 -
  data collected                    Mon, 26 Sep 2016 01:06:29

This is a snippet from the monit log file. It seems monit never checked whether the pid in the vault.pid file matches a running vault process.

[PDT Sep 26 01:15:12] debug    : 'mongodb' process test failed [pid=147] -- No such process
[PDT Sep 26 01:15:12] info     : 'mongodb' start: /bin/bash
[PDT Sep 26 01:15:12] debug    : 'mongodb' started
[PDT Sep 26 01:15:12] info     : 'mongodb' process is running with pid 15
[PDT Sep 26 01:15:12] debug    : 'mongodb' zombie check succeeded
[PDT Sep 26 01:15:19] debug    : 'nginx' process test failed [pid=252] -- No such process
[PDT Sep 26 01:15:19] info     : 'nginx' start: /sbin/start-stop-daemon
[PDT Sep 26 01:15:19] debug    : 'nginx' started
[PDT Sep 26 01:15:40] debug    : 'vault' process is running with pid 225
[PDT Sep 26 01:15:40] debug    : 'vault' zombie check succeeded

This issue sometimes occurs with vault and sometimes with another process, and I have been unable to resolve it.

Comments (26)

  1. Shrenik reporter

    @tildeslash I still ran into the same problem today after updating monit to 5.19.0, but the logs are different this time.

    monit status
    Monit 5.19.0 uptime: 10m
    
    Process 'mongodb'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          125
      parent pid                   1
      uid                          101
      effective uid                101
      gid                          103
      uptime                       9m
      threads                      28
      children                     0
      cpu                          0.1%
      cpu total                    0.1%
      memory                       0.1% [23.8 MB]
      memory total                 0.1% [23.8 MB]
      data collected               Fri, 14 Oct 2016 15:42:55
    
    Process 'vault'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          31
      parent pid                   1
      uid                          1002
      effective uid                1002
      gid                          4
      uptime                       9m
      threads                      13
      children                     0
      cpu                          0.0%
      cpu total                    0.0%
      memory                       0.0% [7.3 MB]
      memory total                 0.0% [7.3 MB]
      data collected               Fri, 14 Oct 2016 15:42:55
    
    Process 'nginx'
      status                       Running
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          -
      parent pid                   -
      uid                          -
      effective uid                -
      gid                          -
      uptime                       -
      threads                      -
      children                     -
      cpu                          -
      cpu total                    -
      memory                       -
      memory total                 -
      data collected               Fri, 14 Oct 2016 15:42:55
    

    Here is a snippet of the monit logs. It errors out at nginx, saying it can't get service data, and keeps giving that error throughout. Then why does it show 'Running' in the status? There is a stale pid file for nginx from the previous run. I suspect that may be the cause, but monit should compare the pid in the pid file against the running processes and then update the status.

    [PDT Oct 14 15:32:52] info     : Starting Monit 5.19.0 daemon with http interface at [localhost]:9016
    [PDT Oct 14 15:32:52] info     : 'ise22d1' Monit 5.19.0 started
    [PDT Oct 14 15:32:52] error    : 'mongodb' process is not running
    [PDT Oct 14 15:32:52] info     : 'mongodb' trying to restart
    [PDT Oct 14 15:32:52] info     : 'mongodb' start: /bin/bash
    [PDT Oct 14 15:33:22] error    : 'mongodb' failed to start (exit status 0) -- /bin/bash:  * start-stop-daemon: /usr/bin/mongod is already running
    
    [PDT Oct 14 15:33:22] error    : 'vault' process is not running
    [PDT Oct 14 15:33:22] info     : 'vault' trying to restart
    [PDT Oct 14 15:33:22] info     : 'vault' start: /bin/bash
    [PDT Oct 14 15:33:30] info     : 'mongodb' start: /bin/bash
    [PDT Oct 14 15:33:30] info     : 'mongodb' started
    [PDT Oct 14 15:33:30] info     : 'mongodb' process is running with pid 125
    [PDT Oct 14 15:33:30] info     : 'vault' process is running with pid 31
    [PDT Oct 14 15:33:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:34:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:34:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:35:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:35:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:36:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:36:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:37:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:37:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:38:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:38:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:39:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:39:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:40:25] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:40:55] error    : 'nginx' failed to get service data
    [PDT Oct 14 15:41:25] error    : 'nginx' failed to get service data
    

    I think this is similar to Issue #151. I believe it's quite critical in terms of monit's orchestration mechanism.

  2. Tildeslash repo owner

    Monit reads the PID from the pidfile and checks if the process is running by searching the process tree for matching PID. It seems the matching PID is present, but with no statistics data.

    Please can you run monit in debug mode? (monit -vI) and send output + attach output of "ps -ef".

    Can you replicate the problem? (i.e. provide steps which can trigger the situation)?

    P.S. Note that monit also supports monitoring by process pattern (using "check process <name> matching <pattern>") ... this method doesn't need a pidfile.
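
    As a rough illustration of the pidfile lookup described at the top of this comment, the check can be approximated in shell. This is a sketch only, not Monit's implementation: the function name pidfile_status and its four result strings are made up for this example.

```shell
#!/bin/sh
# Sketch only: approximates a pidfile liveness check. Not Monit's code.

# pidfile_status FILE (hypothetical helper)
# Prints one of:
#   missing  - the pidfile does not exist or is unreadable
#   invalid  - the first line is not a positive integer
#   stale    - no process with that PID was found
#   alive    - a process with that PID exists
pidfile_status() {
    pidfile=$1
    [ -r "$pidfile" ] || { echo missing; return 0; }

    # Take the first line and strip all whitespace.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        ''|*[!0-9]*|0) echo invalid; return 0 ;;
    esac

    # kill -0 sends no signal and only tests for existence, but it fails
    # with "permission denied" for other users' processes, so fall back
    # to ps -p in that case.
    if kill -0 "$pid" 2>/dev/null || ps -p "$pid" >/dev/null 2>&1; then
        echo alive
    else
        echo stale
    fi
}
```

    For example, pidfile_status /var/run/nginx.pid (the path is an assumption) would print stale for a leftover file whose PID no longer exists. A check like this cannot tell whether a live PID belongs to the expected program, which matters in a restarted container where low PIDs are reused quickly.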

  3. Scott Halstead

    I see this too, periodically. It appears to happen when the machine is undergoing maintenance and is bounced multiple times within a short window. In the case below, aggrocrag is not restarted at 05:42:

    [EST Nov 20 03:08:08] info : Monit HTTP server started
    [EST Nov 20 03:08:08] info : 'intlnobkcte05' Monit 5.17.1 started
    [EST Nov 20 03:08:08] error : 'logchipper' process is not running
    [EST Nov 20 03:08:08] info : 'logchipper' trying to restart
    [EST Nov 20 03:08:08] info : 'logchipper' restart: /usr/bin/sudo
    [EST Nov 20 03:08:11] error : 'collectd' process is not running
    [EST Nov 20 03:08:11] info : 'collectd' trying to restart
    [EST Nov 20 03:08:11] info : 'collectd' restart: /usr/bin/sudo
    [EST Nov 20 03:08:13] error : 'aggrocrag' process is not running
    [EST Nov 20 03:08:13] info : 'aggrocrag' trying to restart
    [EST Nov 20 03:08:13] info : 'aggrocrag' restart: /usr/bin/sudo
    [EST Nov 20 03:08:30] info : 'logchipper' process is running with pid 1407
    [EST Nov 20 03:08:30] info : 'collectd' process is running with pid 1966
    [EST Nov 20 03:08:30] info : 'aggrocrag' process is running with pid 2014
    [EST Nov 20 04:28:16] info : Shutting down Monit HTTP server
    [EST Nov 20 04:28:16] info : Monit HTTP server stopped
    [EST Nov 20 04:28:16] info : Monit daemon with pid [1319] stopped
    [EST Nov 20 04:28:16] info : 'intlnobkcte05' Monit 5.17.1 stopped
    [EST Nov 20 05:42:14] info : Starting Monit 5.17.1 daemon with http interface at [*]:2812
    [EST Nov 20 05:42:14] info : Starting Monit HTTP server at [*]:2812
    [EST Nov 20 05:42:14] info : Monit HTTP server started
    [EST Nov 20 05:42:14] info : 'intlnobkcte05' Monit 5.17.1 started
    [EST Nov 20 05:42:14] error : 'logchipper' process is not running
    [EST Nov 20 05:42:14] info : 'logchipper' trying to restart
    [EST Nov 20 05:42:14] info : 'logchipper' restart: /usr/bin/sudo
    [EST Nov 20 05:42:16] error : 'collectd' process is not running
    [EST Nov 20 05:42:16] info : 'collectd' trying to restart
    [EST Nov 20 05:42:16] info : 'collectd' restart: /usr/bin/sudo
    [EST Nov 20 05:42:33] info : 'logchipper' process is running with pid 1416
    [EST Nov 20 05:42:33] info : 'collectd' process is running with pid 2003
    
  4. Gautam Shejwalkar

    We are also facing the same issue. The monit version is 5.16:

    [UTC Jan 19 17:40:27] info : Starting Monit 5.16 daemon with http interface at [127.0.0.1]:2812
    [UTC Jan 19 17:40:27] info : Starting Monit HTTP server at [127.0.0.1]:2812

  5. Scott Halstead

    We have moved away from pid files entirely. We now use the process matching string checks and all our issues have been resolved.

  6. Gautam Shejwalkar

    Hi Scott, thanks for the input. Can you please give an example of the process matching string checks?

  7. Scott Halstead

    A simple conf example.

    check process logchipper matching "logchipper.*/opt/inf/etc/logchipper.json"
      start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
      restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      if 3 restarts within 6 cycles then unmonitor
    
  8. Jan Semmelink

    So am I right in saying the problem still exists and will not be solved, and users should move away from using simple PID files?

    I face the same issue right now with 5.22.0 using PID files, and no process is running with the PID in the file:

    $ sudo monit --version
    This is Monit version 5.22.0
    Built with ssl, with ipv6, with compression, with pam and with large files
    Copyright (C) 2001-2017 Tildeslash Ltd. All Rights Reserved.

    Log:

    [SAST Oct 9 09:15:29] error : 'etl-ocs-air-etl-file-decoder-exec-00' failed to get service data
    [SAST Oct 9 09:15:29] error : 'archive-occ-etl-file-watcher-00' failed to get service data

    $ sudo monit status etl-ocs-air-etl-file-decoder-exec-00
    Monit 5.22.0 uptime: 2h 13m

    Process 'etl-ocs-air-etl-file-decoder-exec-00'
      status                       OK
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          -
      parent pid                   -
      uid                          -
      effective uid                -
      gid                          -
      uptime                       -
      threads                      -
      children                     -
      cpu                          -
      cpu total                    -
      memory                       -
      memory total                 -
      data collected               Tue, 09 Oct 2018 09:19:36

    $ cat stream/occ/pid/etl-file-watcher.00.pid
    2084
    $ ps -ef | grep 2084
    archive 15512 14860 0 09:20 pts/20 00:00:00 grep 2084

    My monit conf for this process is:

    CHECK PROCESS archive-occ-etl-file-watcher-00 WITH PIDFILE /home/archive/stream/occ/pid/etl-file-watcher.00.pid
      GROUP archive
      GROUP archive-occ
      START PROGRAM = "/bin/bash -c 'source /home/archive/conf/env.sh; /home/archive/etl/libexec/etl-file-watcher -d -i 0 -s occ --out.format="asn.1" 2>&1 | /sbin/cronolog --symlink=/home/archive/stream/occ/log/etl-file-watcher.00.log /home/archive/stream/occ/log/%Y-%m-%d-etl-file-watcher.00.log &'" as uid "archive" and gid "archive"
      STOP PROGRAM = "/bin/bash -c 'kill -s SIGTERM $(cat /home/archive/stream/occ/pid/etl-file-watcher.00.pid)'"

    Everything had been working for at least a month, and I got this just today.

  9. Lutz Mader

    Hello,
    the problem still exists,

    This is Monit version 5.25.2
    Built with ssl, with ipv6, with compression, with pam and with large files
    Copyright (C) 2001-2018 Tildeslash Ltd. All Rights Reserved.

    A stale pid file is not handled properly: the status is “OK”, but “monit status” shows no additional data and the monit log contains a lot of “failed to get service data” messages.

    Unfortunately, the additional “if failed host” tests are not evaluated either, and no restart is initiated.

    Sorry, Lutz

  10. Lutz Mader

    Hello,
    it seems to me that monit sometimes cannot get the status.
    The process exists, but monit cannot get the requested information from the system (AIX, Linux) for one monitored process. For the other eleven processes, monit is able to determine the requested information.

    All messages from the monit.log
    [MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
    [MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data

    The process has been running for 56d 22h 10m; the errors occurred 11h and 6h ago.

    Process 'Serv_0_abc'
      status                       OK
      monitoring status            Monitored
      monitoring mode              active
      on reboot                    start
      pid                          47186188
      parent pid                   1
      uid                          32005
      effective uid                32005
      gid                          10199
      uptime                       56d 22h 10m
      threads                      103
      children                     0
      cpu                          7.2%
      cpu total                    7.2%
      memory                       0.3% [166.9 MB]
      memory total                 0.3% [166.9 MB]
      data collected               Thu, 04 Jul 2019 11:50:13
    

    This looks to me like a temporary problem that only occurs sometimes.
    But for some resources I also find logs with a "failed to get service data" message on every monitoring cycle.

    An ugly problem that occurs on both AIX and Linux,
    Lutz

  11. Dennis Rockwell

    I don’t want to blame the victim here, but putting pidfiles in a tmpfs (commonly /run) makes old pidfiles disappear when the container restarts. Otherwise, manually cleaning out old pidfiles in container startup scripts makes for fewer races like these.

    Dennis
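
    The startup cleanup suggested above could look roughly like the sketch below, run from a container entrypoint before monit starts. The helper name remove_pidfiles and the directories in the usage comment are assumptions, not taken from anyone's setup in this thread.

```shell
#!/bin/sh
# Sketch only: clear leftover pidfiles before any service is started.
# At container start nothing has been launched yet, so every *.pid file
# in these directories comes from a previous run and is stale.

# remove_pidfiles DIR... (hypothetical helper)
remove_pidfiles() {
    for dir in "$@"; do
        [ -d "$dir" ] || continue
        # With -f, rm stays quiet when the glob matches nothing.
        rm -f -- "$dir"/*.pid
    done
}

# Possible use in an entrypoint script (paths are placeholders):
#   remove_pidfiles /var/run/vault /var/run/nginx /var/run/mongodb
#   exec monit -I
```

    Removing the files unconditionally is deliberate: in a freshly started container the old PID may already belong to an unrelated process, so testing whether the PID is alive would not be reliable.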

  12. Lutz Mader

    Hello Dennis,
    the pid file is not the problem. The problem sometimes occurred on an up-and-running system without a restart, and disappeared again without any action on our side.

    All messages from the monit.log
    [MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
    [MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data

    The process has been running for 56d 22h 10m; the errors occurred 11h and 6h ago.

    At present (with 5.27.2 or 5.28.0, and also 5.26.0) I no longer see this problem or the messages, but we also changed the versions of Linux and AIX we use.

    On the other hand, in the past we also sometimes had problems getting system status information for processes and the filesystem under high system workload (cpu usage > 99% and storage > 95%).

    With regards,
    Lutz

  13. Jitan Sahni

    I am having the same problem. The monit version is 5.32.0.
    sudo monit summary says OK, but sudo monit state shows no PID, and the process does not exist on the system. No alert emails are being sent. The monit logfile has “failed to get process data”.

  14. Lutz Mader

    Hello Jitan Sahni,
    please check the system workload and try to get the process information from the /proc filesystem (on a Linux system).

    Monit will gather the process information again at some point. "failed to get process data" is only a temporary problem.

    With regards,
    Lutz
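
    One way to do the /proc check suggested above is sketched below. It assumes a Linux system with /proc mounted; the helper name proc_info is made up for this example.

```shell
#!/bin/sh
# Sketch only: print a few fields from /proc/PID/status on Linux.

# proc_info PID (hypothetical helper)
# Prints the Name, State, Tgid, PPid and Threads lines for PID.
# Returns non-zero if no readable /proc entry exists for that PID.
proc_info() {
    pid=$1
    status=/proc/$pid/status
    if [ ! -r "$status" ]; then
        echo "no /proc entry for pid $pid" >&2
        return 1
    fi
    grep -E '^(Name|State|Tgid|PPid|Threads):' "$status"
}
```

    Running proc_info "$(cat /path/to/service.pid)" (the path is a placeholder) shows whether the kernel knows the PID at all and which program it belongs to.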

  15. Jitan Sahni

    Hi, the load seems fine and it's not a temporary problem. It persists for just one single process; monit is working great for the remaining processes. “monit reload” does not fix the problem. I can restart monit, but I am still trying to collect more facts so I can report something more concrete to help get this fixed.

  16. Lutz Mader

    Hello Jitan Sahni,
    nice to know.

    > Hi, the load seems fine and it's not a temporary problem.

    This is my problem also.

    > it persists just for one single process. monit is working great for the remaining process.

    Unfortunately, I could not collect useful data to find out what is going wrong. On the other hand, the problem is generally fixed by a monit restart.
    From my point of view this is a system problem and depends on the system workload.

    Lutz

  17. Jitan Sahni

    So, restarting monit did not solve the issue. The PID file held the process id ‘2526’ from an old run, and even though there was no system process with id 2526 (I even searched with sudo ps -ef), monit was unable to get process data. The only way I could solve the problem was by manually starting the process and generating a new PID file. Now it is working fine. There is some bug where the PID value 2526 confused monit. Very strange.

    I reverified by manually creating a PID file containing 2526, and monit again reported ‘failed to get process data’.

    If I create a PID file with some other number (I'm not sure which random 4- or 5-digit number I used), it seems to work fine: monit correctly sees that the process does not exist and tries to restart it.

  18. Lutz Mader

    Hello Jitan Sahni,
    no idea.

    I delete the pid file or, if the process is running, change the pid in the file to the right pid, and then stop and start the monit process.

    And sometimes I stop the monit process, delete the "monit.state" file in use and start monit again.

    After that, monit is working again.

    Lutz

    p.s.
    The "monit.state" file is the file defined in the "monitrc" file by the "set statefile" statement.

  19. jeff vines

    This problem is caused by monit treating a thread ID as the unique process id. You can use the command ps -eLf to view the threads on the system.

    I hope the author can fix this problem.
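
    The thread-ID explanation above is not confirmed by the maintainers in this thread, but it can be tested for a given number on Linux. There, /proc/N can be opened for a thread ID N even though such entries are not listed in /proc, and the Tgid field of /proc/N/status names the owning process. The sketch below relies on that; the helper name pid_kind and its output strings are made up for this example.

```shell
#!/bin/sh
# Sketch only (Linux): tell a real PID apart from a thread ID.

# pid_kind N (hypothetical helper)
# Prints:
#   process         - N is the PID of a process (thread-group leader)
#   thread of TGID  - N is only a thread ID inside process TGID
#   none            - the kernel has no task with ID N
pid_kind() {
    n=$1
    status=/proc/$n/status
    [ -r "$status" ] || { echo none; return 0; }

    # Tgid is the PID of the process that owns this task.
    tgid=$(awk '/^Tgid:/ {print $2}' "$status" 2>/dev/null)
    if [ -z "$tgid" ]; then
        echo none            # the task exited while we were reading
    elif [ "$tgid" = "$n" ]; then
        echo process
    else
        echo "thread of $tgid"
    fi
}
```

    If pid_kind 2526 had printed "thread of ..." for the stale value reported earlier in this thread, that would explain why ps -ef showed no such process while the ID was still in use.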

  20. jeff vines

    In most cases we can avoid this problem by using a port check. Could the author add a way to check a process by its TCP/UDP port?
