Monit shows a process is running due to a stale pid file

Issue #484 open

Shrenik created an issue 2016-10-14

I'm facing a similar issue: I run my services in a Docker container that has monit as PID 1 and monitors mongodb, vault and nginx. When I run the container for the first time, everything comes up properly, but when I do a docker stop "mycontainer" followed by docker start "mycontainer", a residual vault pid file from the previous startup makes monit behave oddly: it shows the status of vault as "Running" even though the process doesn't exist. This is the Monit status for vault:

Process 'vault'
  status                            Running
  monitoring status                 Monitored
  pid                               -
  parent pid                        -
  uid                               -
  effective uid                     -
  gid                               -
  uptime                            -
  threads                           -
  children                          -
  memory                            -
  memory total                      -
  memory percent                    -
  memory percent total              -
  cpu percent                       -
  cpu percent total                 -
  data collected                    Mon, 26 Sep 2016 01:06:29

This is a snippet from the monit log file. It seems monit never checked whether the pid in the vault.pid file matches a running process.

[PDT Sep 26 01:15:12] debug    : 'mongodb' process test failed [pid=147] -- No such process
[PDT Sep 26 01:15:12] info     : 'mongodb' start: /bin/bash
[PDT Sep 26 01:15:12] debug    : 'mongodb' started
[PDT Sep 26 01:15:12] info     : 'mongodb' process is running with pid 15
[PDT Sep 26 01:15:12] debug    : 'mongodb' zombie check succeeded
[PDT Sep 26 01:15:19] debug    : 'nginx' process test failed [pid=252] -- No such process
[PDT Sep 26 01:15:19] info     : 'nginx' start: /sbin/start-stop-daemon
[PDT Sep 26 01:15:19] debug    : 'nginx' started
[PDT Sep 26 01:15:40] debug    : 'vault' process is running with pid 225
[PDT Sep 26 01:15:40] debug    : 'vault' zombie check succeeded

The issue occurs sometimes with vault and sometimes with another process, and I have been unable to resolve it.

Comments (26)

Tildeslash repo owner
- changed status to duplicate
Duplicate of ~~#367~~.
- 2016-10-14T07:55:46+00:00
Tildeslash repo owner
hello, this problem was solved in monit 5.18.0 already
- 2016-10-14T07:56:17+00:00

Shrenik reporter

@tildeslash I still ran into the same problem today after updating monit to 5.19.0, but the logs are different this time.


monit status
Monit 5.19.0 uptime: 10m

Process 'mongodb'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          125
  parent pid                   1
  uid                          101
  effective uid                101
  gid                          103
  uptime                       9m
  threads                      28
  children                     0
  cpu                          0.1%
  cpu total                    0.1%
  memory                       0.1% [23.8 MB]
  memory total                 0.1% [23.8 MB]
  data collected               Fri, 14 Oct 2016 15:42:55

Process 'vault'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          31
  parent pid                   1
  uid                          1002
  effective uid                1002
  gid                          4
  uptime                       9m
  threads                      13
  children                     0
  cpu                          0.0%
  cpu total                    0.0%
  memory                       0.0% [7.3 MB]
  memory total                 0.0% [7.3 MB]
  data collected               Fri, 14 Oct 2016 15:42:55

Process 'nginx'
  status                       Running
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          -
  parent pid                   -
  uid                          -
  effective uid                -
  gid                          -
  uptime                       -
  threads                      -
  children                     -
  cpu                          -
  cpu total                    -
  memory                       -
  memory total                 -
  data collected               Fri, 14 Oct 2016 15:42:55

Here is a snippet of the monit logs. Monit errors out at nginx with "failed to get service data" and keeps giving that error throughout. Then why does it show 'Running' in the status? There is a stale pid file of nginx from the previous run. I suspect that may be the cause, but monit should compare the ID in the pid file with the running process IDs and then update the status.

[PDT Oct 14 15:32:52] info     : Starting Monit 5.19.0 daemon with http interface at [localhost]:9016
[PDT Oct 14 15:32:52] info     : 'ise22d1' Monit 5.19.0 started
[PDT Oct 14 15:32:52] error    : 'mongodb' process is not running
[PDT Oct 14 15:32:52] info     : 'mongodb' trying to restart
[PDT Oct 14 15:32:52] info     : 'mongodb' start: /bin/bash
[PDT Oct 14 15:33:22] error    : 'mongodb' failed to start (exit status 0) -- /bin/bash:  * start-stop-daemon: /usr/bin/mongod is already running

[PDT Oct 14 15:33:22] error    : 'vault' process is not running
[PDT Oct 14 15:33:22] info     : 'vault' trying to restart
[PDT Oct 14 15:33:22] info     : 'vault' start: /bin/bash
[PDT Oct 14 15:33:30] info     : 'mongodb' start: /bin/bash
[PDT Oct 14 15:33:30] info     : 'mongodb' started
[PDT Oct 14 15:33:30] info     : 'mongodb' process is running with pid 125
[PDT Oct 14 15:33:30] info     : 'vault' process is running with pid 31
[PDT Oct 14 15:33:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:34:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:34:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:35:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:35:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:36:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:36:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:37:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:37:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:38:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:38:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:39:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:39:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:40:25] error    : 'nginx' failed to get service data
[PDT Oct 14 15:40:55] error    : 'nginx' failed to get service data
[PDT Oct 14 15:41:25] error    : 'nginx' failed to get service data

I think this is similar to Issue #151. I believe it's quite critical in terms of monit's orchestration mechanism.

2016-10-14T22:52:12+00:00

Shrenik reporter
- changed status to open
Still faced the same issue after upgrading to 5.19.0
- 2016-10-14T22:53:10+00:00
Tildeslash repo owner
Monit reads the PID from the pidfile and checks whether the process is running by searching the process tree for a matching PID. It seems the matching PID is present, but with no statistics data.

Please can you run monit in debug mode? (monit -vI) and send output + attach output of "ps -ef".

Can you replicate the problem? (i.e. provide steps which can trigger the situation)?

P.S. Note that monit also supports monitoring by process pattern (using "check process <name> matching <pattern>") ... this method doesn't need a pidfile.
- 2016-10-17T09:35:30+00:00
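
The pidfile check described above (read the PID, then look for a matching process) can be approximated outside Monit with a few lines of shell. This is only a sketch, not Monit's implementation: pidfile_state is a hypothetical helper name, and the /proc fallback assumes Linux.

```shell
#!/bin/sh
# pidfile_state FILE: print "missing", "invalid", "stale" or "alive".
# Hypothetical helper for debugging; not part of Monit.
pidfile_state() {
    pidfile=$1
    [ -f "$pidfile" ] || { echo missing; return; }
    # Take the first line and strip all whitespace.
    pid=$(head -n 1 "$pidfile" | tr -d '[:space:]')
    case $pid in
        # Empty, zero, or containing a non-digit: not a usable PID.
        ''|0|*[!0-9]*) echo invalid; return ;;
    esac
    # kill -0 delivers no signal; it only reports whether the PID can be signalled.
    # It also fails for processes owned by other users, hence the /proc fallback.
    if kill -0 "$pid" 2>/dev/null || [ -d "/proc/$pid" ]; then
        echo alive
    else
        echo stale
    fi
}
```

Running pidfile_state against a service's pidfile while Monit reports "Running" with empty statistics shows whether the file still points at a live PID.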

Scott Halstead

I see this too, periodically. It appears to happen when the machine is undergoing maintenance and is bounced multiple times within a short window. In the case below, aggrocrag is not restarted at 05:42.

[EST Nov 20 03:08:08] info : Monit HTTP server started
[EST Nov 20 03:08:08] info : 'intlnobkcte05' Monit 5.17.1 started
[EST Nov 20 03:08:08] error : 'logchipper' process is not running
[EST Nov 20 03:08:08] info : 'logchipper' trying to restart
[EST Nov 20 03:08:08] info : 'logchipper' restart: /usr/bin/sudo
[EST Nov 20 03:08:11] error : 'collectd' process is not running
[EST Nov 20 03:08:11] info : 'collectd' trying to restart
[EST Nov 20 03:08:11] info : 'collectd' restart: /usr/bin/sudo
[EST Nov 20 03:08:13] error : 'aggrocrag' process is not running
[EST Nov 20 03:08:13] info : 'aggrocrag' trying to restart
[EST Nov 20 03:08:13] info : 'aggrocrag' restart: /usr/bin/sudo
[EST Nov 20 03:08:30] info : 'logchipper' process is running with pid 1407
[EST Nov 20 03:08:30] info : 'collectd' process is running with pid 1966
[EST Nov 20 03:08:30] info : 'aggrocrag' process is running with pid 2014
[EST Nov 20 04:28:16] info : Shutting down Monit HTTP server
[EST Nov 20 04:28:16] info : Monit HTTP server stopped
[EST Nov 20 04:28:16] info : Monit daemon with pid [1319] stopped
[EST Nov 20 04:28:16] info : 'intlnobkcte05' Monit 5.17.1 stopped
[EST Nov 20 05:42:14] info : Starting Monit 5.17.1 daemon with http interface at [*]:2812
[EST Nov 20 05:42:14] info : Starting Monit HTTP server at [*]:2812
[EST Nov 20 05:42:14] info : Monit HTTP server started
[EST Nov 20 05:42:14] info : 'intlnobkcte05' Monit 5.17.1 started
[EST Nov 20 05:42:14] error : 'logchipper' process is not running
[EST Nov 20 05:42:14] info : 'logchipper' trying to restart
[EST Nov 20 05:42:14] info : 'logchipper' restart: /usr/bin/sudo
[EST Nov 20 05:42:16] error : 'collectd' process is not running
[EST Nov 20 05:42:16] info : 'collectd' trying to restart
[EST Nov 20 05:42:16] info : 'collectd' restart: /usr/bin/sudo
[EST Nov 20 05:42:33] info : 'logchipper' process is running with pid 1416
[EST Nov 20 05:42:33] info : 'collectd' process is running with pid 2003

2016-12-01T19:54:12+00:00

Gautam Shejwalkar
Is this issue seen when the system is under load?
- 2017-01-23T15:41:43+00:00
Gautam Shejwalkar
We are also facing the same issue; the monit version is 5.16.
[UTC Jan 19 17:40:27] info : Starting Monit 5.16 daemon with http interface at [127.0.0.1]:2812
[UTC Jan 19 17:40:27] info : Starting Monit HTTP server at [127.0.0.1]:2812
- 2017-01-23T15:42:31+00:00
Scott Halstead
We have moved away from pid files entirely. We now use the process matching string checks and all our issues have been resolved.
- 2017-01-23T16:01:47+00:00
Gautam Shejwalkar
Hi Scott, thanks for the input. Can you please give an example of the process matching string checks?
- 2017-01-24T13:00:24+00:00

Scott Halstead

A simple conf example.

check process logchipper matching "logchipper.*/opt/inf/etc/logchipper.json"
  start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
  restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  if 3 restarts within 6 cycles then unmonitor

2017-01-24T13:12:30+00:00

Gautam Shejwalkar
Thanks for the input Scott.
- 2017-01-24T15:28:46+00:00
Jan Semmelink
So am I right in saying the problem still exists and will not be solved, and users should move away from using simple PID files?

Cause I face the same issue right now, with 5.22.0 using PID files, and no process is running with the PID in the file:

$ sudo monit --version
This is Monit version 5.22.0
Built with ssl, with ipv6, with compression, with pam and with large files
Copyright (C) 2001-2017 Tildeslash Ltd. All Rights Reserved.

Log:
[SAST Oct 9 09:15:29] error : 'etl-ocs-air-etl-file-decoder-exec-00' failed to get service data
[SAST Oct 9 09:15:29] error : 'archive-occ-etl-file-watcher-00' failed to get service data

$ sudo monit status etl-ocs-air-etl-file-decoder-exec-00
Monit 5.22.0 uptime: 2h 13m

Process 'etl-ocs-air-etl-file-decoder-exec-00'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          -
  parent pid                   -
  uid                          -
  effective uid                -
  gid                          -
  uptime                       -
  threads                      -
  children                     -
  cpu                          -
  cpu total                    -
  memory                       -
  memory total                 -
  data collected               Tue, 09 Oct 2018 09:19:36

$ cat stream/occ/pid/etl-file-watcher.00.pid
2084
$ ps -ef | grep 2084
archive 15512 14860 0 09:20 pts/20 00:00:00 grep 2084

My monit conf for this process is:

CHECK PROCESS archive-occ-etl-file-watcher-00 WITH PIDFILE /home/archive/stream/occ/pid/etl-file-watcher.00.pid
  GROUP archive
  GROUP archive-occ
  START PROGRAM = "/bin/bash -c 'source /home/archive/conf/env.sh; /home/archive/etl/libexec/etl-file-watcher -d -i 0 -s occ --out.format="asn.1" 2>&1 | /sbin/cronolog --symlink=/home/archive/stream/occ/log/etl-file-watcher.00.log /home/archive/stream/occ/log/%Y-%m-%d-etl-file-watcher.00.log &'"
    as uid "archive" and gid "archive"
  STOP PROGRAM = "/bin/bash -c 'kill -s SIGTERM $(cat /home/archive/stream/occ/pid/etl-file-watcher.00.pid)'"

Everything had been working for at least a month, and I got this just today.
- 2018-10-09T07:20:52+00:00
Lutz Mader
Hello,
the problem still exists,

This is Monit version 5.25.2
Built with ssl, with ipv6, with compression, with pam and with large files
Copyright (C) 2001-2018 Tildeslash Ltd. All Rights Reserved.

A stale pid file is not handled properly: the status is "OK", but no additional data is available with "monit status", and the monit log contains a lot of "failed to get service data" messages.

Unfortunately, the additional "if failed host" tests are not handled either, and no restart is initiated.

Sorry, Lutz
- 2019-07-03T16:22:03+00:00

Lutz Mader

Hello,
it seems to me that monit sometimes cannot get the status.
The process is available, but monit cannot get the requested information from the system (AIX, Linux) for one monitored process; for the other eleven processes, monit seems able to determine the requested information.

All messages from the monit.log
[MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
[MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data

The process has been up for 56d 22h 10m; the errors occurred 11h and 6h ago.

Process 'Serv_0_abc'
  status                       OK
  monitoring status            Monitored
  monitoring mode              active
  on reboot                    start
  pid                          47186188
  parent pid                   1
  uid                          32005
  effective uid                32005
  gid                          10199
  uptime                       56d 22h 10m
  threads                      103
  children                     0
  cpu                          7.2%
  cpu total                    7.2%
  memory                       0.3% [166.9 MB]
  memory total                 0.3% [166.9 MB]
  data collected               Thu, 04 Jul 2019 11:50:13

It seems to me to be a temporary problem that occurs only sometimes.
But for some resources I also find logs with the "failed to get service data" message in every monitor cycle.

An ugly problem occurring on AIX and Linux,
Lutz

2019-07-04T10:31:15+00:00

Dennis Rockwell
I don’t want to blame the victim here, but putting pidfiles in a tmpfs (commonly /run) makes old pidfiles disappear when the container restarts. Otherwise, manually cleaning out old pidfiles in container startup scripts makes for fewer races like these.

Dennis
- 2021-05-18T20:51:27+00:00
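
The startup cleanup suggested above could look roughly like the following. It is a sketch under assumptions: purge_pidfiles is a hypothetical helper, the directories in the usage comment are placeholders, and it presumes the script runs as the container entrypoint before any service has started, so every pidfile it finds is left over from the previous run.

```shell
#!/bin/sh
# purge_pidfiles DIR...: remove every *.pid file in the given directories.
# Hypothetical helper; only safe before any monitored service has been started.
purge_pidfiles() {
    for dir in "$@"; do
        for f in "$dir"/*.pid; do
            # An unmatched glob stays literal, so skip anything that is not a file.
            [ -f "$f" ] || continue
            rm -f -- "$f"
        done
    done
}

# Hypothetical entrypoint usage (paths are placeholders):
#   purge_pidfiles /var/run/vault /var/run/nginx
#   exec monit -I
```

Removing the files unconditionally, instead of testing each PID, avoids the case where a low PID from the previous run has already been reused by an unrelated process in the restarted container.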
Lutz Mader
Hello Dennis,
the pid file is not the problem; the problem sometimes occurred on an up-and-running system without a restart, and disappeared again without any intervention.

All messages from the monit.log
[MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
[MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data

The process has been up for 56d 22h 10m; the errors occurred 11h and 6h ago.

At present (with 5.27.2 or 5.28.0, and also 5.26.0) I can no longer find these problems or messages, but we have also changed the versions of Linux and AIX in use.

On the other hand, in the past we also sometimes had problems with system status information for processes and the filesystem under high system workload (CPU usage > 99% and storage > 95%).

With regards,
Lutz
- 2021-07-11T09:16:13+00:00
Jitan Sahni
I am having the same problem; the monit version is 5.32.0.
sudo monit summary says OK, but sudo monit state shows no PID, and the process does not exist on the system. No alert emails are being sent. The monit logfile has "failed to get process data".
- 2023-01-08T15:31:19+00:00
Lutz Mader
Hello Jitan Sahni,
check the system workload and try to get the process information from the /proc filesystem (on a Linux system), please.

Monit will sometimes gather the process information again later; "failed to get process data" is only a temporary problem.

With regards,
Lutz
- 2023-01-08T16:32:55+00:00
Jitan Sahni
Hi, the load seems fine and it's not a temporary problem. It persists for just one single process; monit is working great for the remaining processes. "monit reload" does not fix the problem. I can restart monit, but I am still trying to collect more facts so I can report something more concrete to help get this fixed.
- 2023-01-08T16:52:41+00:00
Lutz Mader
Hello Jitan Sahni,
nice to know.

> Hi, the load seems fine and it's not a temporary problem.

This is my problem also.

> it persists just for one single process. monit is working great for the remaining process.

Unfortunately, I could not collect useful data to find out what is going wrong. On the other hand, in general, the problem is fixed by a monit restart.
From my point of view this is a system problem and depends on the system workload.

Lutz
- 2023-01-08T17:42:13+00:00
Jitan Sahni
So the restart of monit did not solve the issue. The PID file had a process ID '2526' from old runs, and even though there was no system process with ID 2526 (I even searched with sudo ps -ef), monit was unable to get process data. The only way I could solve the problem was by manually starting the process and generating a new PID file. Now it is working fine. There is some bug where the PID value 2526 confused monit. Very strange.

I reverified by manually creating a PID file with 2526, and monit again reported 'failed to get process data'.

If I create a PID file with some other number (not sure what random 4- or 5-digit number I used), it seems to work fine: monit correctly sees that the process does not exist and tries to restart it.
- 2023-01-08T17:54:04+00:00
Lutz Mader
Hello Jitan Sahni,
no idea.

I delete the pid file, or change the pid in the file to the right pid if the process is running, and then stop and start the monit process.

And sometimes I stop the monit process, delete the used "monit.state" file and start monit again.

And monit is running/working again.

Lutz

p.s.
The "monit.state" file is the file defined in the "monitrc" file by the "set statefile" statement.
- 2023-01-08T20:20:18+00:00
jeff vines
This problem is caused by monit using the thread ID as the unique process ID. You can use the command ps -eLf to view the threads on the system.

I hope the author can fix this problem.
- 2023-03-06T10:42:10+00:00
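
One way to test the thread ID explanation on Linux is to look up the number from the pidfile in /proc and compare its Tgid and Pid fields: a thread that is not the main thread reports a Tgid different from its own Pid, and ps -ef does not list it as a process. The helper name id_kind is hypothetical and the sketch assumes a Linux /proc; it says nothing about how Monit itself resolves PIDs.

```shell
#!/bin/sh
# id_kind ID: print "process", "thread" or "none" for a numeric ID (Linux only).
# Hypothetical helper; it reads /proc/ID/status and compares Tgid with Pid.
id_kind() {
    status="/proc/$1/status"
    # No readable status file: nothing on the system currently owns this ID.
    [ -r "$status" ] || { echo none; return; }
    tgid=$(awk '$1 == "Tgid:" { print $2 }' "$status")
    pid=$(awk '$1 == "Pid:" { print $2 }' "$status")
    if [ "$tgid" = "$pid" ]; then
        echo process    # thread group leader, i.e. what ps -ef lists
    else
        echo thread     # a secondary thread belonging to process $tgid
    fi
}
```

If id_kind prints "thread" for the number in a stale pidfile, the ID has been reused as a thread of some other process, which would match a situation where ps -ef finds nothing for that number.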
jeff vines
In most cases, we can avoid this problem by using a port check. Can the author add a check process with a tcp/udp port?
- 2023-03-06T10:46:51+00:00
Lutz Mader
Hello,
a connection test is available to test ports,
see https://www.mmonit.com/monit/documentation/monit.html#CONNECTION-TESTS

You can use the connection test with check host,
see https://www.mmonit.com/monit/documentation/monit.html#Remote-host
or with the check process,
see https://www.mmonit.com/monit/documentation/monit.html#Process
as an additional test.

With regards,
Lutz
- 2023-03-06T14:36:12+00:00

Assignee: Tildeslash

Type: bug

Priority: blocker

Status: open

Component: Monit

Version: 5.17.1

Votes: 4

Watchers: 9