- changed status to duplicate
Monit shows a process is running due to a stale pid file
I'm facing a similar issue. I run my services in a Docker container that has monit as PID 1, monitoring mongodb, vault and nginx. When I run the container for the first time, everything comes up properly. But when I do a docker stop "mycontainer" followed by docker start "mycontainer", a residual vault pid file from the previous startup makes monit behave oddly: it shows the status of vault as "Running" even though the process doesn't exist. This is the monit status for vault:
Process 'vault'
status Running
monitoring status Monitored
pid -
parent pid -
uid -
effective uid -
gid -
uptime -
threads -
children -
memory -
memory total -
memory percent -
memory percent total -
cpu percent -
cpu percent total -
data collected Mon, 26 Sep 2016 01:06:29
This is a snippet from the monit log file. It seems monit didn't check at all whether the pid in the vault.pid file matches a running process.
[PDT Sep 26 01:15:12] debug : 'mongodb' process test failed [pid=147] -- No such process
[PDT Sep 26 01:15:12] info : 'mongodb' start: /bin/bash
[PDT Sep 26 01:15:12] debug : 'mongodb' started
[PDT Sep 26 01:15:12] info : 'mongodb' process is running with pid 15
[PDT Sep 26 01:15:12] debug : 'mongodb' zombie check succeeded
[PDT Sep 26 01:15:19] debug : 'nginx' process test failed [pid=252] -- No such process
[PDT Sep 26 01:15:19] info : 'nginx' start: /sbin/start-stop-daemon
[PDT Sep 26 01:15:19] debug : 'nginx' started
[PDT Sep 26 01:15:40] debug : 'vault' process is running with pid 225
[PDT Sep 26 01:15:40] debug : 'vault' zombie check succeeded
This issue sometimes occurs with vault and sometimes with another process, and I'm unable to resolve it.
Comments (26)
-
repo owner -
repo owner Hello, this problem was already solved in monit 5.18.0.
-
reporter @tildeslash I ran into the same problem again today after updating monit to 5.19.0, but the logs are different this time.
monit status
Monit 5.19.0 uptime: 10m
Process 'mongodb'
status Running
monitoring status Monitored
monitoring mode active
on reboot start
pid 125
parent pid 1
uid 101
effective uid 101
gid 103
uptime 9m
threads 28
children 0
cpu 0.1%
cpu total 0.1%
memory 0.1% [23.8 MB]
memory total 0.1% [23.8 MB]
data collected Fri, 14 Oct 2016 15:42:55
Process 'vault'
status Running
monitoring status Monitored
monitoring mode active
on reboot start
pid 31
parent pid 1
uid 1002
effective uid 1002
gid 4
uptime 9m
threads 13
children 0
cpu 0.0%
cpu total 0.0%
memory 0.0% [7.3 MB]
memory total 0.0% [7.3 MB]
data collected Fri, 14 Oct 2016 15:42:55
Process 'nginx'
status Running
monitoring status Monitored
monitoring mode active
on reboot start
pid -
parent pid -
uid -
effective uid -
gid -
uptime -
threads -
children -
cpu -
cpu total -
memory -
memory total -
data collected Fri, 14 Oct 2016 15:42:55
Here is a snippet of the monit logs. Monit errors out at nginx with "failed to get service data" and keeps repeating that error. Why, then, does it show "Running" in the status? There is a stale nginx pid file from the previous run. I suspect that may be the cause, but monit should compare the id in the pid file against the running processes and then update the status.
[PDT Oct 14 15:32:52] info : Starting Monit 5.19.0 daemon with http interface at [localhost]:9016
[PDT Oct 14 15:32:52] info : 'ise22d1' Monit 5.19.0 started
[PDT Oct 14 15:32:52] error : 'mongodb' process is not running
[PDT Oct 14 15:32:52] info : 'mongodb' trying to restart
[PDT Oct 14 15:32:52] info : 'mongodb' start: /bin/bash
[PDT Oct 14 15:33:22] error : 'mongodb' failed to start (exit status 0) -- /bin/bash: * start-stop-daemon: /usr/bin/mongod is already running
[PDT Oct 14 15:33:22] error : 'vault' process is not running
[PDT Oct 14 15:33:22] info : 'vault' trying to restart
[PDT Oct 14 15:33:22] info : 'vault' start: /bin/bash
[PDT Oct 14 15:33:30] info : 'mongodb' start: /bin/bash
[PDT Oct 14 15:33:30] info : 'mongodb' started
[PDT Oct 14 15:33:30] info : 'mongodb' process is running with pid 125
[PDT Oct 14 15:33:30] info : 'vault' process is running with pid 31
[PDT Oct 14 15:33:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:34:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:34:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:35:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:35:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:36:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:36:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:37:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:37:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:38:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:38:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:39:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:39:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:40:25] error : 'nginx' failed to get service data
[PDT Oct 14 15:40:55] error : 'nginx' failed to get service data
[PDT Oct 14 15:41:25] error : 'nginx' failed to get service data
I think this is similar to Issue #151. I believe it's quite critical for monit's orchestration mechanism.
-
reporter - changed status to open
I still faced the same issue after upgrading to 5.19.0.
-
repo owner Monit reads the PID from the pidfile and checks whether the process is running by searching the process tree for a matching PID. It seems the matching PID is present, but with no statistics data.
Can you please run monit in debug mode (monit -vI) and send the output, and attach the output of "ps -ef"?
Can you replicate the problem (i.e. provide steps that trigger the situation)?
P.S. Note that monit also supports monitoring by process pattern (using "check process <name> matching <pattern>") ... this method doesn't need a pidfile.
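As a rough illustration of the pidfile check described above, here is a minimal shell sketch that decides whether a pidfile is stale on Linux. The function name pidfile_is_stale is hypothetical and this is not monit's actual implementation; it only assumes that a live process has a /proc/<pid> directory.

```shell
#!/bin/sh
# Hypothetical helper, not part of monit: report whether a pidfile is stale.
# Returns 0 (stale) when the file is missing, empty, non-numeric, or names
# a PID that has no entry under /proc. Returns 1 when /proc/<pid> exists.
# Assumption: Linux with /proc mounted.
pidfile_is_stale() {
    pidfile=$1
    # A missing or unreadable pidfile is treated as stale.
    [ -r "$pidfile" ] || return 0
    # Strip whitespace and newlines around the number.
    pid=$(tr -d '[:space:]' < "$pidfile")
    # Anything that is not a plain decimal number is treated as stale.
    case $pid in
        ''|*[!0-9]*) return 0 ;;
    esac
    # A live PID has a directory under /proc.
    [ -d "/proc/$pid" ] && return 1
    return 0
}
```

For example, `pidfile_is_stale /var/run/nginx.pid && echo stale` (the path is only an example) prints "stale" when nothing with that PID exists. Note that a /proc entry only proves that something owns that ID, not that it is the service you expect.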
-
I see this too, periodically. It appears to happen when the machine is undergoing maintenance and is bounced multiple times within a short window. In the case below, aggrocrag is not restarted at 05:42:
[EST Nov 20 03:08:08] info : Monit HTTP server started
[EST Nov 20 03:08:08] info : 'intlnobkcte05' Monit 5.17.1 started
[EST Nov 20 03:08:08] error : 'logchipper' process is not running
[EST Nov 20 03:08:08] info : 'logchipper' trying to restart
[EST Nov 20 03:08:08] info : 'logchipper' restart: /usr/bin/sudo
[EST Nov 20 03:08:11] error : 'collectd' process is not running
[EST Nov 20 03:08:11] info : 'collectd' trying to restart
[EST Nov 20 03:08:11] info : 'collectd' restart: /usr/bin/sudo
[EST Nov 20 03:08:13] error : 'aggrocrag' process is not running
[EST Nov 20 03:08:13] info : 'aggrocrag' trying to restart
[EST Nov 20 03:08:13] info : 'aggrocrag' restart: /usr/bin/sudo
[EST Nov 20 03:08:30] info : 'logchipper' process is running with pid 1407
[EST Nov 20 03:08:30] info : 'collectd' process is running with pid 1966
[EST Nov 20 03:08:30] info : 'aggrocrag' process is running with pid 2014
[EST Nov 20 04:28:16] info : Shutting down Monit HTTP server
[EST Nov 20 04:28:16] info : Monit HTTP server stopped
[EST Nov 20 04:28:16] info : Monit daemon with pid [1319] stopped
[EST Nov 20 04:28:16] info : 'intlnobkcte05' Monit 5.17.1 stopped
[EST Nov 20 05:42:14] info : Starting Monit 5.17.1 daemon with http interface at [*]:2812
[EST Nov 20 05:42:14] info : Starting Monit HTTP server at [*]:2812
[EST Nov 20 05:42:14] info : Monit HTTP server started
[EST Nov 20 05:42:14] info : 'intlnobkcte05' Monit 5.17.1 started
[EST Nov 20 05:42:14] error : 'logchipper' process is not running
[EST Nov 20 05:42:14] info : 'logchipper' trying to restart
[EST Nov 20 05:42:14] info : 'logchipper' restart: /usr/bin/sudo
[EST Nov 20 05:42:16] error : 'collectd' process is not running
[EST Nov 20 05:42:16] info : 'collectd' trying to restart
[EST Nov 20 05:42:16] info : 'collectd' restart: /usr/bin/sudo
[EST Nov 20 05:42:33] info : 'logchipper' process is running with pid 1416
[EST Nov 20 05:42:33] info : 'collectd' process is running with pid 2003
-
Is this issue seen when the system is under load?
-
We are also facing the same issue. The monit version is 5.16:
[UTC Jan 19 17:40:27] info : Starting Monit 5.16 daemon with http interface at [127.0.0.1]:2812
[UTC Jan 19 17:40:27] info : Starting Monit HTTP server at [127.0.0.1]:2812
-
We have moved away from pid files entirely. We now use the process matching string checks and all our issues have been resolved.
-
Hi Scott, thanks for the input. Can you please give examples of process matching string checks?
-
A simple conf example.
check process logchipper matching "logchipper.*/opt/inf/etc/logchipper.json"
  start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
  restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
  if 3 restarts within 6 cycles then unmonitor
-
Thanks for the input Scott.
-
So am I right in saying the problem still exists and will not be solved, and users should move away from using simple PID files?
Because I face the same issue right now with 5.22.0 using PID files, and no process is running with the PID in the file:
$ sudo monit --version
This is Monit version 5.22.0
Built with ssl, with ipv6, with compression, with pam and with large files
Copyright (C) 2001-2017 Tildeslash Ltd. All Rights Reserved.
Log:
[SAST Oct 9 09:15:29] error : 'etl-ocs-air-etl-file-decoder-exec-00' failed to get service data
[SAST Oct 9 09:15:29] error : 'archive-occ-etl-file-watcher-00' failed to get service data
$ sudo monit status etl-ocs-air-etl-file-decoder-exec-00
Monit 5.22.0 uptime: 2h 13m
Process 'etl-ocs-air-etl-file-decoder-exec-00'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
pid -
parent pid -
uid -
effective uid -
gid -
uptime -
threads -
children -
cpu -
cpu total -
memory -
memory total -
data collected Tue, 09 Oct 2018 09:19:36
$ cat stream/occ/pid/etl-file-watcher.00.pid
2084
$ ps -ef | grep 2084
archive 15512 14860 0 09:20 pts/20 00:00:00 grep 2084
My monit conf for this process is:
CHECK PROCESS archive-occ-etl-file-watcher-00 WITH PIDFILE /home/archive/stream/occ/pid/etl-file-watcher.00.pid
  GROUP archive
  GROUP archive-occ
  START PROGRAM = "/bin/bash -c 'source /home/archive/conf/env.sh; /home/archive/etl/libexec/etl-file-watcher -d -i 0 -s occ --out.format="asn.1" 2>&1 | /sbin/cronolog --symlink=/home/archive/stream/occ/log/etl-file-watcher.00.log /home/archive/stream/occ/log/%Y-%m-%d-etl-file-watcher.00.log &'" as uid "archive" and gid "archive"
  STOP PROGRAM = "/bin/bash -c 'kill -s SIGTERM $(cat /home/archive/stream/occ/pid/etl-file-watcher.00.pid)'"
Everything had been working for at least a month, and I got this just today.
-
Hello,
the problem still exists.
This is Monit version 5.25.2
Built with ssl, with ipv6, with compression, with pam and with large files
Copyright (C) 2001-2018 Tildeslash Ltd. All Rights Reserved.
A stale pid file is not handled properly: the status is “OK”, but “monit status” shows no additional data and the monit log contains many “failed to get service data” messages.
Unfortunately, the additional “if failed host” tests are not handled either, and no restart is initiated.
Sorry, Lutz
-
Hello,
it seems to me that monit sometimes cannot get the status.
The process is available, but monit cannot get the requested information from the system (AIX, Linux) for one monitored process; for the other eleven processes, monit seems to be able to determine the requested information.
All messages from the monit.log:
[MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
[MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data
The process has been available for 56d 22h 10m; the errors occurred 11h and 6h ago.
Process 'Serv_0_abc'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
pid 47186188
parent pid 1
uid 32005
effective uid 32005
gid 10199
uptime 56d 22h 10m
threads 103
children 0
cpu 7.2%
cpu total 7.2%
memory 0.3% [166.9 MB]
memory total 0.3% [166.9 MB]
data collected Thu, 04 Jul 2019 11:50:13
This seems to me to be a temporary problem that occurs only sometimes.
But for some resources I also find "failed to get service data" messages in the logs every monitor cycle.
An ugly problem occurring on AIX and Linux,
Lutz
-
I don’t want to blame the victim here, but putting pidfiles in a tmpfs (commonly /run) makes old pidfiles disappear when the container restarts. Otherwise, manually cleaning out old pidfiles in container startup scripts makes for fewer races like these.
Dennis
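The cleanup suggested above could look roughly like the following container entrypoint fragment. This is a sketch under assumptions: the function name clean_pidfiles and the pidfile paths in the usage comment are hypothetical and must be replaced with whatever your monitrc actually references.

```shell
#!/bin/sh
# Hypothetical entrypoint fragment: remove leftover pidfiles from a previous
# container run before monit starts, so it never sees a stale PID.
clean_pidfiles() {
    for f in "$@"; do
        # -f keeps this quiet when the file is already gone.
        rm -f -- "$f"
    done
    return 0
}

# Example usage (paths are assumptions, adjust to your own monitrc):
#   clean_pidfiles /var/run/vault.pid /var/run/nginx.pid /var/run/mongodb.pid
#   exec monit -I    # then run monit in the foreground as PID 1
```

Keeping the pidfiles on a tmpfs, as mentioned above, gives the same effect without a script, because the files do not survive a container restart.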
-
Hello Dennis,
the pid file is not the problem. The problem sometimes occurred on an up-and-running system without a restart, and disappeared without any action on our part.
All messages from the monit.log:
[MESZ Jul 4 00:51:05] error : 'Serv_0_abc' failed to get service data
[MESZ Jul 4 05:51:36] error : 'Serv_0_abc' failed to get service data
The process has been available for 56d 22h 10m; the errors occurred 11h and 6h ago.
At the moment (with 5.27.2 or 5.28.0, and also 5.26.0) I cannot find these problems or messages, but we also changed the versions of Linux and AIX in use.
On the other hand, in the past we sometimes also had problems with system status information for processes and the filesystem at high system workload (cpu usage > 99% and storage > 95%).
With regards,
Lutz
-
I am having the same problem; the monit version is 5.32.0.
sudo monit summary says OK, but sudo monit state shows no PID, and the process does not exist on the system. No alert emails are being sent. The monit logfile has “failed to get process data”.
-
Hello Jitan Sahni,
please check the system workload and try to get the process information from the /proc filesystem (on a Linux system).
Monit will sometimes gather the process information again. "failed to get process data" is only a temporary problem.
With regards,
Lutz
-
Hi, the load seems fine and it's not a temporary problem. It persists for just one single process; monit is working great for the remaining processes. “monit reload” does not fix the problem. I can restart monit, but I am still trying to collect more facts so I can report something more concrete to help this get fixed.
-
Hello Jitan Sahni,
nice to know.
"Hi, the load seems fine and it's not a temporary problem."
This is my problem also.
"it persists just for one single process. monit is working great for the remaining process."
Unfortunately, I could not collect useful data to find out what is going wrong. On the other hand, in general, the problem is fixed by a monit restart.
From my point of view this is a system problem and depends on the system workload.
Lutz
-
So the restart of monit did not solve the issue. The PID file had a process id ‘2526’ from old runs, and even though there was no system process with id 2526 (I even searched with sudo ps -ef), monit was unable to get process data. The only way I could solve the problem was by manually starting the process and generating a new PID file. Now it is working fine. There is some bug where the PID value 2526 confused monit. Very strange.
I reverified by manually creating a PID file with 2526, and monit again reported ‘failed to get process data’.
If I create a PID file with some other number (not sure which random 4- or 5-digit number I used), it seems to work fine: monit correctly sees that the process does not exist and tries to restart it.
-
Hello Jitan Sahni,
no idea.
I delete the pid file, or change the pid in the file to the right pid if the process is running, and stop/start the monit process.
And sometimes I stop the monit process, delete the "monit.state" file in use and start monit again.
And monit is running/working again.
Lutz
p.s.
The "monit.state" file is the file defined in the "monitrc" file by the "set statefile" statement.
-
This problem is caused by monit using the thread ID as the unique process id. You can use the command ps -eLf to view the threads on the system.
I hope the author can fix this problem.
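The thread-ID explanation above is not confirmed elsewhere in this thread, but it is easy to test on Linux: the kernel exposes /proc/<tid> for thread IDs as well, even though they do not appear in ps -ef. The sketch below classifies a number from a pidfile by comparing it with the Tgid field of /proc/<id>/status. The function name pid_kind is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: classify a numeric ID found in a pidfile.
# Prints "process" if it is a process (thread group leader), "thread" if it
# is only a thread of some other process, or "none" if nothing has that ID.
# Assumption: Linux, where /proc/<id>/status carries a "Tgid:" line.
pid_kind() {
    id=$1
    status="/proc/$id/status"
    if [ ! -r "$status" ]; then
        echo none
        return 0
    fi
    # Tgid is the ID of the process that owns this task.
    tgid=$(awk '/^Tgid:/ { print $2 }' "$status" 2>/dev/null)
    if [ -z "$tgid" ]; then
        # The task vanished between the two checks.
        echo none
    elif [ "$tgid" = "$id" ]; then
        echo process
    else
        echo thread
    fi
}
```

Running `pid_kind 2526` while the problem is present would show whether the stale number from the earlier comment collides with a thread of an unrelated process ("thread") or with nothing at all ("none").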
-
In most cases, we can avoid this problem by using a port check. Can the author add a check process with a tcp/udp port?
-
Hello,
a connection test is available to test ports,
see https://www.mmonit.com/monit/documentation/monit.html#CONNECTION-TESTS
You can use the connection test with check host,
see https://www.mmonit.com/monit/documentation/monit.html#Remote-host
or with the check process,
see https://www.mmonit.com/monit/documentation/monit.html#Process
as an additional test.
With regards,
Lutz
Duplicate of #367.