Monit thinks program is running when defunct pid file is present
I have monit checking several hundred tomcat processes. We rebooted the server for patching and several of the JVMs didn't start, because the old, stale catalina.pid file was still present after the reboot. Here is a sample output of the issue:
Process 'tomcat-inigral-sso'
status Running
monitoring status Monitored
pid 6022
parent pid 0
uid -1
effective uid -1
gid -1
uptime 0m
children 0
memory 0.0 B
memory total 0.0 B
memory percent 0.0%
memory percent total 0.0%
cpu percent 0.0%
cpu percent total 0.0%
port response time 0.000s to localhost:58097 [HTTP via TCP]
data collected Thu, 05 Feb 2015 10:30:09
If I remove the catalina.pid file, monit picks up that the process isn't running and restarts the service as intended.
[MST Feb 5 10:37:13] error : 'tomcat-inigral-sso' process is not running
[MST Feb 5 10:37:13] info : 'tomcat-inigral-sso' trying to restart
[MST Feb 5 10:37:13] info : 'tomcat-inigral-sso' restart: /usr/local/tomcat/inigral-sso/tomcat-inigral-sso
My check looks like this:
# *** ANSIBLE MANAGED - DO NOT EDIT BY HAND ***
#
check process tomcat-inigral-sso with pidfile /usr/local/tomcat/inigral-sso/catalina.pid
  not every "* 3-6 * * *"
  start program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso start"
  stop program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso stop"
  restart program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso restart"
  if failed port 58097 and protocol http retry 6
    then restart
  group tomcat

check file tomcat-inigral-sso-logs with path /usr/local/tomcat/inigral-sso/logs/catalina.out
  not every "* 3-6 * * *"
  restart program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso restart"
  if match "^.*OutOfMemoryError.*$"
    then restart
  group tomcat
  depends on tomcat-inigral-sso
Comments (13)
-
repo owner -
Account Deleted reporter Yeah, that makes sense now that I'm thinking about it. I'll look at a few options. Thanks ;)
-
repo owner - changed status to resolved
-
Account Deleted reporter I did want to add that this went on for over 12 hours. Since the http retry checks should have been failing the whole time, I would expect monit to have kicked off a restart of the process at some point. That never happened, which seems like a bug to me. I also checked the process list during this time, and there were no other PIDs that matched, yet monit still thought the process was up.
-
Account Deleted reporter - changed status to open
Added additional comments
-
repo owner Can you please provide the monit log? It could also be useful to run monit in debug mode (-v option).
-
Account Deleted reporter Yes, I'll kick off another reboot tonight and upload the logs tomorrow morning. Monit will respawn with verbose logging on boot. Thanks.
# DEBUG: mo:2345:respawn:/usr/local/monit/bin/monit -v -c /usr/local/monit/conf/monitrc
-
Account Deleted reporter - attached monit.log.tgz
Here are the monit logs from last night going into this morning. The server was rebooted as well. Thanks.
-
repo owner Hello Aaron, thanks for the data and sorry for the late response.
According to the status output values (PPID, UID, GID, etc.) it seems that the process was not found in the process tree, but the getpgid() system call, which is used internally to check whether the PID from the pidfile is running, returned either success or EPERM even though the process was not running.
A workaround could be to use the pattern based process check, provided each catalina instance has a unique pattern. That check is based on a process tree snapshot rather than on getpgid(PID).
Will try to reproduce the problem again.
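To illustrate the kind of probe described above, here is a minimal Python sketch. It is not Monit's code (Monit is written in C); the helper names read_pid and pid_looks_alive are made up for this example, and it assumes a Unix-like system where os.getpgid is available.

```python
import os
from typing import Optional


def read_pid(pidfile: str) -> Optional[int]:
    """Return the PID stored in pidfile, or None if it is missing or malformed."""
    try:
        with open(pidfile) as fh:
            pid = int(fh.read().strip())
    except (OSError, ValueError):
        return None
    return pid if pid > 0 else None


def pid_looks_alive(pid: int) -> bool:
    """Mimic a getpgid()-style liveness probe (hypothetical helper).

    Success and EPERM both count as "some process has this PID";
    only ESRCH means the PID is unused. Nothing here verifies that the
    process is the one that originally wrote the pidfile.
    """
    try:
        os.getpgid(pid)
    except ProcessLookupError:  # ESRCH: no process with this PID
        return False
    except PermissionError:     # EPERM: a process exists, but we may not query it
        return True
    return True
```

A probe like this answers "does any process have this PID?", so a pidfile left over from before a reboot passes it whenever the number happens to be in use by an unrelated process.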
-
repo owner Tried to reproduce on various Linux distributions (Ubuntu 14.10, CentOS 5.11, CentOS 6.6): set up monitoring of 1000 non-existent processes with pidfiles stored in a persistent directory (each pidfile contained a PID which didn't correspond to any running process) and rebooted the machine. After the reboot Monit detected that none of the monitored processes was running, even though the pidfiles did exist.
Based on the data it seems that the culprit is the getpgid() call. If you can test, I can prepare a debug version which will log more details when it's called and compare the result to the process tree snapshot.
Possible workarounds: 1.) either place the pidfile on a ramdisk, which is reset on reboot (/var/run/ should be fine) 2.) or use the pattern based process check
We plan to refactor the process engine in the (near) future and will keep the getpgid() problems in mind.
-
Some notes: Aaron's problem is not only that the PID exists with a different executable, but also that the service's running state is trusted just because the connection test succeeds. I usually use connection testing to validate that the service is running and, as a side effect, to validate that the PID file points to the right process. We have to make sure we are watching the right process.
My basic idea is to cache the executable path of the PID internally and, if it changes while the PID stays the same, at least raise an alert. Keep in mind that the executable path is not necessarily the same as what ps shows (that is the $0/argv[0] variable and can be overridden).
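The caching idea could be sketched roughly as follows. This is only an illustration in Python, not anything Monit implements: the ExeCache class and exe_path helper are hypothetical names, and reading /proc/<pid>/exe is a Linux-specific assumption (it resolves the real binary rather than argv[0]).

```python
import os
from typing import Callable, Dict, Optional, Tuple


def exe_path(pid: int) -> Optional[str]:
    """Resolve the executable behind pid via /proc (Linux-specific assumption)."""
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None  # no such process, or not permitted to inspect it


class ExeCache:
    """Remember (pid, executable) per service and flag a changed executable."""

    def __init__(self, lookup: Callable[[int], Optional[str]] = exe_path):
        self._lookup = lookup
        self._seen: Dict[str, Tuple[int, Optional[str]]] = {}

    def changed(self, service: str, pid: int) -> bool:
        """Return True if pid is unchanged since the last call but its executable differs."""
        exe = self._lookup(pid)
        previous = self._seen.get(service)
        self._seen[service] = (pid, exe)
        return previous is not None and previous[0] == pid and previous[1] != exe
```

Note that a cache held only in memory would not survive a reboot, so catching the stale pidfile case from this report would additionally require persisting the cached path.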
-
repo owner - removed version
Removing version: 5.11 (automated comment)
-
Hey @dasunsrule32 I'm facing a similar issue; the only difference is the environment I'm running my services in. I have a docker container which has monit as pid 1 and is monitoring mongodb, vault and nginx. When I run my docker container for the first time, everything comes up properly, but when I do a docker stop "my container" followed by docker start "mycontainer", due to a residual vault pid file from the old startup, monit behaved weirdly and showed the status of vault as "Running" even though the process didn't exist. This is the Monit status for vault:
Process 'vault'
status Running
monitoring status Monitored
pid -
parent pid -
uid -
effective uid -
gid -
uptime -
threads -
children -
memory -
memory total -
memory percent -
memory percent total -
cpu percent -
cpu percent total -
data collected Mon, 26 Sep 2016 01:06:29
This is a snippet from the monit log file. It seems that for vault it never checked whether the PID in the vault.pid file belongs to the running process.
[PDT Sep 26 01:15:12] debug : 'mongodb' process test failed [pid=147] -- No such process
[PDT Sep 26 01:15:12] info : 'mongodb' start: /bin/bash
[PDT Sep 26 01:15:12] debug : 'mongodb' started
[PDT Sep 26 01:15:12] info : 'mongodb' process is running with pid 15
[PDT Sep 26 01:15:12] debug : 'mongodb' zombie check succeeded
[PDT Sep 26 01:15:19] debug : 'nginx' process test failed [pid=252] -- No such process
[PDT Sep 26 01:15:19] info : 'nginx' start: /sbin/start-stop-daemon
[PDT Sep 26 01:15:19] debug : 'nginx' started
[PDT Sep 26 01:15:40] debug : 'vault' process is running with pid 225
[PDT Sep 26 01:15:40] debug : 'vault' zombie check succeeded
This issue sometimes occurs with vault and sometimes with one of the other processes, and I have been unable to resolve it.
-
The pidfile based process check reads the PID from the file and checks whether a process with that PID is running. In this case some other process was most probably up with the same PID, so the pidfile pointed to a valid process.
The pidfile should be stored in tmpfs/ramdisk (for example /var/run/) so it's removed on reboot. Another option is to use pattern based process check ("check process myprocess matching <pattern>")
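For readers wondering how a pattern based check differs from the pidfile check, the following Python sketch shows the general idea of scanning a process snapshot for a matching command line. It is an assumption-laden illustration, not Monit's implementation: pids_matching is a hypothetical helper, it reads Linux's /proc layout, and the example pattern presumes the instance name appears on the JVM command line.

```python
import os
import re
from typing import List


def pids_matching(pattern: str, proc_root: str = "/proc") -> List[int]:
    """Return PIDs whose command line matches pattern (Linux /proc layout assumed)."""
    regex = re.compile(pattern)
    found = []
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-process entries such as "self" or "meminfo"
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # Arguments in cmdline are NUL-separated; join them with spaces.
        cmdline = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        if cmdline and regex.search(cmdline):
            found.append(int(entry))
    return sorted(found)
```

For example, pids_matching("inigral-sso") would return an empty list after a reboot if that JVM had not started, no matter what an old catalina.pid contains. The pattern has to be unique per instance, otherwise one tomcat could be mistaken for another.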