Monit thinks program is running when defunct pid file is present

Issue #151 open
Aaron Echols (dasunsrule32)
created an issue

I have monit checking several hundred tomcat processes. We rebooted the server for patching and several of the JVMs didn't start, because the old defunct catalina.pid file was still present after the reboot. Here is a sample output of the issue:

Process 'tomcat-inigral-sso'
  status                            Running
  monitoring status                 Monitored
  pid                               6022
  parent pid                        0
  uid                               -1
  effective uid                     -1
  gid                               -1
  uptime                            0m
  children                          0
  memory                            0.0 B
  memory total                      0.0 B
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
  port response time                0.000s to localhost:58097 [HTTP via TCP]
  data collected                    Thu, 05 Feb 2015 10:30:09

If I remove the catalina.pid file, monit will pick up that it isn't running and restart the service as intended.

[MST Feb  5 10:37:13] error    : 'tomcat-inigral-sso' process is not running
[MST Feb  5 10:37:13] info     : 'tomcat-inigral-sso' trying to restart
[MST Feb  5 10:37:13] info     : 'tomcat-inigral-sso' restart: /usr/local/tomcat/inigral-sso/tomcat-inigral-sso

My check looks like this:

# *** ANSIBLE MANAGED - DO NOT EDIT BY HAND ***
#
check process tomcat-inigral-sso with pidfile /usr/local/tomcat/inigral-sso/catalina.pid
   not every "* 3-6 * * *"
   start program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso start"
   stop program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso stop"
   restart program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso restart"
   if failed port 58097 and protocol http retry 6
      then restart
   group tomcat

check file tomcat-inigral-sso-logs with path /usr/local/tomcat/inigral-sso/logs/catalina.out
   not every "* 3-6 * * *"
   restart program = "/usr/local/tomcat/inigral-sso/tomcat-inigral-sso restart"
   if match
      "^.*OutOfMemoryError.*$"
      then restart
   group tomcat
   depends on tomcat-inigral-sso

Comments (13)

  1. Tildeslash repo owner

    The pidfile-based process check reads the PID from the file and checks whether that process is running. In this case some other process was most probably up with the same PID, so the pidfile pointed to a valid process.

    The pidfile should be stored in tmpfs/ramdisk (for example /var/run/) so that it is removed on reboot. Another option is to use the pattern-based process check ("check process myprocess matching <pattern>").
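
The pattern-based alternative mentioned above can be illustrated roughly as follows: instead of trusting a PID read from a file, scan a snapshot of the process table and look for a command line that matches a regular expression. This is a minimal Python sketch of the idea, not Monit's implementation; the function names, the Linux-only /proc layout and the sample command line in the note below are assumptions made for illustration.

```python
import os
import re
from typing import Iterable, Iterator, List, Tuple


def snapshot_cmdlines(proc_root: str = "/proc") -> Iterator[Tuple[int, str]]:
    """Linux only: yield (pid, command line) for every readable process."""
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # not a per-process directory
        try:
            with open(os.path.join(proc_root, entry, "cmdline"), "rb") as fh:
                raw = fh.read()
        except OSError:
            continue  # process exited between listdir() and open()
        # Arguments are NUL-separated in /proc/<pid>/cmdline.
        text = raw.replace(b"\0", b" ").decode(errors="replace").strip()
        yield int(entry), text


def find_matching(pattern: str, processes: Iterable[Tuple[int, str]]) -> List[int]:
    """Return the PIDs whose command line matches the regular expression."""
    regex = re.compile(pattern)
    return [pid for pid, cmdline in processes if cmdline and regex.search(cmdline)]
```

With a hypothetical snapshot such as [(6022, "java -Dcatalina.base=/usr/local/tomcat/inigral-sso ...")], calling find_matching(r"catalina\.base=/usr/local/tomcat/inigral-sso", snapshot) returns [6022], while a stale pidfile alone would match nothing. On a live Linux host the snapshot would come from snapshot_cmdlines().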

  2. Aaron Echols (dasunsrule32) reporter

    I did want to add that this went on for over 12 hours. I would think that at some point, since the http retry checks should have been failing, it would have kicked off a restart of the process. This never happened, which seems like a bug to me. I also checked memory during this time, and there were no other PIDs that matched, yet monit still thought that it was up.

  3. Aaron Echols (dasunsrule32) reporter

    Yes, I'll kick off another reboot tonight and upload the logs tomorrow morning. It will respawn with the verbose logs tonight on boot. Thanks.

    # DEBUG:
    mo:2345:respawn:/usr/local/monit/bin/monit -v -c /usr/local/monit/conf/monitrc
    
  4. Tildeslash repo owner

    Hello Aaron, thanks for the data and sorry for the late response.

    According to the status output values (PPID, UID, GID, etc.), it seems that the process was not found in the process tree, but the getpgid() system call, which is used internally to check whether the PID from the pidfile is running, returned either success or EPERM even though the process was not running.

    The workaround could be to use the pattern-based process check if each catalina instance has a unique pattern. Such a check is based on a process tree snapshot rather than getpgid(PID).

    Will try to reproduce the problem again.
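
The two ways of deciding "is this PID alive?" contrasted above can be sketched as follows. Python's os.getpgid() wraps the same system call, so the first function mirrors the described logic of treating success or EPERM as "running"; the second looks the PID up in the process table instead. Both function names are hypothetical, the second is Linux-only, and none of this is taken from Monit's source.

```python
import errno
import os


def pid_alive_getpgid(pid: int) -> bool:
    """Treat getpgid() success or EPERM as 'running', anything else as 'gone'."""
    if pid <= 0:
        return False  # getpgid(0) would query the calling process itself
    try:
        os.getpgid(pid)
        return True
    except OSError as exc:
        # EPERM means the query was refused, which is read here as
        # "the PID exists"; ESRCH means no such process.
        return exc.errno == errno.EPERM


def pid_in_process_table(pid: int) -> bool:
    """Linux only: a PID is present if /proc/<pid> exists."""
    return pid > 0 and os.path.isdir(f"/proc/{pid}")
```

If the two functions ever disagree for the PID stored in a pidfile, that would correspond to the symptom in the status output: a service reported as Running with no parent PID, UID or memory figures.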

  5. Tildeslash repo owner

    Tried to reproduce on various Linux distributions (Ubuntu 14.10, CentOS 5.11, CentOS 6.6): set up monitoring of 1000 non-existent processes with pidfiles stored in a persistent directory (each pidfile contained a PID which doesn't correspond to any running process) and rebooted the machine. After the reboot Monit detected that none of the monitored processes was running, even though the pidfiles did exist.

    Based on the data it seems that the culprit is the getpgid() call. If you can test, I can prepare a debug version which will log more details when it's called and compare it to the process tree snapshot.

    Possible workarounds: 1.) either place the pidfile on a ramdisk, which is reset on reboot (/var/run/ should be fine), 2.) or use the pattern-based process check.

    We plan to refactor the process engine in the (near) future and will keep the getpgid() problems in mind.

  6. Gábor Garami

    Some notes: Aaron's problem is not only that the PID exists with a different executable, but also that the service's running state is trusted just because the connection test succeeds. I usually use connection testing to validate that the service is running and, as a side effect, to validate that the PID file points to the right process. We have to make sure we are watching the right process.

    My basic idea is to cache the PID's executable path internally and, if it changes while the PID stays unchanged, to raise at least an alert. Keep in mind that the executable path is not necessarily the same as what ps shows (that's the $0/argv[0] variable, which can be overridden).
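
The caching idea above could look roughly like this. The sketch remembers the (PID, executable path) pair per service and reports an alert when the path changes while the PID stays the same. The class and function names are hypothetical, and resolving the path through /proc/<pid>/exe is a Linux-specific assumption; that link gives the real executable rather than argv[0], and reading it for another user's process typically needs elevated privileges.

```python
import os
from typing import Callable, Dict, Optional, Tuple


def read_exe_path(pid: int) -> Optional[str]:
    """Linux only: resolve /proc/<pid>/exe; None if gone or unreadable."""
    try:
        return os.readlink(f"/proc/{pid}/exe")
    except OSError:
        return None


class ExeWatcher:
    """Remember which executable each service's PID belonged to."""

    def __init__(self, resolve: Callable[[int], Optional[str]] = read_exe_path):
        self._resolve = resolve
        self._seen: Dict[str, Tuple[int, str]] = {}

    def check(self, service: str, pid: int) -> str:
        exe = self._resolve(pid)
        if exe is None:
            self._seen.pop(service, None)  # forget the stale entry
            return "not running"
        previous = self._seen.get(service)
        self._seen[service] = (pid, exe)
        # Same PID as last time but a different executable: the pidfile
        # is probably pointing at an unrelated process.
        if previous is not None and previous[0] == pid and previous[1] != exe:
            return "alert"
        return "ok"
```

Because the cache lives in memory, it would not by itself catch a pidfile that survives a reboot; persisting the expected path alongside the check would be one way to cover that case.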

  7. Shrenik

    Hey @Aaron Echols (dasunsrule32), I'm facing a similar issue; the only difference is the environment I'm running my services in. I have a docker container which has monit as pid 1 and is monitoring mongodb, vault and nginx. When I run my docker container for the first time, everything comes up properly, but when I do a docker stop "my container" followed by a docker start "mycontainer", a residual vault pid file from the old startup makes monit behave weirdly: it shows the status of vault as "Running" even though the process doesn't exist. This is the Monit status for vault:

    Process 'vault'
      status                            Running
      monitoring status                 Monitored
      pid                               -
      parent pid                        -
      uid                               -
      effective uid                     -
      gid                               -
      uptime                            -
      threads                           -
      children                          -
      memory                            -
      memory total                      -
      memory percent                    -
      memory percent total              -
      cpu percent                       -
      cpu percent total                 -
      data collected                    Mon, 26 Sep 2016 01:06:29
    

    This is a snippet from the monit log file. It seems that for vault it didn't test at all whether the pid in the vault.pid file belongs to the running process.

    [PDT Sep 26 01:15:12] debug    : 'mongodb' process test failed [pid=147] -- No such process
    [PDT Sep 26 01:15:12] info     : 'mongodb' start: /bin/bash
    [PDT Sep 26 01:15:12] debug    : 'mongodb' started
    [PDT Sep 26 01:15:12] info     : 'mongodb' process is running with pid 15
    [PDT Sep 26 01:15:12] debug    : 'mongodb' zombie check succeeded
    [PDT Sep 26 01:15:19] debug    : 'nginx' process test failed [pid=252] -- No such process
    [PDT Sep 26 01:15:19] info     : 'nginx' start: /sbin/start-stop-daemon
    [PDT Sep 26 01:15:19] debug    : 'nginx' started
    [PDT Sep 26 01:15:40] debug    : 'vault' process is running with pid 225
    [PDT Sep 26 01:15:40] debug    : 'vault' zombie check succeeded
    

    This issue sometimes occurs with vault and sometimes with some other process, and I'm unable to resolve it.
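
In the spirit of the ramdisk workaround suggested earlier in the thread, one possible approach for a restarted container is to delete leftover pidfiles before the supervisor starts, so that nothing from the previous run survives. The sketch below is only an illustration under that assumption; the function name is hypothetical and so is any pidfile path passed to it.

```python
import os
from typing import Iterable, List


def clear_pidfiles(paths: Iterable[str]) -> List[str]:
    """Delete leftover pidfiles and return the ones that were removed."""
    removed = []
    for path in paths:
        try:
            os.remove(path)
        except FileNotFoundError:
            continue  # nothing left over from a previous run
        removed.append(path)
    return removed
```

A wrapper entrypoint could call this with the pidfile paths named in the monit checks and only then start monit. Removing the files unconditionally avoids having to decide whether a recorded PID is still valid, which matters in a container where low PIDs are likely to be reused after a restart.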
