"check match" occasionnaly fails to find a running process

Issue #60 resolved
Olivier Grange-Labat
created an issue

Hello,

We're trying to monitor Redis running on two hosts (CentOS 5), running several primary and replica instances of redis.

For that, we've created several "check match" ("match" instead of "with pidfile" because Redis fails to create them files although they are set in config files with "pidfile /var/run/redis_XXX.pid" -- but that's another point):

check process redis.serviceA match "redis[0-9]+\..*\.serviceA"
       if failed
          port 6380
          send "SET MONIT-CHECK ok\r\n"
          expect "OK"
          send "EXISTS MONIT-CHECK\r\n"
          expect ":1"
       then alert
       if 5 restarts within 5 cycles then timeout

check process redis.serviceB match "redis[0-9]+\..*\.serviceB"
       if failed
          port 6381
          send "SET MONIT-CHECK ok\r\n"
          expect "OK"
          send "EXISTS MONIT-CHECK\r\n"
          expect ":1"
       then alert
       if 5 restarts within 5 cycles then timeout

check process redis.serviceC match "redis[0-9]+\..*\.serviceC"
       if failed
          port 6382
          send "SET MONIT-CHECK ok\r\n"
          expect "OK"
          send "EXISTS MONIT-CHECK\r\n"
          expect ":1"
       then alert
       if 5 restarts within 5 cycles then timeout

check process redis.serviceD match "redis[0-9]+\..*\.serviceD"
       if failed
          port 6383
          send "SET MONIT-CHECK ok\r\n"
          expect "OK"
          send "EXISTS MONIT-CHECK\r\n"
          expect ":1"
       then alert
       if 5 restarts within 5 cycles then timeout

monit occasionally fails to find these Redis process, reporting "process is not running". On the next cycle, it actually reports the process running with the same pid as before the failed check:

Jun 10 00:55:15 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 00:55:35 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 01:20:53 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 01:20:53 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 01:22:25 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 01:47:43 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 01:47:43 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 01:49:14 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 02:14:32 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 02:14:32 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 02:16:04 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 02:41:22 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 02:41:22 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 02:42:53 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 03:26:25 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 03:26:25 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 03:27:56 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 03:34:41 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 03:34:41 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 03:36:13 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 04:44:01 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 04:44:01 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 04:45:33 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 04:58:42 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 04:58:42 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 05:00:14 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 05:47:28 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 05:47:28 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 05:47:38 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 05:47:38 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 05:48:49 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 05:49:09 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 05:58:57 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 05:58:57 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 06:00:28 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 06:07:13 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 06:07:13 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 06:08:44 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 06:09:04 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 06:09:04 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 06:10:35 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 06:40:57 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 06:40:57 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 06:42:28 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 07:12:50 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 07:12:50 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 07:14:22 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 07:44:44 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 07:44:44 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 07:46:15 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 08:16:37 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 08:16:37 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 08:18:08 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 08:31:18 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 08:31:18 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 08:32:49 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 08:43:17 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 08:43:17 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 08:44:48 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 09:10:06 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 09:10:06 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 09:11:37 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 09:41:59 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 09:41:59 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 09:43:31 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 10:08:49 db0-redis monit[20519]: 'redis.serviceD' process is not running
Jun 10 10:08:49 db0-redis monit[20519]: 'redis.serviceD' trying to restart
Jun 10 10:10:20 db0-redis monit[20519]: 'redis.serviceD' process is running with pid 6923
Jun 10 10:20:08 db0-redis monit[20519]: 'redis.serviceB' process is not running
Jun 10 10:20:08 db0-redis monit[20519]: 'redis.serviceB' trying to restart
Jun 10 10:21:39 db0-redis monit[20519]: 'redis.serviceB' process is running with pid 7170
Jun 10 10:39:01 db0-redis monit[20519]: Shutting down monit HTTP server
Jun 10 10:39:02 db0-redis monit[20519]: monit HTTP server stopped
Jun 10 10:39:02 db0-redis monit[20519]: M/Monit heartbeat stopped
Jun 10 10:39:02 db0-redis monit[20519]: monit daemon with pid [20519] killed
Jun 10 10:39:02 db0-redis monit[20519]: 'redis' Monit stopped
Jun 10 10:39:08 db0-redis monit[11018]: Invalid argument -- h  (-h will show valid arguments)
Jun 10 10:39:22 db0-redis monit[11132]: Starting monit daemon with http interface at [localhost:2812]
Jun 10 10:39:22 db0-redis monit[11132]: Starting monit HTTP server at [localhost:2812]
Jun 10 10:39:22 db0-redis monit[11132]: monit HTTP server started
Jun 10 10:39:22 db0-redis monit[11132]: 'redis' Monit started
Jun 10 10:39:22 db0-redis monit[11132]: M/Monit heartbeat started
Jun 10 10:39:22 db0-redis monit[11132]: skipping queued event /var/monit/id - unknown data format
Jun 10 10:39:22 db0-redis monit[11132]: skipping queued event /var/monit/state - unknown data format
Jun 10 10:51:42 db0-redis monit[11132]: 'redis.serviceD' process is not running
Jun 10 10:51:42 db0-redis monit[11132]: 'redis.serviceD' trying to restart
Jun 10 10:53:13 db0-redis monit[11132]: 'redis.serviceD' process is running with pid 6923
Jun 10 11:23:35 db0-redis monit[11132]: 'redis.serviceD' process is not running
Jun 10 11:23:35 db0-redis monit[11132]: 'redis.serviceD' trying to restart
Jun 10 11:25:06 db0-redis monit[11132]: 'redis.serviceD' process is running with pid 6923
Jun 10 11:26:27 db0-redis monit[11132]: 'redis.serviceB' process is not running
Jun 10 11:26:27 db0-redis monit[11132]: 'redis.serviceB' trying to restart
Jun 10 11:27:58 db0-redis monit[11132]: 'redis.serviceB' process is running with pid 7170
Jun 10 11:45:11 db0-redis monit[11132]: 'redis.serviceD' process is not running
Jun 10 11:45:11 db0-redis monit[11132]: 'redis.serviceD' trying to restart
Jun 10 11:46:42 db0-redis monit[11132]: 'redis.serviceD' process is running with pid 6923
Jun 10 12:17:04 db0-redis monit[11132]: 'redis.serviceD' process is not running
Jun 10 12:17:04 db0-redis monit[11132]: 'redis.serviceD' trying to restart
Jun 10 12:18:35 db0-redis monit[11132]: 'redis.serviceD' process is running with pid 6923
Jun 10 12:34:01 db0-redis monit[11132]: Shutting down monit HTTP server
Jun 10 12:34:02 db0-redis monit[11132]: monit HTTP server stopped
Jun 10 12:34:02 db0-redis monit[11132]: M/Monit heartbeat stopped
Jun 10 12:34:02 db0-redis monit[11132]: monit daemon with pid [11132] killed
Jun 10 12:34:02 db0-redis monit[11132]: 'redis' Monit stopped
Jun 10 12:34:06 db0-redis monit[5086]: Starting monit daemon with http interface at [localhost:2812]
Jun 10 12:34:06 db0-redis monit[5088]: Starting monit HTTP server at [localhost:2812]
Jun 10 12:34:06 db0-redis monit[5088]: monit HTTP server started
Jun 10 12:34:06 db0-redis monit[5088]: 'redis' Monit started
Jun 10 12:34:06 db0-redis monit[5088]: M/Monit heartbeat started
Jun 10 12:34:06 db0-redis monit[5088]: skipping queued event /var/monit/id - unknown data format
Jun 10 12:34:06 db0-redis monit[5088]: skipping queued event /var/monit/state - unknown data format

...although they have been running smoothly for weeks:

[root@redis ~]# ps auxwwww|grep redis
redis     6756  0.0  0.3 314124 178724 ?       Ssl  Mar04  18:44 /opt/redis-2.4.6/bin/redis-server /etc/redis/conf/redis246.master.serviceC.conf
redis     6923  1.6  6.3 3615500 3132904 ?     Ssl  Mar04 2294:11 /opt/redis-2.4.6/bin/redis-server /etc/redis/conf/redis246.master.serviceD.conf
redis     7045  0.4  0.0  97036  9100 ?        Ssl  Mar04 625:18 /opt/redis-2.4.6/bin/redis-server /etc/redis/conf/redis246.master.serviceA.conf
redis     7170  4.1  0.6 2063116 320524 ?      Ssl  Mar04 5895:47 /opt/redis-2.4.6/bin/redis-server /etc/redis/conf/redis246.master.serviceB.conf
root      9290  0.0  0.0  63304   832 pts/1    S+   11:38   0:00 grep redis
[root@redis ~]#

regexps used have been tested w/ monit procmatch successfully (they report 1 match).

I've tried running monit w/ monit -I -vv but it didn't help finding out the root cause. Checking the source files seems to show that this could be a getpgid() problem.

Thank you

Comments (8)

  1. Olivier Grange-Labat reporter

    I just wanted to let you know that due to the nature of this bug, we'll move from a "check process match" to a "check process pidfile" on monday so won't be able to help reproducing the problem.

    Thank you

  2. Tildeslash repo owner

    Hello Olivier,

    the problem is caused when the process uses fork/clone to create some child process. That creates temporary copy of the parent process with identical pattern, followed usually by exec which replaces the image. Monit does snapshot of the process tree on the start of its monitoring cycle and then applies the tests - when it searches for the pattern in the process tree snapshot, it finds the PID of the first matching process (child after fork/clone, but before exec) and it verifies that this PID is running right now as there is some delay since snapshot and the process may die in the meantime ... because the child exited already, there is no such process and an event is triggered. Next cycle the original parent process is found, as its the only pattern matching process.

    Redis uses fork to create child processes => the pattern is not so unique at all times as it may seem.

    We'll fix this problem in the future version (we're working on 5.9 release, it won't make it to this version), in the case that there will be subtree with multiple matches, we'll prefer the parent. Putting the issue on hold for this release.

    The pattern based process check should not be used for applications which create child processes, the pidfile based test should be used in such case (as you did already).

    Regards, The Monit team

  3. Tildeslash repo owner

    Fix Issue #60: The process match check may report false positive ... if we find matching process and it is not running, scan the rest of process tree.

    Fix Issue #214: The process match check has high CPU requirements on low power systems ... use cached PID.

    → <<cset e5bbe778c8be>>

  4. Log in to comment