children count triggers false negative alerts

Issue #705 resolved
Former user created an issue

Using 5.20, with the following:

check process nginx with pidfile "/var/run/nginx.pid"
  if failed (url http://localhost/monit and content == "monit-check") then alert
  if children < 4 then alert
check process keepalived with pidfile /var/run/keepalived.pid
  if children < 2 then alert

I sometimes get false positive alarms saying that the parent process has 0 children. The alert lasts for only one cycle; the check is back to normal on the next cycle.

[CET Jan  6 23:57:52] error    : 'nginx' children count 0 matches resource limit [children<4]
[CET Jan  6 23:57:52] error    : 'keepalived' children count 0 matches resource limit [children<2]
[CET Jan  6 23:59:52] info     : 'nginx' children check succeeded [current children=4]
[CET Jan  6 23:59:52] info     : 'keepalived' children check succeeded [current children=2]

The problem is that the children never went away; they have been running for much longer. So something weird is happening?

root     10700  0.0  0.0  51572    56 ?        Ss    2016  26:42 /usr/sbin/keepalived
root     10701  0.0  0.0  57964    60 ?        S     2016  27:34  \_ /usr/sbin/keepalived
root     10702  0.0  0.0  57840    84 ?        S     2016  67:56  \_ /usr/sbin/keepalived
root     32267  0.0  0.0 109548  1260 ?        Ss    2017   0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data 11979  0.0  0.0 111992  5792 ?        S     2017  25:30  \_ nginx: worker process                           
www-data 11980  0.0  0.0 112072  6340 ?        S     2017  24:56  \_ nginx: worker process                           
www-data 11981  0.0  0.0 112572  6096 ?        S     2017  25:15  \_ nginx: worker process                           
www-data 11982  0.0  0.0 112188  6504 ?        S     2017  25:03  \_ nginx: worker process       

I can of course work around it by adding 'for 2 cycles' to the alert, but that's awkward :)
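
For reference, the 'for 2 cycles' workaround applied to the configuration above might look like the following. This is only a sketch of the workaround, not a fix: it hides the one-cycle false alarms, at the cost of reporting a genuine loss of children one cycle later.

```
check process nginx with pidfile "/var/run/nginx.pid"
  if failed (url http://localhost/monit and content == "monit-check") then alert
  if children < 4 for 2 cycles then alert
check process keepalived with pidfile /var/run/keepalived.pid
  if children < 2 for 2 cycles then alert
```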

Comments (7)

  1. Landry Breuil

    Forgot to mention this was on Linux (Debian jessie), using the jessie-backports packages from Debian.

  2. Landry Breuil

    I'm looking at the code that does the ProcessTree handling, but I don't see what could cause those 'failures'...

  3. Tildeslash repo owner

    There was a problem where monit may fail to connect children to their parent process: monit takes a snapshot of the process tree and then collects details for each process. If some process stopped before monit collected its details, the tree may break.
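
The kind of race described above can be sketched in a few lines of Python. This is a hypothetical illustration only, not monit's actual code: the function names, the "give up on the first vanished process" behaviour, and the short-lived PID 4242 are all assumptions made for the example. The other PIDs are taken from the ps output in the report.

```python
from collections import Counter

def count_children_fragile(snapshot, read_details):
    """Hypothetical fragile linker: abandons the whole tree if any process vanished."""
    children = Counter()
    for pid, ppid in snapshot:
        if read_details(pid) is None:   # process exited after the snapshot was taken
            return Counter()            # tree left unlinked: every parent reports 0
        children[ppid] += 1
    return children

def count_children_tolerant(snapshot, read_details):
    """Hypothetical tolerant linker: skips only the vanished process."""
    children = Counter()
    for pid, ppid in snapshot:
        if read_details(pid) is None:
            continue
        children[ppid] += 1
    return children

# (pid, ppid) pairs as captured by the snapshot step.
snapshot = [
    (10700, 1), (10701, 10700), (10702, 10700),   # keepalived and its 2 children
    (4242, 1),                                    # assumed short-lived process
    (32267, 1), (11979, 32267), (11980, 32267),
    (11981, 32267), (11982, 32267),               # nginx master and its 4 workers
]

vanished = {4242}  # exits between the snapshot and the detail collection

def read_details(pid):
    """Stand-in for reading per-process details; None means the process is gone."""
    return None if pid in vanished else {"pid": pid}

fragile = count_children_fragile(snapshot, read_details)
tolerant = count_children_tolerant(snapshot, read_details)
print("fragile :", fragile[32267], fragile[10700])
print("tolerant:", tolerant[32267], tolerant[10700])
```

In this sketch an unrelated process disappearing is enough to make long-running parents appear childless for one cycle, which matches the symptom in the logs, though the real cause inside monit may differ in detail.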

    The fix was checked in to the development repository, you can test it if you want:

    wget https://bitbucket.org/tildeslash/monit/get/master.tar.gz
    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    
  4. Tildeslash repo owner

    There were multiple check-ins with fixes and code cleanup, so I suggest testing the whole repository snapshot (there's no single check-in I can point to).

  5. Landry Breuil

    After running master as of d7af20a for a number of days on 4 Debian servers, I can confirm that the false positive alarms about missing children are gone. Definitely an improvement :) I suppose this ticket can be closed as fixed in the upcoming 5.25.2 or 5.26 release. Thanks!
