children count triggers false positive alerts
Using monit 5.20, with the following configuration:
check process nginx with pidfile "/var/run/nginx.pid"
    if failed (url http://localhost/monit and content == "monit-check") then alert
    if children < 4 then alert

check process keepalived with pidfile /var/run/keepalived.pid
    if children < 2 then alert
I sometimes get false positive alarms saying that the parent process has 0 children. The alert only lasts for one cycle; the check is back to normal on the next cycle.
[CET Jan 6 23:57:52] error : 'nginx' children count 0 matches resource limit [children<4]
[CET Jan 6 23:57:52] error : 'keepalived' children count 0 matches resource limit [children<2]
[CET Jan 6 23:59:52] info : 'nginx' children check succeeded [current children=4]
[CET Jan 6 23:59:52] info : 'keepalived' children check succeeded [current children=2]
The problem is that the children never went away: they have been running for much longer than that, so something odd is going on.
root     10700 0.0 0.0  51572   56 ? Ss 2016 26:42 /usr/sbin/keepalived
root     10701 0.0 0.0  57964   60 ? S  2016 27:34  \_ /usr/sbin/keepalived
root     10702 0.0 0.0  57840   84 ? S  2016 67:56  \_ /usr/sbin/keepalived
root     32267 0.0 0.0 109548 1260 ? Ss 2017  0:00 nginx: master process /usr/sbin/nginx -g daemon on; master_process on;
www-data 11979 0.0 0.0 111992 5792 ? S  2017 25:30  \_ nginx: worker process
www-data 11980 0.0 0.0 112072 6340 ? S  2017 24:56  \_ nginx: worker process
www-data 11981 0.0 0.0 112572 6096 ? S  2017 25:15  \_ nginx: worker process
www-data 11982 0.0 0.0 112188 6504 ? S  2017 25:03  \_ nginx: worker process
I can of course work around it by adding 'for 2 cycles' to the alert, but that's awkward :)
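For reference, the workaround would look roughly like this. It is only a sketch, assuming the 'for N cycles' qualifier is accepted on the children test the same way it is on monit's other resource tests:

```
# Workaround sketch: only alert when the children test fails on
# two consecutive cycles, so a one-cycle glitch is ignored.
check process nginx with pidfile "/var/run/nginx.pid"
    if children < 4 for 2 cycles then alert

check process keepalived with pidfile /var/run/keepalived.pid
    if children < 2 for 2 cycles then alert
```

The downside is that a genuine loss of children is reported one cycle later (about two minutes with the cycle length visible in the log above).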
Comments (7)
-
-
I'm looking at the code that does the ProcessTree handling, but I don't see what could cause those 'failures'...
-
repo owner There was a problem where monit could fail to connect children to their parent process: monit takes a snapshot of the process tree and then collects details for each process. If some process stopped before monit collected its details, this could break the tree.
The fix has been checked in to the development repository; you can test it if you want:
wget https://bitbucket.org/tildeslash/monit/get/master.tar.gz
tar -xzf master.tar.gz
cd tildeslash*
./bootstrap
./configure
make
-
Ah, good to know. Which is the corresponding commit/issue, so that I can test it properly? Is it 68e379f?
-
repo owner There were multiple check-ins with fixes and code cleanup, so I suggest using the whole repository snapshot (there's no single check-in I can point to).
-
After running master as of d7af20a for a number of days on 4 Debian servers, I can confirm that the false positive alarms about missing children are gone. Definitely an improvement :) I suppose this ticket can be closed as fixed in the upcoming 5.25.2 or 5.26 release. Thanks!
-
repo owner - changed status to resolved
Forgot to mention: this was on Linux (Debian jessie), using the jessie-backports packages from Debian.