CPU per-core monitoring, quiet conditional monitoring (noalert), and meaning of tests

Apologies: this request combines three different problems, but they all came up within a single monitoring task (which was supposed to be simple, but isn't).

There is an actual bug in libreswan which makes pluto (the key exchange daemon for IPsec) hang at 100% CPU: https://github.com/libreswan/libreswan/issues/79

-1- First, I tried to create this simple monitoring:

check process pluto with pidfile /var/run/pluto/pluto.pid
    stop  = "/usr/bin/killall -9 pluto"
    start = "/sbin/service ipsec restart"
  if cpu > 12% for 5 cycles then restart

-2- This does not work as intended, because pluto is sometimes up and sometimes down, and I don't need it restarted when it is legitimately down. I only need it restarted when it is up and hangs at 100% load on a single CPU (an infinite loop or some syscalls).

-3- I cannot use monit to detect a single process consuming 100% CPU time, because monit divides CPU time by the number of CPUs. This makes monit tests highly system dependent: some systems have 2 CPUs, some have 8, and so on. Because of this difference, the maximum CPU load value has to be recalculated by a human operator for each system.
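To make the recalculation concrete, here is a minimal sketch of the arithmetic, assuming (as described above) that monit reports a process's CPU usage divided by the number of CPUs. The function name `single_core_threshold` and the `margin` parameter are hypothetical and only serve the illustration.

```python
import os


def single_core_threshold(ncpus: int, margin: float = 0.5) -> float:
    """Normalized CPU percentage that corresponds to one fully busy core.

    Assumption: the monitored value is (process CPU time) / (number of CPUs),
    so one saturated core shows up as 100 / ncpus percent. The margin is
    subtracted so that a "cpu > threshold" comparison can actually trigger.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return max(100.0 / ncpus - margin, 0.0)


if __name__ == "__main__":
    # os.cpu_count() may return None, so fall back to a single CPU.
    ncpus = os.cpu_count() or 1
    print(f"{ncpus} CPUs: restart limit would be {single_core_threshold(ncpus)}%")
```

For example, with 8 CPUs this gives 12.0, and with 2 CPUs it gives 49.5, so the same "one core is stuck" condition needs a different number in the config on each machine.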

I think it is also important to note that this makes a monitoring statement mean something different from what it states, which in turn requires more documentation and thought!

Having monitoring statements mean exactly what is intended to be monitored would be a virtue. People tend to forget nuances when the meaning of a statement is not direct, and this produces actual mistakes when copy-pasting or blindly copying files between systems.

-4- For many situations it is good to have CPU load normalized by total CPU power, as you already have (I didn't find this explained in the documentation, though). But for many other situations it would be useful to have the CPU value not scaled down.

Maybe you could add a specifier to the cpu test to make the cpu value non-scaled, like: if core cpu > 99% then restart? Likewise, core cpu >= 200% would mean that the equivalent of at least two cores is fully loaded.

It is useful to count not only from the total down to fractions, but also from 1 up.
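As a rough illustration of what such a non-scaled value would mean, the sketch below converts a normalized reading back to a per-core scale. It assumes the normalized value is simply the per-core value divided by the number of CPUs. The name `per_core_percent` is hypothetical, and the "core cpu" specifier is only the proposal made here, not existing monit syntax.

```python
def per_core_percent(normalized_percent: float, ncpus: int) -> float:
    """Convert a CPU percentage normalized over all CPUs to a per-core scale.

    On the per-core scale, 100 means one core is fully busy and 200 means
    the equivalent of two fully busy cores, regardless of the CPU count.
    """
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return normalized_percent * ncpus


if __name__ == "__main__":
    # A reading of 12.5% on an 8-CPU machine is one saturated core.
    print(per_core_percent(12.5, 8))
    # A reading of 50% on a 4-CPU machine is two saturated cores' worth.
    print(per_core_percent(50.0, 4))
```

With this scale, a limit such as "> 99%" would carry the same meaning on every system, which is the point of the request.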

-5- Then I tried to solve my problem this way:

check process pluto-daemon with pidfile /var/run/pluto/pluto.pid
  if not exist then alert
     noalert admin1@gmail.com
     noalert admin2@gmail.com
     noalert admin3@gmail.com

check process pluto-cpu with pidfile /var/run/pluto/pluto.pid
    depends on pluto-daemon
    stop  = "/usr/bin/killall -9 pluto"
    start = "/sbin/service ipsec restart"
  if cpu > 12% for 5 cycles then restart

-6- I actually don't need any alert from the pluto-daemon test; it is only a condition for the pluto-cpu test to run. Because I have several admin emails, it would be useful to suppress all alerts for the condition test without listing every default email (which is complicated). I already asked for this in https://bitbucket.org/tildeslash/monit/issues/436/simplified-noalert-for-all-global-set for longer lists, but it would be useful even for shorter admin lists: it would avoid duplicating information (code or config duplication is error prone), simplify configs (includes are still complicated), and make things easy to understand (a statement means exactly what it states).
