cpu per-core monitoring, quiet conditional monitoring (noalerts), and meaning of tests

Issue #525 closed
Don created an issue

Sorry, this request combines three different problems, but they all came up within a single monitoring task (one that was supposed to be simple, but isn't).

There is an actual bug in libreswan which makes pluto (the key exchange daemon for IPsec) hang at 100% CPU: https://github.com/libreswan/libreswan/issues/79

-1- First, I tried to create this simple monitoring:

check process pluto with pidfile /var/run/pluto/pluto.pid
    stop  = "/usr/bin/killall -9 pluto"
    start = "/sbin/service ipsec restart"
  if cpu > 12% for 5 cycles then restart

-2- This does not work as intended, because pluto is sometimes up and sometimes down, and I don't need it restarted when it's legitimately down. I only need it restarted when it's up and hangs with 100% load on a single CPU (an infinite loop or some syscalls).

-3- I cannot use monit to detect when a single process consumes 100% of one CPU's time, because monit divides CPU time by the number of CPUs. This makes monit tests highly system dependent, because some systems have 2 CPUs, some have 8, etc. Given these differences, the maximum CPU load value has to be recalculated by a human operator for each system.
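
To make the arithmetic concrete, here is a minimal Python sketch, assuming (as described above) that the reported value is the process's CPU time divided by the number of cores. The function name scaled_cpu_percent is hypothetical and is not part of monit.

```python
def scaled_cpu_percent(busy_cores: float, total_cores: int) -> float:
    """CPU usage expressed as a share of all cores on the machine.

    busy_cores is how many cores' worth of CPU time the process uses
    (1.0 means one core fully busy). Hypothetical helper, not monit code.
    """
    if total_cores < 1:
        raise ValueError("total_cores must be at least 1")
    return 100.0 * busy_cores / total_cores


# The same spinning single-threaded process on machines of different sizes.
for cores in (2, 4, 8):
    print(cores, scaled_cpu_percent(1.0, cores))
```

Under this assumption the same hang shows up as 50%, 25%, or 12.5% depending on the machine, which is why the limit in the config above has to be recalculated per system.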

I think it is also important to note that this makes a monitoring statement mean something different from what it states, which in turn requires more documentation and thought!

It would be a virtue for monitoring statements to mean exactly what is intended to be monitored. People tend to forget nuances when the meaning of a statement is not direct, and this produces actual mistakes when copy-pasting or blindly copying files between systems.

-4- For many situations it's good to have CPU load normalized by total CPU power, as you already do (I didn't find this explained in the documentation, though). But for many other situations it would be useful to have the CPU value not scaled down.

Maybe you could add a specifier to the cpu test to make the cpu value non-scaled, like 'if core cpu > 99% then restart'? And 'if core cpu >= 200%' would mean at least two cores are loaded.

It is useful to count not only from the total down to fractions, but also to count from 1 up.

-5- Then I tried to solve my problem this way:

check process pluto-daemon with pidfile /var/run/pluto/pluto.pid
  if not exist then alert
     noalert admin1@gmail.com
     noalert admin2@gmail.com
     noalert admin3@gmail.com

check process pluto-cpu with pidfile /var/run/pluto/pluto.pid
    depends on pluto-daemon
    stop  = "/usr/bin/killall -9 pluto"
    start = "/sbin/service ipsec restart"
  if cpu > 12% for 5 cycles then restart

-6- I actually don't need any alerts from the pluto-daemon test; it's only a condition for the pluto-cpu test to run. Because I have several admin emails, it would be useful to suppress all alerts for the condition test without listing all the default emails (in a complicated way). I already asked for this in https://bitbucket.org/tildeslash/monit/issues/436/simplified-noalert-for-all-global-set for longer lists, but it would be useful even for smaller admin lists: it would avoid duplication of information (because code or config duplication is error prone), simplify configs (includes are still complicated), and make things easy to understand (a statement means exactly what it states).

Comments (6)

  1. Tildeslash repo owner

    Which monit version is it?

    Monit reflects the number of threads the process has since monit 5.16, so the cpu usage max is adaptive, based on thread count and CPU cores (whichever is lower). As pluto seems to be a single-threaded application (https://libreswan.org/wiki/Pluto_internals), Monit 5.16 or later should report its CPU usage as ~100% if it spins.

    Excerpt from changelog which explains the CPU usage formula:

    Fixed: Issue #230: The process CPU usage calculation now reflects the number of
    process threads. Originally monit showed process' CPU usage as its fraction of
    all available CPU resources utilization (number of CPU cores). For single-threaded
    applications that was however tricky, as such process may utilize one CPU core only
    and if it was working on its limits, on 8-CPU-cores machine monit showed 12.5% CPU
    utilization (100/8). If you wanted to check that the process is not stuck on 100%,
    you had to adjust the limit reflecting the CPU cores on the machine. Monit now
    calculates the CPU usage based on number of threads vs. available CPU cores. If the
    process has one thread, the 100% CPU usage equals to 100% utilization of one CPU core.
    If it has 2 threads, 100% CPU usage is reported when it uses 2 CPU cores on 100%, etc.
    If the process has more threads then the machine's available CPU cores, then the 100%
    CPU usage corresponds to utilization of all available CPU cores.
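
    The formula in the excerpt can be sketched roughly as follows. This is an illustration of the changelog text only, not monit's actual code; the name adaptive_cpu_percent and its parameters are hypothetical.

```python
def adaptive_cpu_percent(busy_cores: float, threads: int, total_cores: int) -> float:
    """CPU usage relative to the cores the process could use at most.

    Per the changelog text, 100% corresponds to min(threads, total_cores)
    fully busy cores. Hypothetical helper, not monit code.
    """
    if threads < 1 or total_cores < 1:
        raise ValueError("threads and total_cores must be at least 1")
    usable_cores = min(threads, total_cores)
    return 100.0 * busy_cores / usable_cores


# A single-threaded process spinning on one core of an 8-core machine.
print(adaptive_cpu_percent(1.0, threads=1, total_cores=8))
# A 2-thread process that keeps only one core busy.
print(adaptive_cpu_percent(1.0, threads=2, total_cores=8))
# A 16-thread process saturating all 8 cores.
print(adaptive_cpu_percent(8.0, threads=16, total_cores=8))
```

    With this reading, a spinning single-threaded process reports 100% regardless of how many cores the machine has.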
    
  2. Don reporter

    Interesting. My monit on this box is 5.14, the default on CentOS from the EPEL repo. But this makes things even more complicated. (I think it should be described not just in the changelog, but in the docs.)

    I propose adding a qualifier to the 'cpu' keyword which would let the admin select the approach that fits his needs, based on his better understanding of what should be measured. For example: adaptive cpu, scaled cpu, and core cpu.

    P.S. I still cannot configure monit to properly monitor pluto for CPU hangs. With the above config I am still getting 'process is not running'/'process is running with pid' alerts from the pluto-cpu sensor, even though it has 'depends on pluto-daemon'. This is already taking too much time for a supposedly simple monitoring task.

  3. Tildeslash repo owner

    Please upgrade monit to 5.16 or later - it'll solve the problem with no special configuration needed.

    The proposed cpu usage modifier will just turn the problem upside down in a less natural way ... the admin will have to set the limit based on the number of threads the process has, for example: core cpu usage > 400%. Monit presents all CPU resources available to the process as 100%.
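
    As a rough illustration of that point, the sketch below converts a limit expressed in the adaptive form into the equivalent unscaled per-core limit. The 'core cpu' unit is only the proposal from this thread, not an existing monit feature, and the function name core_cpu_limit is hypothetical.

```python
def core_cpu_limit(adaptive_limit_percent: float, threads: int, total_cores: int) -> float:
    """Equivalent unscaled limit, where 100% means one fully busy core.

    Assumes the adaptive value treats min(threads, total_cores) busy cores
    as 100%. Hypothetical helper for the proposed 'core cpu' unit.
    """
    if threads < 1 or total_cores < 1:
        raise ValueError("threads and total_cores must be at least 1")
    return adaptive_limit_percent * min(threads, total_cores)


# A 4-thread process on an 8-core machine: "fully loaded" is 400% unscaled.
print(core_cpu_limit(100.0, threads=4, total_cores=8))
# A single-threaded process: the two forms coincide.
print(core_cpu_limit(100.0, threads=1, total_cores=8))
```

    The unscaled limit changes with the thread count of each process, while the adaptive limit stays at 100% for all of them.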
