"TOTAL CPU" usage calculation incorrect

Issue #657 resolved
A Thinking Ape created an issue

This is related to the fix for Issue #230, commit 215da7aa86fd

Your calculation needs to sum up all the threads of the parent AND child processes, not just the threads of the parent process.

Say you have one parent process and 10 child processes (all single-threaded) on a 20-core machine. Even though those processes could be using up to 11 cores, your current calculation causes monit to report 100% utilization if each process uses only 1/11th of a core.
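
The arithmetic in this example can be sketched in a few lines of Python. This is not monit's code. It assumes each process's CPU figure is a percentage of one core and that the summed total is capped at 100%, which is how the reported behaviour reads. The variable names are made up for illustration.

```python
from fractions import Fraction

# Scenario from the report: 1 parent + 10 children, all single-threaded,
# each using 1/11th of one core.
process_count = 11
per_process_pct = Fraction(100, 11)  # percent of ONE core (assumed convention)

# Behaviour described in the report: the per-process percentages are simply
# added together and capped at 100.
summed_pct = min(Fraction(100), per_process_pct * process_count)

# What the report asks for, as read here: divide by the number of threads
# across the parent AND all children (one thread per process in this case).
total_threads = process_count * 1
expected_pct = per_process_pct * process_count / total_threads

print(float(summed_pct))              # 100.0
print(round(float(expected_pct), 2))  # 9.09
```

Exact fractions are used so that eleven elevenths add up to exactly 100 rather than a rounded float.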

Comments (14)

  1. Tildeslash repo owner

    The current calculation is correct.

    There are two CPU usage tests:

    1.) "if cpu ..." - this test calculates only the process itself

    2.) "if total cpu ..." - this test calculates the process AND all its children

  2. A Thinking Ape reporter

    I realize this, BUT the "total cpu" check and the "total cpu" value shown on the main page are wrong and exhibit the behavior I describe. "total cpu" does NOT account for the child processes and threads, when it should.

  3. A Thinking Ape reporter
    • changed status to open

    The "total cpu" check and the "total cpu" value shown on the main page are wrong and exhibit the behavior I describe.

  4. A Thinking Ape reporter

    Example: a two-threaded process with 9 two-threaded child processes on a 20-core machine will show 100% total cpu utilization, and trigger a 100% total cpu check, even if each process is using only 20% of the CPU. Total cpu should really show only 20% utilization (10 processes, 20 threads in total, on a 20-core machine, each running at 20% utilization).

  5. A Thinking Ape reporter

    I believe the root cause of this bug is in how you calculate cpu_total. https://bitbucket.org/tildeslash/monit/src/215da7aa86fdd33040980deaa1aff73189dd6e00/src/process.c?fileviewer=file-view-default

    Line 108.

    Here, you add up the percentage usages of all child processes into the parent. However, you don't adjust/compute the parent's divisor to account for number of child processes or threads.

    So each child can be operating at, say, 20% utilization, but if you have 5 or more of them (even on a machine with, say, 100 processors), the total will always come out as 100%.
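
A minimal Python model of the difference between the two calculations, offered as a sketch rather than as what process.c actually does. It assumes each process reports its usage as a percentage of its own thread capacity and that the summed total is capped at 100. `Proc`, `summed_total`, and `thread_weighted_total` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Proc:
    cpu_pct: float  # percent of this process's own capacity (assumed convention)
    threads: int


def summed_total(procs):
    """Add the per-process percentages and cap at 100 (the reported behaviour)."""
    return min(100.0, sum(p.cpu_pct for p in procs))


def thread_weighted_total(procs):
    """Weight each percentage by thread count, then divide by all threads."""
    total_threads = sum(p.threads for p in procs)
    if total_threads == 0:
        return 0.0
    return sum(p.cpu_pct * p.threads for p in procs) / total_threads


# The two-threaded example from earlier in the thread: a parent plus
# 9 children, two threads each, every process at 20%.
tree = [Proc(cpu_pct=20.0, threads=2) for _ in range(10)]
print(summed_total(tree))           # 100.0
print(thread_weighted_total(tree))  # 20.0
```

With the plain sum, any five processes at 20% already reach the cap, regardless of how many processors the machine has. The thread-weighted version stays at 20% however many such processes are added.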

  6. A Thinking Ape reporter

    FWIW, if you wanted to solve the original issue of detecting pegged child processes as well, you'd want to introduce a new check, like:

    "if per cpu..." that triggers if any one of the child or parent processes goes above the given threshold.

    "if cpu..." would trigger if the parent process goes above threshold

    "if total cpu..." would trigger if the aggregate goes above threshold (based on a corrected calculation)

    "if per cpu..." would trigger if any of the processes went above the threshold

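The three checks proposed above can be contrasted with a short Python sketch. Note that "if per cpu" is only the suggestion made in this comment, not existing monit syntax, and the function names and sample numbers below are invented for illustration. The aggregate is taken as a plain average, which assumes every process is single-threaded.

```python
def check_cpu(parent_pct, limit):
    """'if cpu': only the parent process is compared with the limit."""
    return parent_pct > limit


def check_total_cpu(aggregate_pct, limit):
    """'if total cpu': the aggregate over parent and children is compared."""
    return aggregate_pct > limit


def check_per_cpu(parent_pct, child_pcts, limit):
    """Proposed 'if per cpu': fires if any single process exceeds the limit."""
    return any(pct > limit for pct in [parent_pct, *child_pcts])


# Hypothetical sample: a quiet parent and one pegged child.
parent = 5.0
children = [10.0, 95.0, 10.0]
everyone = [parent, *children]
aggregate = sum(everyone) / len(everyone)  # 30.0, single-threaded assumption

limit = 90.0
print(check_cpu(parent, limit))                # False
print(check_total_cpu(aggregate, limit))       # False
print(check_per_cpu(parent, children, limit))  # True
```

Only the per-process check catches the pegged child here, which is the case the comment is concerned with.
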
  7. Tildeslash repo owner

    Hello, the problem is fixed.

    If you want, you can test the development version:

    wget https://bitbucket.org/tildeslash/monit/get/master.tar.gz
    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    
  8. A Thinking Ape reporter

    Hi, 5.24.0 appears to have CPU % calculation broken (for total CPU). It is showing 0% CPU utilization in the cases above (child processes with activity).

  9. A Thinking Ape reporter
    • changed status to open

    5.24.0 appears to have CPU % calculation broken (for total CPU). It is showing 0% CPU utilization in the cases above (child processes with activity).

  10. Tildeslash repo owner

    I'm sorry, the problem is fixed now, you can test if you want:

    wget https://bitbucket.org/tildeslash/monit/get/master.tar.gz
    tar -xzf master.tar.gz
    cd tildeslash*
    ./bootstrap
    ./configure
    make
    