monit cpu usage is nonsense on production environment

Issue #230 resolved
Ilya Trushchenko created an issue

I can't reliably monitor and kill a runaway process on a production server with 48 CPUs. The process is stuck at 100% CPU, but monit reports 100/48 = 2.08% CPU usage. These numbers are extremely unreliable and depend entirely on the number of cores in the server (so I can't even deploy the same config to servers with different CPU counts). And the last nail in the coffin: monit does not allow floats in CPU thresholds! So if I want to kill a process when it uses >70% CPU, I can only choose between 48 * 1 = 48% and 48 * 2 = 96%... Please provide something like a RAW CPU value that is not divided by the number of cores, so that Monit (and MMonit) can be used in a real production environment. Thank you.
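
To make the arithmetic above concrete, here is a minimal Python sketch of the scaling as described in this report. It is not Monit's code: the function names `legacy_monit_cpu` and `reachable_top_thresholds` are hypothetical, and the only assumptions are the ones stated above (usage is divided by the core count, and thresholds must be whole percentages).

```python
def legacy_monit_cpu(top_percent: float, cores: int) -> float:
    """Convert a top-style percentage (0..100*cores) into the
    whole-machine percentage (0..100) described in this issue."""
    return top_percent / cores


def reachable_top_thresholds(cores: int, max_top_percent: float = 100.0) -> list[int]:
    """List the top-style values that whole-number thresholds (1%, 2%, ...)
    correspond to, up to max_top_percent (one fully busy core by default)."""
    steps = int(max_top_percent // cores)
    return [n * cores for n in range(1, steps + 1)]


# One fully busy core on a 48-core server.
print(round(legacy_monit_cpu(100, 48), 2))  # 2.08

# The only single-core thresholds that can be expressed on 48 cores.
print(reachable_top_thresholds(48))  # [48, 96]
```

On a 4-core machine the same helper yields steps of 4%, which is why a config tuned on one server does not carry over to another.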

Comments (13)

  1. Tildeslash repo owner

    Traditional CPU usage means 0-400% on a 4-core system. We use 0-100%, independent of the number of cores. For systems like Node, which mostly runs on one core, the traditional format might make more sense, because it is easier to see whether a program is using 100% of a CPU core. Because we need to maintain backward compatibility, a new traditional keyword would have to be introduced, like

     if cpu usage traditional > 300% then alert
    

    Makes sense?

  2. Ilya Trushchenko reporter

    Yep. It would be nice if it also included child CPU usage. And in MMonit, it would help to have a settings flag such as "use traditional CPU for graphs".

  3. Tildeslash repo owner

    We will implement automatic scaling of the CPU usage based on the number of threads of the monitored application vs. the available CPUs.

    100% CPU usage will mean that all threads are utilizing the maximum available resources. For example, the maximum for a single-threaded application will equal 100% use of a single CPU core. Multi-threaded applications will follow a similar formula based on their number of threads (the number of threads may also be higher than the number of available CPU cores; in that case it will be the aggregate of all threads vs. usage of all available CPU cores).

    Putting this on hold until implementation begins.

  4. Ilya Trushchenko reporter

    If I have a 32-core server and nginx with 12 threads uses 250% CPU according to top, what will monit show in this case? What will it show on a 16-core server?

  5. Tildeslash repo owner

    In top's terminology these 12 threads use 2.5 cores in total, so in monit's new dynamic scaling that means 100 * 2.5 / 12 = 20.8% CPU usage. The result is the same on the 16-core server, because 12 threads still fit within the available cores.

    If all 12 threads run at 100%, it will show up as 1200% in top's terminology (on both the 16-core and the 32-core machine), but "only" as 800% on an 8-core machine.

    In monit's terminology the value scales automatically up to 100%, which makes the number of machine cores an internal factor, i.e. there is no need to think about the number of cores as in the "traditional" syntax, so monit's numbers will read naturally.

  6. Tildeslash repo owner

    Fix Issue #230: The process CPU usage calculation now reflects the number of process threads. Originally monit showed a process's CPU usage as its fraction of all available CPU resources (the number of CPU cores). For single-threaded applications that was tricky, however, because such a process can utilize only one CPU core, and if it was running at its limit on an 8-core machine, monit showed 12.5% CPU utilization (100/8). If you wanted to check that the process was not stuck at 100%, you had to adjust the limit to reflect the number of CPU cores on the machine. Monit now calculates the CPU usage based on the number of threads vs. the available CPU cores. If the process has one thread, 100% CPU usage equals 100% utilization of one CPU core. If it has 2 threads, 100% CPU usage is reported when it uses 2 CPU cores at 100%, etc. If the process has more threads than the machine has available CPU cores, then 100% CPU usage corresponds to utilization of all available CPU cores.

    → <<cset 215da7aa86fd>>
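
The rule described in the fix can be sketched in a few lines of Python. This is an illustration only, not the code from the changeset: the names `scaled_cpu` and `legacy_cpu` are hypothetical, and the divisor `min(threads, cores)` is an assumption derived from the description above (100% means every thread, capped at the number of cores, saturates a core).

```python
def legacy_cpu(top_percent: float, cores: int) -> float:
    """Old behaviour as described above: fraction of all CPU cores."""
    return top_percent / cores


def scaled_cpu(top_percent: float, threads: int, cores: int) -> float:
    """Thread-scaled usage: top-style percentage divided by the number of
    cores the process can actually occupy, i.e. min(threads, cores)."""
    if threads < 1 or cores < 1:
        raise ValueError("threads and cores must be positive")
    usable_cores = min(threads, cores)
    return top_percent / usable_cores


# Single-threaded process saturating one core of an 8-core machine.
print(legacy_cpu(100, 8))        # 12.5
print(scaled_cpu(100, 1, 8))     # 100.0

# nginx with 12 threads at 250% in top, on 32 and on 16 cores.
print(round(scaled_cpu(250, 12, 32), 1))  # 20.8
print(round(scaled_cpu(250, 12, 16), 1))  # 20.8

# 12 threads on an 8-core machine: top peaks at 800%, which maps to 100%.
print(scaled_cpu(800, 12, 8))    # 100.0
```

With this scaling, a threshold such as 70% means the same thing on every server, regardless of core count.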
