Total system CPU includes 'wait', which is confusing and inconvenient
CPU frequently needs monitoring to detect CPU 'burning', but your notion of 'total CPU' includes [io]wait time:
CPU([user|system|wait]) is the percent of time the system spend in user or kernel space and I/O. The user/system/wait modifier is optional, if not used, the total system cpu usage is tested. [1]
[IO]wait time is essentially a kind of idle time. Please provide a way to test true CPU load without any idle time included.
The presence of separate tests for CPU(user) and CPU(system) does not make up for the absence of a true total CPU test, since only user+system (which, I guess, should include irq time, but you don't document it, so I'm not sure) constitutes true CPU load. Idle and iowait times are in essence CPU halt time, so they are not interesting when the task is to detect CPU spinning.
For example: if I want to detect 90% CPU load, what should I monitor? cpu(user) > 90% and cpu(system) > 90% would not catch the case where both are at 45% (which is a true CPU load of 90%). If I set cpu > 90%, then when, for example, a software RAID check is running, iowait can be very high, triggering a false-positive alert.
There is no way to monitor actual CPU load with monit. Please provide a method to do so.
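To make the example concrete, here is a sketch of the tests available today (hypothetical host name; syntax as documented in the linked manual), neither of which detects the 45%+45% case without false positives:

```
check system myhost.example.com
    # each per-component test fires only if its own component
    # exceeds 90%; user=45% + system=45% triggers neither
    if cpu usage (user) > 90% then alert
    if cpu usage (system) > 90% then alert
    # the combined test also counts iowait, so a software RAID
    # check can trip it with almost no real CPU work
    if cpu usage > 90% then alert
```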
[1] https://mmonit.com/monit/documentation/monit.html#RESOURCE-TESTING
Comments (7)
repo owner - changed status to closed
The CPU wait is a problem too ... if the process is blocked waiting for I/O, the application is slow - normal CPU wait% should be close to zero. A simple total CPU usage test which includes usr+sys+wait will catch all kinds of CPU-state-related problems.
As you noted, there are also modifiers which you can use to select just usr/sys/wait ... we don't plan to add a usr+sys-only combination.
-
reporter Please reconsider adding user+system as a true CPU load metric, because it is a completely different metric than user+system+wait.
For example, 'idle' time can also mean "the process is blocked by network I/O, the application becomes slow". But you don't add that into total CPU, do you? IOwait is idle time spent on disk I/O (actually just uninterruptible sleep). People invented these as different metrics, not by accident, to measure different things - and that distinction is being thrown away.
See, when user+system is high, the CPU is consuming power (watts - this can be very important for a colocated server, because electricity is billed separately), so it is a different problem than 'application becomes slow'. User+system and wait can safely be measured in separate tests, because iowait has a different cause (disk I/O) than CPU resource exhaustion. But measuring user and system time separately is rarely useful. Your configuration allows measuring user and system separately, which is less useful than measuring them together.
Why not add this flexibility for real use cases? You write about a dialectical approach to change and innovation, yet decline to implement a simple but useful feature?
-
reporter - changed status to open
Please read my new comment and reconsider.
-
reporter A couple more arguments that wait time is nothing more than a type of idle time.
1. If you use a networked filesystem whose disk I/O is delayed, you will not see wait time at all. That "application is slow" case goes straight into idle time, because waiting on network functions is sleep (S state).
2. You can do a simple experiment yourself. Install fio and any CPU burner, like stress-ng. Then run fio with this config (adapted to a 12-core system with SATA disk /dev/sda) on an otherwise idle system:

```
[global]
ioengine=sync
direct=1
norandommap=1
filename=/dev/sda
runtime=10000

[random-read]
rw=randread
bs=4K
iodepth=8
numjobs=16
```
Note the fio speeds in its output:

```
Jobs: 16 (f=16): [r(16)] [0.4% done] [491KB/0KB/0KB /s] [122/0/0 iops] [eta 02h:45m:56s]
```
This creates a considerable amount of iowait time; top output:

```
%Cpu0  :  0.3 us,  0.3 sy,  0.0 ni,  0.0 id, 99.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  :  3.3 us,  2.0 sy,  0.0 ni,  0.0 id, 94.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :  9.4 us,  0.7 sy,  0.0 ni, 18.7 id, 71.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :  0.3 us,  0.3 sy,  0.0 ni, 71.7 id, 27.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :  0.3 us,  0.0 sy,  0.0 ni, 76.3 id, 23.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  :  8.3 us,  0.3 sy,  0.0 ni, 47.2 id, 44.2 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu6  :  0.0 us,  0.3 sy,  0.0 ni, 98.7 id,  1.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :  0.0 us,  0.0 sy,  0.0 ni,  0.0 id,100.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :  0.0 us,  0.3 sy,  0.0 ni,  0.0 id, 99.7 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  :  0.3 us,  0.3 sy,  0.0 ni,  0.0 id, 99.3 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 :  1.3 us,  0.3 sy,  0.0 ni, 27.8 id, 70.5 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 :  0.0 us,  0.7 sy,  0.0 ni,  0.0 id, 99.3 wa,  0.0 hi,  0.0 si,  0.0 st
```
Then run the CPU burner (stress-ng --cpu 12 in my case); top output:

```
%Cpu0  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu1  : 96.7 us,  2.7 sy,  0.0 ni,  0.7 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu2  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu3  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu4  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu5  : 99.7 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.3 si,  0.0 st
%Cpu6  : 99.3 us,  0.7 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu7  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu8  :100.0 us,  0.0 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu9  : 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu10 : 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
%Cpu11 : 99.7 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
```
Where did those iowaits go? Let's check the fio output to confirm it has not slowed down:

```
Jobs: 16 (f=16): [r(16)] [1.6% done] [556KB/0KB/0KB /s] [139/0/0 iops] [eta 02h:43m:58s]
```
This is all because wait time is essentially just idle time during which there are outstanding block-device requests.
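The metric being asked for is easy to state precisely. A minimal Python sketch (assuming the Linux /proc/stat 'cpu' line layout documented in proc(5): user nice system idle iowait irq softirq steal) that counts iowait as idle, not as load:

```python
# Compute "true" CPU load between two /proc/stat 'cpu' samples,
# treating both idle and iowait as halt time.
def cpu_load(sample1, sample2):
    """Percent of elapsed ticks spent busy (user+nice+system+irq+softirq+steal)."""
    def split(line):
        fields = [int(x) for x in line.split()[1:9]]
        halted = fields[3] + fields[4]   # idle + iowait
        busy = sum(fields) - halted      # everything actually using the CPU
        return busy, halted

    b1, h1 = split(sample1)
    b2, h2 = split(sample2)
    total = (b2 + h2) - (b1 + h1)        # total ticks elapsed
    return 100.0 * (b2 - b1) / total if total else 0.0

# Synthetic samples: 90 busy ticks and 10 iowait ticks elapsed,
# so true load is 90% even though iowait grew.
s1 = "cpu 100 0 100 1000 50 0 0 0"
s2 = "cpu 160 0 130 1000 60 0 0 0"
print(cpu_load(s1, s2))  # prints 90.0
```

A user+system+wait metric applied to the same samples would report 100%, which is exactly the false positive described above.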
-
repo owner - changed status to on hold
Will look into it later.
-
repo owner - removed version
Removing version: 5.14 (automated comment)