Strange behaviour of total bytes monitoring on a network interface

Issue #172 on hold
Louis Sautier created an issue

Hi, I enabled monitoring of my eth0 interface and set an alert threshold at 90 GB per hour. Every time I receive a problem alert, I receive a recovery afterwards. This is what my log looks like: I do not understand how this computation could be right but maybe I'm missing something.

[CET Mar 12 12:09:12] error    : 'eth0' total upload 95.5 GB matches limit [upload rate > 90.0 GB in last 1 hour]
[CET Mar 12 12:10:17] info     : 'eth0' total upload check succeeded [current upload rate 44.0 GB in last 1 hour]
[CET Mar 12 12:37:14] error    : 'eth0' total upload 100.7 GB matches limit [upload rate > 90.0 GB in last 1 hour]
[CET Mar 12 12:38:19] info     : 'eth0' total upload check succeeded [current upload rate 54.6 GB in last 1 hour]

Comments (16)

  1. Louis Sautier reporter
    $ monit -V
    This is Monit version 5.12
    Copyright (C) 2001-2015 Tildeslash Ltd. All Rights Reserved.
    $ uname -m
    x86_64
    
  2. Tildeslash repo owner

    Please can you run monit in verbose mode and log output:

    1.) stop monit

    2.) start it in debug mode (make also sure monit logging is enabled - the "set logfile" statement is set):

    monit -v
    

    As soon as the problem will occur, please collect the monit log and send it to support@mmonit.com

  3. Tildeslash repo owner

    Hello Louis,

    yet thinking about it - the problem could be caused by long poll cycle ("set daemon" statement) ... the network statistics updates field of absolute values once per minute and when you need to read the value, it's sufficient to just subtract the current value and the previous value at index "now - 60 minutes" (with some magic for crossing the beginning of the hour).

    If your monit poll time is > 60 seconds (or monit is not able to check once per minute due to blocking on some other test), some values won't be updated once per minute and will retain old value from previous hour(s), so the difference then could be large, as it compares total of not just last hour, but last 2 or more hours.

    => please check your "set daemon" settings, it should be ca. 30 seconds at maximum for the network total upload test to work correctly.

  4. Louis Sautier reporter

    My set daemon setting is what I believe to be the default: 60 seconds. I'll set it to 30 seconds and see if the issue is still present.

  5. Tildeslash repo owner

    OK, that explains the problem ... the poll cycle makes monit sleep for 60s after all tests are finished => it's not exactly once per minute, but rather:

    1. test everything" (takes some time)
    2. sleep 60 seconds
    3. goto 1
    

    We plan to refactor the test scheduler in the near future to use non-blocking and support wall-clock driven tasks (will solve the update requirement for network test too).

    We have updated the default configuration file to use 30 seconds poll time + updated the manual for network tests to note the poll time requirement.

  6. Tildeslash repo owner

    fixed Issue #172 The total total upload/download test for last hour doesn't work if the value is not updated once per minute.

    Changed the default poll time in monit configuration file from 60 to 30 seconds and added note for these tests to the manual.

    → <<cset a5d70510de3f>>

  7. Tildeslash repo owner

    Update to 5.12.1 won't help - there were no changes in the network monitoring between 5.12 and 5.12.1.

    Please can you provide debug data for the whole hour? (21:49 - 22:50). If the monitoring cycle was delayed for some reason (timeout) and the data were not updated in 1 minute (even though the monitoring cycle is 30 seconds), then it would keep the old data in the statistics array.

    Has the frequency of the issue changed since you switch to 30 seconds cycle? (compared to previous 60 seconds cycle)

  8. Louis Sautier reporter

    Here are the log entries for eth0, I can provide the rest if necessary. I think that the frequency of these errors diminished but it's hard to tell because I have not been using this feature for very long (obviously).

  9. Tildeslash repo owner

    Thanks for data.

    The monitoring was really blocked by some test (for example on connection timeout of other service test) ... the statistics were not updated at 21:50 (update at 21:49:52, then no update until 21:51:02):

    [CET Mar 17 21:49:52] debug    : 'eth0' upload errors check succeeded
    [CET Mar 17 21:49:52] debug    : 'eth0' download errors check succeeded
    [CET Mar 17 21:49:52] debug    : 'eth0' total upload check succeeded [current upload rate 36.7 GB in last 1 hour]
    [CET Mar 17 21:49:52] debug    : 'eth0' total download check succeeded [current download rate 12.1 GB in last 1 hour]
    [CET Mar 17 21:51:02] debug    : 'eth0' link data gathering succeeded
    

    => when the "now-60 minutes" value is computed, it reads data from 21:50 slot, which was updated and contains the counter state from most probably 20:50 (as the data are ca. double of normal throughput => difference for 2 hours instead of just last hour).

    The root cause is the same as described in previous update - the statistics was not updated. The problem is caused by current test scheduler design (blocking), we'll fix this problem with the new scheduler.

    Will put the issue on hold until new scheduler work has started.

    As a workaround you can lower the test interval further - 10 seconds should be sufficient, as if monit won't sleep for 30 seconds in the 21:49:52 - 21:51:02 interval, it would get the statistics for 21:50. The workaround however doesn't guarantee that the statistics will be always collected and still can be blocked by some test timeout.

  10. Tildeslash repo owner

    putting on hold ... the new non-blocking scheduler will collect statistics once per minute independently of monit poll cycle

  11. Louis Sautier reporter

    Instead of trying to get the statistics for now - 1hour, couldn't you get use the first timestamp matching t >= (now - 1hour)? And using this timestamp, estimate the bandwidth used by computing an average like this : avg = 3600*(bw(now) - bw(t))/(now - t).

  12. Tildeslash repo owner

    The array doesn't store the timestamp ... just the value. The key to the array is current minute => we cannot compare timestamps - this approach is faster then searching for timestamp and also saves memory (no need to store the timestamp for each sample), but depends on statistics update on time. Unfortunately the current test scheduler model doesn't support wall-clock operations, as mentioned the new scheduler will fix the problem.

  13. Log in to comment