Monit does not reset its counter when `check program` is combined with `exec`

Issue #1032 new
Yong Zhao created an issue

Recently we tried to use Monit to monitor the memory usage of a container on a Linux host. The service is configured as follows, and the Monit polling cycle in our setup is 60 seconds.

check program container_memory_<container_name> with path "/path/of/check_script <container_name> <memory_threshold_value_in_bytes>"
   if status == 3 for 10 times within 20 cycles then exec "/path/of/restart_script <container_name>"

Specifically, Monit invokes check_script each cycle to check whether the container's memory usage exceeds the threshold. If that condition holds 10 times within 20 cycles (that is, 20 minutes), restart_script is invoked to restart the corresponding container.

We found that the container is restarted when the condition is triggered. However, if the memory usage of the restarted container then climbs from around 100 MB back above the threshold within 60 seconds, the container is not restarted again: over the following hour it was never restarted, and its memory usage kept growing, up to 11 GB.

After some debugging, I think this failure occurs because Monit cannot reset its counter. One potential reason is that Monit resets its counter if and only if the status of the monitored service changes from Status failed to Status ok. Could you kindly check whether my understanding is correct?

I see that another syntax, repeat every <n> cycles, was recently introduced for exec. But I am wondering whether we need a fundamental fix for Monit not being able to reset its counter.

Overall this is an awesome project and we use lots of wonderful Monit features. Thank you!

Comments (4)

  1. Lutz Mader

    Hello Yong Zhao,
    in short:
    You are right, from the monitor's point of view the problem persists. As long as the status value is still 3, the command will not be executed again. With "repeat every <n> cycles" you can run your command again and again, after the given number of cycles.

    With regards,
    Lutz

    p.s.
    Any reason to use Monit 5.20.0?

  2. Yong Zhao reporter

    Thanks so much for your clarification, Lutz!

    May I ask whether this issue has been fixed in the latest release, so that we can upgrade?

  3. Lutz Mader

    Hello Yong Zhao,
    your description is right, this is how the monitor works.

    > May I ask whether this issue was fixed in our latest release such that we can do upgrade?

    The only way to execute the command again is to add "repeat every <n> cycles" to the "exec". The problem is that, as long as the status value is still 3, the internal counter will not be reset. You could use a smaller interval, but I do not think that is a good idea (one minute is a good choice).
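
    A minimal sketch of the "repeat every <n> cycles" approach, applied to the configuration from the original report. The interval of 5 cycles is an arbitrary value chosen for illustration, not a recommendation, and the paths and placeholders are copied unchanged from the report:

    ```
    # Same check as in the report; only the "repeat every" clause is new.
    check program container_memory_<container_name> with path "/path/of/check_script <container_name> <memory_threshold_value_in_bytes>"
       # Run the restart script when the check has returned 3 for 10 of the last 20 cycles,
       # and run it again every 5 cycles (assumed value) for as long as the failure persists.
       if status == 3 for 10 times within 20 cycles then exec "/path/of/restart_script <container_name>" repeat every 5 cycles
    ```

    With this in place, the restart script should be run again while the status stays at 3, instead of only once. The repeat interval should be chosen so that the container has enough time to come up before the next restart.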

    There is no way to “reset” an internal counter (a reload will do this, but you should not reload the configuration to do this).

    I use additional status values and tests to handle similar problems.

    # The recent CPU usage for the JVM process.
    check program Server1_CpuLoad with path "/usr/local/etc/monit/scripts/wlpmp.sh test Server1 base_cpu_processCpuLoad_percent gt 15"
      with timeout 20 seconds
      every 2 cycles
      if status > 2 then exec "/usr/local/etc/monit/scripts/zexec.sh Error"
      if status = 2 then exec "/usr/local/etc/monit/scripts/zexec.sh Warning"
      if status > 1 then exec "/usr/local/etc/monit/scripts/zexec.sh Info"
         else if succeeded then exec "/usr/local/etc/monit/scripts/zexec.sh Ok"
    #
    

    This sample starts a script with options that depend on the status value returned by the program script.

    A suggestion only,
    with regards,
    Lutz
