Monit version 5.30.0 - FOR CYCLES not respected

Issue #1030 new
Kölner Börsenverein created an issue

Hi everyone,

this is really just a great project, and we love this monit project! This lightweight tool is just really great! It is amazing! Thanks to all of you!

We still encounter the same behaviour (issue #64, issue #994).
This is Monit version 5.30.0

It works fine, BUT as soon as the check is triggered, the "FOR 4 CYCLES then alert" is no longer respected, and Monit alerts on every failed connection at a 60s interval. This is especially a problem because the app takes about 10 minutes to restart.
Any help would be great and we would love to work out a final solution for all of us!

```
check host gRPC with address 127.0.0.1
  NOT EVERY "01-59 18 * * *"
  if failed
    port 2001 protocol http request /status FOR 4 CYCLES then alert

check program mdchecks with path "/etc/monit/scripts/mdcheck.sh"
  NOT EVERY "20-40 18 * * *"
  restart program = "/bin/systemctl restart app.service"
  if status != 0 FOR 4 CYCLES then alert
  IF status != 0 FOR 16 CYCLES THEN RESTART
  #if 2 restarts within 2 cycles then unmonitor
  depends on appd
```

Comments (5)

  1. Lutz Mader

    Hello,
    you can add “repeat every <n> cycles” to an “exec” action to space out the retries; see https://mmonit.com/monit/documentation/monit.html#ACTION

    However, this option is available for the “exec” action only.

    Unfortunately, “for <n> cycles” does not reset the counter, so the action is repeated on every cycle once the number of cycles has been reached; see https://mmonit.com/monit/documentation/monit.html#FAULT-TOLERANCE

    One question: are you using the “reminder” option with “set alert”? Without it, alerts are sent only once after a match, I think; see https://mmonit.com/monit/documentation/monit.html#Setting-an-error-reminder

    A suggestion only,
    Lutz

    p.s.

    You should switch to monit 5.31.0; there are some command-line problems in monit 5.30.0.

  2. Kölner Börsenverein reporter

    Hello @Lutz Mader,

    thank you very much for your ideas and for helping to solve the problems we recently encountered.

    > But, this option is available for the “exec” action only.

    This could be a workaround, but unfortunately it is not really helpful for us. I will link this discussion for reference and more ideas in case someone else bumps into this issue as well.

    The documentation does not mention that “for <n> cycles” does not reset the counter, so that the action is repeated on every cycle once the number of cycles has been reached. That was very hard for us to understand at first. Another very good point is the behavior described in the section you linked, which we were not aware of:

    For example if every second cycle fails (1-0-1-0-1-0-...), then "for 2 cycles" condition will never match, despite the service having problems. The following statement will catch such a state:

     if failed
        port 80
        for 3 times within 5 cycles
     then alert
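
To illustrate the difference between the two conditions, here is a minimal Python sketch. It is a simplified model and not Monit's actual implementation: it assumes “for <n> cycles” means the last n checks all failed, and “<x> times within <y> cycles” means at least x failures among the last y checks. The function names `for_n_cycles` and `x_times_within_y_cycles` are made up for this example.

```python
from collections import deque


def for_n_cycles(results, n):
    """Return the 1-based cycles where the last n checks all failed.

    results is a sequence of booleans, True meaning the check failed.
    Simplified model of "for <n> cycles" (an assumption, not Monit code).
    """
    matches = []
    streak = 0
    for cycle, failed in enumerate(results, start=1):
        # A single successful check resets the streak of failures.
        streak = streak + 1 if failed else 0
        if streak >= n:
            matches.append(cycle)
    return matches


def x_times_within_y_cycles(results, x, y):
    """Return the 1-based cycles where at least x of the last y checks failed.

    Simplified model of "<x> times within <y> cycles" (an assumption).
    """
    window = deque(maxlen=y)  # keeps only the most recent y results
    matches = []
    for cycle, failed in enumerate(results, start=1):
        window.append(failed)
        if sum(window) >= x:
            matches.append(cycle)
    return matches


alternating = [True, False] * 5   # 1-0-1-0-... over ten cycles
steady = [True] * 6               # six failed checks in a row

print(for_n_cycles(alternating, 2))                # []
print(x_times_within_y_cycles(alternating, 3, 5))  # [5, 7, 9]
print(for_n_cycles(steady, 4))                     # [4, 5, 6]
```

With the alternating pattern, the consecutive rule never matches, while the windowed rule matches on every odd cycle from the fifth one on. With a steady failure, “for 4 cycles” keeps matching on every cycle from the fourth one on. Whether a notification is sent on each of those matches is a separate question that depends on the action and the alert settings.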
    

    > One question: are you using the “reminder” option with “set alert”? Without it, alerts are sent only once after a match, I think; see https://mmonit.com/monit/documentation/monit.html#Setting-an-error-reminder

    We just set alerts “on everything” to get notifications and find solutions. This is why we use and love monit on our server.

    As we have only just started to understand the ‘monit behavior’, we will get back to you with questions, if that is OK with you. Maybe we will just send a DM?
    Another question: how many hours would it take to implement the desired functionality, ‘reset the counter and respect the cycles’?
    Could the ‘community’ pay to get this fixed? If so, how would that work?
    Many thanks for getting back to us! We really appreciate all of your effort and ideas!
    Thank you Lutz Mader!

  3. Lutz Mader

    Hello,
    you can also request additional support for Monit and/or M/Monit.
    This is easy to do. As a suggestion only, see https://mmonit.com/shop/
    for more information on how to pay for additional service/support.

    > Could the ‘community' pay for this getting fixed?
    > How does this work if that works?

    You also get some more useful functions for handling alarms, and a central point from which to control your whole environment.

    A suggestion only, with regards,
    Lutz

  4. Lutz Mader

    Hello,
    based on your configuration snippet and your question, I used the following configuration (with monit 5.29.0 and 5.31.0) to run some tests.

    check host host_5001 with address host12345.intern
      every 1 cycles
      if failed port 5001 with timeout 30 seconds for 3 cycles then alert
    
    set alert lutz.mader@intern
    

    I got the following monit log entries.

    [2022-03-14T11:30:56MEZ] debug    : 'host_5001' succeeded testing protocol [DEFAULT] at [host12345.intern]:5001 [TCP/IP] [response time 0.494 ms]
    [2022-03-14T11:30:56MEZ] debug    : 'host_5001' connection succeeded to [host12345.intern]:5001 [TCP/IP]
    [2022-03-14T11:31:17MEZ] warning  : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:32:17MEZ] warning  : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:33:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    (The alarm mail was sent at this point.)
    [2022-03-14T11:34:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:35:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:36:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:37:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:38:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:39:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:40:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:41:17MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:42:05MEZ] error    : 'host_5001' failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    [2022-03-14T11:43:05MEZ] debug    : 'host_5001' succeeded testing protocol [DEFAULT] at [host12345.intern]:5001 [TCP/IP] [response time 0.525 ms]
    [2022-03-14T11:43:05MEZ] info     : 'host_5001' connection succeeded to [host12345.intern]:5001 [TCP/IP]
    (The resolve mail was sent at this point.)
    [2022-03-14T11:44:05MEZ] debug    : 'host_5001' succeeded testing protocol [DEFAULT] at [host12345.intern]:5001 [TCP/IP] [response time 0.601 ms]
    [2022-03-14T11:44:05MEZ] debug    : 'host_5001' connection succeeded to [host12345.intern]:5001 [TCP/IP]
    

    I got only two mails for the failing connection: the first was the error alarm and the second was the resolve.

    Monit alert
    Service host_5001 event Connection failed at Mon, 14 Mar 2022 11:33:17 on SYSTEM01 failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    Yours sincerely, Monit
    
    Monit alert
    Service host_5001 event Connection succeeded at Mon, 14 Mar 2022 11:43:05 on SYSTEM01 connection succeeded to [host12345.intern]:5001 [TCP/IP]
    Yours sincerely, Monit
    

    With a little configuration change, I received some more error alarm mails.

    set alert lutz.mader@f-i.de with reminder on 4 cycles
    

    The alarm mail should be resent after four cycles.

    Monit alert
    Service host_5001 event Connection failed at Mon, 14 Mar 2022 12:21:36 on SYSTEM01 failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    Yours sincerely, Monit
    
    Monit alert
    Service host_5001 event Connection failed at Mon, 14 Mar 2022 12:25:37 on SYSTEM01 failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    Yours sincerely, Monit
    
    Monit alert
    Service host_5001 event Connection failed at Mon, 14 Mar 2022 12:29:37 on SYSTEM01 failed protocol test [DEFAULT] at [host12345.intern]:5001 [TCP/IP] -- Connection refused
    Yours sincerely, Monit
    

    This is how the "reminder" should work.

    In short:
    I received alarm mails for the error and the resolve only.
    Every time the configuration was reloaded, the counters were initialized as well, so the alarm was triggered again.
    And with a reminder, the alarms are sent again as well.

    Works as expected with monit 5.29.0 and 5.31.0, I think.
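
The alert timing seen in this test can be sketched in Python. This is only a model of the observed behaviour (one mail when the condition first matches, an optional reminder every few cycles while the failure persists, one mail on recovery), not Monit's code. `alert_events` and its parameters are hypothetical names, and Monit may count reminder cycles slightly differently.

```python
def alert_events(results, threshold, reminder=0):
    """Return (cycle, event) tuples for a sequence of check results.

    results   -- booleans, True meaning the check failed in that cycle
    threshold -- consecutive failures needed before the first alert
    reminder  -- resend the alert every `reminder` cycles (0 disables it)

    Simplified model of the behaviour observed in the test, not Monit code.
    """
    events = []
    streak = 0            # consecutive failed checks so far
    failed_state = False  # True once the error alert has been sent
    since_alert = 0       # failed cycles since the first error alert
    for cycle, failed in enumerate(results, start=1):
        if failed:
            streak += 1
            if not failed_state and streak >= threshold:
                failed_state = True
                since_alert = 0
                events.append((cycle, "failed"))
            elif failed_state:
                since_alert += 1
                if reminder and since_alert % reminder == 0:
                    events.append((cycle, "reminder"))
        else:
            streak = 0
            if failed_state:
                failed_state = False
                events.append((cycle, "succeeded"))
    return events


# Same shape as the log in this comment: one good check,
# twelve failed checks, then two good checks.
log = [False] + [True] * 12 + [False] * 2

print(alert_events(log, threshold=3))
# [(4, 'failed'), (14, 'succeeded')]
print(alert_events(log, threshold=3, reminder=4))
# [(4, 'failed'), (8, 'reminder'), (12, 'reminder'), (14, 'succeeded')]
```

Without a reminder the model produces exactly two events, matching the two mails received. With a reminder of four cycles the error event repeats at four-cycle intervals until the service recovers.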

    With regards,
    Lutz

  5. Lutz Mader

    Hello,
    I cannot reproduce your problem; from my point of view the alert handling works as expected. Try to collect some more data so that there is more information to help find your problem.

    Hello Tildeslash,
    based on my investigation I found a little glitch: there is no timeout for the test exec actions. I added some code that adds timeout handling to the exec command. If anyone is interested in this, please let me know and I will open a pull request.

    With regards,
    Lutz

    p.s.
    If you use “monit -v”, more information is written to the monit log file. You should also have a look at the “set alert” statements.
