"check program" doesn't want to issue restart command

Issue #34 resolved
Artem Russakovskii created an issue

Hi,

Since sshd doesn't seem to create a pid file no matter what I try on my system (OpenSUSE 13.1), I've converted the check process script to check program, but I'm seeing several issues preventing it from working right. I'm on monit 5.8.

As per https://mmonit.com/monit/documentation/monit.html, for program status testing, action is a choice of "ALERT", "RESTART", "START", "STOP", "EXEC" or "UNMONITOR".

Unfortunately, it doesn't look like "restart" is working.

When the script looks like this:

  check program sshd with path "/usr/sbin/rcsshd status"
  start program = "/etc/init.d/sshd start"
  stop program = "/etc/init.d/sshd stop"
  restart program = "/etc/init.d/sshd restart"
  if status != 0 then restart
  if 5 restarts within 5 cycles then timeout

and I stop the sshd server, the program check detects the failure but then refuses to restart it. I'm seeing this in the log:

[PDT Apr 18 17:22:27] debug : monit: Start, stop or restart method not defined for process check 'sshd'

It looks like it's unable to see the restart program directive, or the start and stop ones, and refuses to do anything.

Why is this happening?

Comments (23)

  1. Artem Russakovskii reporter

    I also tried the following:

      check program sshd with path "/usr/sbin/rcsshd status"
      if status != 0 then exec "/etc/init.d/sshd restart"
      if 5 restarts within 5 cycles then timeout
    

    But I think it's not reading the full command in quotes, as in the log, I'm seeing this:

    [PDT Apr 18 17:44:34] info : 'sshd' exec: /etc/init.d/sshd

    without the "restart" bit.

    There is a chance that it does work though, in which case the log line isn't complete, because after the 5th try, the check finally succeeds. I think there's a race condition. Here's what happens:

    [PDT Apr 18 17:39:16] info     : Reinitializing monit - Control file '/etc/monitrc'
    [PDT Apr 18 17:39:16] info     : Shutting down monit HTTP server
    [PDT Apr 18 17:39:17] info     : monit HTTP server stopped
    [PDT Apr 18 17:39:17] info     : Starting monit HTTP server at [*:2812]
    [PDT Apr 18 17:39:17] info     : monit HTTP server started
    [PDT Apr 18 17:39:17] info     : 'forge' Monit reloaded
    [PDT Apr 18 17:39:20] info     : 'sshd' monitor on user request
    [PDT Apr 18 17:39:20] info     : monit daemon with PID 7454 awakened
    [PDT Apr 18 17:39:20] info     : Awakened by User defined signal 1
    [PDT Apr 18 17:39:20] info     : 'sshd' monitor action done
    [PDT Apr 18 17:39:25] info     : 'sshd' monitor on user request
    [PDT Apr 18 17:39:25] info     : monit daemon with PID 7454 awakened
    [PDT Apr 18 17:39:25] info     : Awakened by User defined signal 1
    [PDT Apr 18 17:39:25] info     : 'sshd' monitor action done
    [PDT Apr 18 17:41:29] error    : 'sshd' Checking for service sshd ..unused
    sshd.service - OpenSSH Daemon
       Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
       Active: inactive (dead) since Fri 2014-04-18 17:11:14 PDT; 29min ago
      Process: 6693 ExecStart=/usr/sbin/sshd -D $SSHD_OPT
    [PDT Apr 18 17:41:30] info     : 'sshd' exec: /etc/init.d/sshd
    [PDT Apr 18 17:42:31] error    : 'sshd' Checking for service sshd ..unused
    sshd.service - OpenSSH Daemon
       Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
       Active: inactive (dead) since Fri 2014-04-18 17:11:14 PDT; 30min ago
      Process: 6693 ExecStart=/usr/sbin/sshd -D $SSHD_OPT
    [PDT Apr 18 17:42:31] info     : 'sshd' exec: /etc/init.d/sshd
    [PDT Apr 18 17:43:33] error    : 'sshd' Checking for service sshd ..unused
    sshd.service - OpenSSH Daemon
       Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
       Active: inactive (dead) since Fri 2014-04-18 17:42:31 PDT; 2ms ago
      Process: 11415 ExecStart=/usr/sbin/sshd -D $SSHD_OPTS
    [PDT Apr 18 17:43:33] info     : 'sshd' exec: /etc/init.d/sshd
    
    [PDT Apr 18 17:44:34] error    : 'sshd' Checking for service sshd ..unused
    sshd.service - OpenSSH Daemon
       Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
       Active: inactive (dead) since Fri 2014-04-18 17:43:33 PDT; 1ms ago
      Process: 11713 ExecStart=/usr/sbin/sshd -D $SSHD_OPTS
    [PDT Apr 18 17:44:34] info     : 'sshd' exec: /etc/init.d/sshd
    [PDT Apr 18 17:45:35] error    : 'sshd' Checking for service sshd ..unused
    sshd.service - OpenSSH Daemon
       Loaded: loaded (/usr/lib/systemd/system/sshd.service; enabled)
       Active: inactive (dead) since Fri 2014-04-18 17:44:34 PDT; 1ms ago
      Process: 11946 ExecStart=/usr/sbin/sshd -D $SSHD_OPTS
    [PDT Apr 18 17:45:35] info     : 'sshd' exec: /etc/init.d/sshd
    [PDT Apr 18 17:46:36] info     : 'sshd' status succeeded
    

    Note the tiny 1-2ms in the "Active:" lines as well as the "inactive (dead)" bits. I think monit executes the restart command and then checks the program status too soon afterwards. It did somehow succeed in the end, but not before retrying a bunch of times.
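
    One possible way to tolerate that short window, sketched here under assumptions rather than taken from monit itself, is to wrap the status command so it retries a few times before reporting failure. The helper name `status_with_retry` and its arguments are hypothetical.

```shell
# Hypothetical helper: retry a status command a few times before giving up.
# Usage: status_with_retry "<status command>" [attempts] [delay_seconds]
status_with_retry() {
    status_cmd=$1
    attempts=${2:-5}
    delay=${3:-1}

    i=0
    while [ "$i" -lt "$attempts" ]; do
        # Succeed as soon as the status command reports a healthy service.
        if sh -c "$status_cmd" >/dev/null 2>&1; then
            return 0
        fi
        i=$((i + 1))
        # Sleep only between attempts, not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
    done
    return 1
}
```

    A small wrapper script that calls `status_with_retry "/usr/sbin/rcsshd status" 5 1` could then serve as the path in the check program entry. This only delays the failure report; it does not change how monit schedules the check.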

    Ideally, though, the restart program problem in the original ticket description above should be sorted out. It still bothers me that I can't seem to make that work.

    Any suggestions are welcome.

    Thank you.

  2. Artem Russakovskii reporter

    No, considering the main problem is described in the first comment and doesn't have to do with sync as well as the patch being rejected by the developer.

  3. Dohnuts

    << I think monit executes the restart command and then tries to check the program status right away too fast. >>

    maybe the request would be accepted if it fixes your bug

  4. Alexander Litvak

    I wanted to try #9 but it was failing to build for me. I contacted the contributor and he has not replied yet. Also, this only somewhat covers the second part of the issue. So far no one has explained why, when attempting a restart with a standalone program check or a standalone connection check, monit fails to do so, saying "Start, stop, or restart methods were not defined". This sounds like a clear bug to me.

  5. Dohnuts

    restart is broken.

    I used pull request #9 myself because it simplifies one of my setups, and it worked. But I have a lot of personal diffs, and maybe I fixed something for compilation (my fork is private and quite active), especially in cervlet.c.

  6. Alexander Litvak

    Maybe I should try your fork then, just to see if things work out for me. I will post results here when I do. The idea is to use sync with restart program, I guess.

  7. Dohnuts

    The idea is more to put sync on the check program; I advise against any use of restart in the current state of monit.

  8. Alexander Litvak

    Hmm,

    Any use? Restart in general works for check process with a pid. Should I be worried there?

  9. Dohnuts

    IMHO restart is nonsense; what you want is more of a reload.

    Because restart is stop (if started) then start, and lots of programs support -HUP. For example, if the ssh configuration file changes, you may want to reload the daemon without cutting the connections by calling reload (SIGHUP) instead of restart.
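
    As a minimal sketch of the reload idea, assuming the daemon's PID is already known: sending SIGHUP is a single `kill` call. The helper name `send_reload` is hypothetical, and whether a given daemon treats SIGHUP as "reload configuration" depends on that daemon.

```shell
# Hypothetical helper: ask a running daemon to reload by sending it SIGHUP.
# Usage: send_reload <pid>
send_reload() {
    pid=$1
    # Refuse empty or non-numeric input rather than signalling something unintended.
    case $pid in
        ''|*[!0-9]*) return 1 ;;
    esac
    kill -HUP "$pid"
}
```

    Unlike a stop followed by a start, this leaves the process running, so the decision about what to re-read is left to the daemon itself.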

    So no worries, just look at what restart does, and maybe you could ask for a way to reload; someone did that already if you dig through the mailing list.

    What worries me is the answers from the monit team: they are always working on <the> new engine, and there is no commit or branch to see where it is going.

  10. Tildeslash repo owner

    From your report this seems like a bug and we'll look into this. We plan to have a Monit sprint next week to address open issues. Restart is not nonsense as someone claimed here. If you define a restart program, then this is the program, and the only program called when you do

    ..then restart
    

    Many init, upstart or systems scripts also use restart because it might be a different operation than stop then start.

  11. Tildeslash repo owner

    There was a bug in Monit 5.7 and 5.8 that produced the mentioned error ("no start/stop/restart defined" even though they were present) if the monitored service was not of the "process" type. This problem is fixed now; you can get the development version from BitBucket:

    https://bitbucket.org/tildeslash/monit/get/master.tar.gz

    To compile:

        sh ./bootstrap
        ./configure
        make
    

    Best regards, The Monit team

  12. Alexander Litvak

    Thank you. I will test it ASAP. However, a second problem was discovered and posted in the same issue. It has to do with the timing of custom script exec. Please take a look at the second post in this issue. Any plans to address that? There was a proposal to use sync, but it was rejected by you.

    Thanks Again,

  13. Tildeslash repo owner

    The "check program" problem is a known issue; it is described in the following bug:

    https://bitbucket.org/tildeslash/monit/issue/19/race-condition-when-using-check-program

    We will fix it with the new non-blocking test scheduler (the old model will be dropped, so we don't plan to add the "sync" patch).

    Regarding the original issue: if you want to check a process with no pidfile, you can use a pattern-based process check, for example:

    check process sshd matching "/usr/sbin/sshd -D"
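
    For illustration only, the pattern-based check could be combined with the start, stop and restart programs from the original report. This is a sketch under that assumption, not a tested configuration; the init script paths are the reporter's and may differ on other systems.

```
check process sshd matching "/usr/sbin/sshd -D"
  start program = "/etc/init.d/sshd start"
  stop program = "/etc/init.d/sshd stop"
  restart program = "/etc/init.d/sshd restart"
  if 5 restarts within 5 cycles then timeout
```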
    

    Regards, The Monit team
