"check program" doesn't want to issue restart command
Hi,
Since sshd doesn't seem to create a pid file no matter what I try on my system (OpenSUSE 13.1), I've converted the check process script to check program, but I'm seeing several issues preventing it from working right. I'm on monit 5.8.
As per https://mmonit.com/monit/documentation/monit.html, for program status testing, action is a choice of "ALERT", "RESTART", "START", "STOP", "EXEC" or "UNMONITOR".
Unfortunately, it doesn't look like "restart" is working.
When the script looks like this:
check program sshd with path "/usr/sbin/rcsshd status"
  start program = "/etc/init.d/sshd start"
  stop program = "/etc/init.d/sshd stop"
  restart program = "/etc/init.d/sshd restart"
  if status != 0 then restart
  if 5 restarts within 5 cycles then timeout
and I stop the sshd server, the program check detects the failure but then refuses to restart it. I'm seeing this in the log:
[PDT Apr 18 17:22:27] debug : monit: Start, stop or restart method not defined for process check 'sshd'
It looks like it's unable to see the restart program directive, or the start and stop, and refuses to actually do something.
Why is this happening?
Comments (23)
-
reporter -
Artem,
I have the same issue. Were you able to find a meaningful explanation?
-
reporter @alitvakl69 Nope, still waiting for the response from monit maintainers.
-
Say you don't know what tildeslash's drink of choice is :-)
-
Did you try pull request #9?
-
reporter No. The main problem, described in the first comment, has nothing to do with sync, and the patch was rejected by the developer anyway.
-
<< I think monit executes the restart command and then tries to check the program status right away too fast. >>
Maybe the pull request would be accepted if it fixes your bug.
-
I wanted to try #9, but it was failing to build for me. I contacted the contributor and he has not replied yet. Also, it only somewhat covers the second part of the issue. So far no one has explained why, when attempting a restart from a standalone program check or a standalone connection check, monit fails to do so, saying "Start, stop, or restart methods were not defined". This sounds like a clear bug to me.
restart is broken.
I used pull request #9 myself because it simplifies one of my setups, and it worked. But I have a lot of personal diffs, and maybe I fixed something for compilation along the way (my fork is private and quite active), especially in cervlet.c.
-
Maybe I should try your fork then, just to see if things work out for me. I will post the results here when I do. The idea is to use sync with the restart program, I guess.
-
The idea is rather to put sync on the check program. I advise against any use of restart in the current state of monit.
-
Hmm,
Any use? Restart in general works for check process with a pid. Should I be worried there?
-
IMHO restart is nonsense; what you want is more of a reload.
Restart is stop (if started) then start, and lots of programs prefer -HUP. For example, if the ssh configuration file changes, you may want to reload the daemon without cutting the connections by calling reload (SIGHUP) instead of restart.
So no worries, just look at what restart does, and maybe you could ask for a way to reload. Someone did that already, if you dig through the mailing list.
What worries me is the answers from the monit team: they are always working on the new engine, and there is no commit or branch to see where it is going.
-
repo owner From your report this seems like a bug and we'll look into this. We plan to have a Monit sprint next week to address open issues. Restart is not nonsense as someone claimed here. If you define a restart program, then this is the program, and the only program called when you do
..then restart
Many init, upstart or systems scripts also use restart because it might be a different operation than stop then start.
-
repo owner -
assigned issue to
- changed version to 5.8.1
-
repo owner There was a bug in Monit 5.7 and 5.8 that produced the mentioned error ("no start/stop/restart defined" even though they were present) if the monitored service was not of the "process" type. This problem is fixed now; you can get the development version from BitBucket:
https://bitbucket.org/tildeslash/monit/get/master.tar.gz
To compile:
sh ./bootstrap
./configure
make
Best regards, The Monit team
-
repo owner - changed version to 5.8
-
repo owner - changed status to resolved
The fix will be part of the next Monit release (5.8.1 or 5.9).
-
Thank you. I will test it ASAP. However there was a second problem discovered and posted in the same issue. This has to do with timing of custom scripts exec. Please take a look at the second post in this issue. Any plans to address that? There was a proposal to use sync but it was rejected by you.
Thanks Again,
-
repo owner The "check program" problem is a known issue; it is described in the following bug:
https://bitbucket.org/tildeslash/monit/issue/19/race-condition-when-using-check-program
We will fix it with the new non-blocking test scheduler (the old model will be dropped, so we don't plan to add the "sync" patch).
Regarding the original issue - if you want to check process with no pidfile, you can use pattern based process check, for example:
check process sshd matching "/usr/sbin/sshd -D"
Regards, The Monit team
-
repo owner - changed component to 1. Monit
-
repo owner - changed component to Monit
-
repo owner - removed version
Removing version: 5.8 (automated comment)
-
I also tried the following:
But I think it's not reading the full quoted command, because in the log I'm seeing this:
[PDT Apr 18 17:44:34] info : 'sshd' exec: /etc/init.d/sshd
without the "restart" bit.
There is a chance that it does work though, in which case the log line isn't complete, because after the 5th try, the check finally succeeds. I think there's a race condition. Here's what happens:
Note the tiny 1-2ms in the "Active:" lines as well as the "inactive (dead)" bits. I think monit executes the restart command and then tries to check the program status right away too fast. It did somehow succeed in the end, but not before retrying a bunch of times.
Ideally, the restart program problem from the original ticket comment above should be sorted out though. It still bothers me that I can't seem to make that work.
Any suggestions are welcome.
Thank you.
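One possible stopgap for the race described above (the status check running a moment after the restart command, while sshd still reports "inactive") might be to let the status program itself retry briefly before reporting failure. The following is only a minimal sketch under that assumption: `retry_status` is a hypothetical helper, not part of monit, and the attempt count and delay are illustrative values.

```shell
#!/bin/sh
# Hypothetical helper, not part of monit: run a status command up to
# $1 times, sleeping $2 seconds between attempts, and succeed as soon
# as one attempt exits 0. Returns 1 if every attempt fails.
retry_status() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Run the status command quietly; only its exit code matters.
        if "$@" >/dev/null 2>&1; then
            return 0
        fi
        # Wait before the next attempt, but not after the last one.
        if [ "$i" -lt "$attempts" ]; then
            sleep "$delay"
        fi
        i=$((i + 1))
    done
    return 1
}

# Example call (status command taken from the original post):
#   retry_status 5 1 /usr/sbin/rcsshd status
```

To try it, the function could be saved in a small wrapper script that ends with the example call, and the `check program ... with path` line pointed at that wrapper instead of at `rcsshd` directly. The total wait (attempts times delay) should presumably stay well below the monit cycle length.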