monit 5.17.1 - solaris 11 bug with http check

Issue #487 resolved
Scott Halstead
created an issue

On Solaris we have enabled an HTTP port check. Running with -vv, we see that the connection is refused but the test succeeds anyway.

[EDT Oct 19 09:04:49] debug    : Socket test failed for [127.0.0.1]:4242 -- Connection refused
[EDT Oct 19 09:04:49] debug    : 'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 0.894 ms]

Comments (19)

  1. Scott Halstead reporter

    Config file

    check process logchipper pidfile /opt/inf/var/run/logchipper/logchipper.pid
      start program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      restart program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper restart"
      stop program = "/usr/bin/sudo -u logchipper /etc/init.d/logchipper stop"
      if failed port 4242 protocol http request /health then restart
      if 3 restarts within 6 cycles then unmonitor
      if memory greater than 500 MB for 5 cycles then restart
    
  2. Tildeslash repo owner

    Tested with monit 5.17.1 and monit 5.19.0 on Solaris 11.3 with the following configuration:

    set daemon 5
    
    set httpd port 2812 allow localhost
    
    check system $HOST
    
    check process apache matching "httpd"
        if failed port 4242 protocol http request /health then restart
    

    Works fine:

    'apache' process is running with pid 1058
    'apache' zombie check succeeded
    Socket test failed for [::1]:4242 -- Connection refused
    Socket test failed for [127.0.0.1]:4242 -- Connection refused
    'apache' failed protocol test [HTTP] at [localhost]:4242/health [TCP/IP] -- Connection refused
    'apache' trying to restart
    'apache' stop skipped -- method not defined
    'apache' start method not defined
    'apache' monitoring enabled
    'apache' process is running with pid 1058
    'apache' zombie check succeeded
    

    How was the binary built: compiled from source, or the pre-compiled binary from our site (https://mmonit.com/monit/dist/binary/)?

    Can you please test with monit 5.19.0?

    If you don't specify the host explicitly, monit will try to connect to every address that localhost resolves to, usually IPv4 "127.0.0.1" and IPv6 "::1". Is it possible that your port 4242 responds on the IPv6 interface? (The log contains an entry only for IPv4, followed by success.) You can check this, for example, using pfiles:

    pfiles /proc/* | grep AF_INET | grep 4242
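
The "try every address localhost resolves to" behaviour described above can be illustrated with a minimal Python sketch. This is not monit's code; the function names `probe_all` and `check_passes` are hypothetical, and the rule "the check passes if any address accepts the connection" is an assumption drawn from the explanation in this comment.

```python
import socket

def probe_all(host, port, timeout=2.0):
    """Try a TCP connect to every address `host` resolves to.

    Returns a list of (address, error) pairs; error is None on success.
    """
    results = []
    for family, socktype, proto, _canon, sockaddr in socket.getaddrinfo(
            host, port, type=socket.SOCK_STREAM):
        try:
            with socket.socket(family, socktype, proto) as s:
                s.settimeout(timeout)
                s.connect(sockaddr)
            results.append((sockaddr[0], None))
        except OSError as exc:
            # A refused connection on one address is recorded, not fatal.
            results.append((sockaddr[0], str(exc)))
    return results

def check_passes(results):
    """Assumed rule: the overall check succeeds if any address connected."""
    return any(err is None for _addr, err in results)

if __name__ == "__main__":
    for addr, err in probe_all("localhost", 4242):
        print(addr, "ok" if err is None else err)
```

Under this rule, a "Connection refused" line for 127.0.0.1 followed by an overall success would be expected only if another address, such as ::1, accepted the connection.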
    
  3. Scott Halstead reporter

    I believe we used the binary distribution. I'm 99% sure, but I need to check with a few folks to confirm (they are out at the moment).

    pfiles /proc/* 2>&1 | grep AF_INET | grep 4242 | grep -v permission
            sockname: AF_INET 0.0.0.0  port: 4242
    
  4. Scott Halstead reporter

    I downloaded your 5.19 binary and gave it a whirl. I still see the same problem:

    'logchipper' process is running with pid 168
    'logchipper' zombie check succeeded
    'logchipper' mem amount check succeeded [current mem amount=1.5 MB]
    Socket test failed for [127.0.0.1]:4242 -- Connection refused
    'logchipper' succeeded testing protocol [HTTP] at [localhost]:4242/health [TCP/IP] [response time 3.827 s]
    'logchipper' connection succeeded to [localhost]:4242/health [TCP/IP]
    
  5. Scott Halstead reporter

    Looking at your code, it appears that the exception thrown on the Socket test failure isn't propagating back correctly on Solaris. Execution never reaches the catch section in _checkConnection() in validate.c, lines 151-158.

    I haven't run this in a debugger, but the log call that prints "succeeded testing protocol" sits just below the Socket_test() call.

  6. Scott Halstead reporter

    Just a note that "connection refused" is the correct error: I was using a bad / reused pid. What needs to happen is that if this test fails, the program is restarted. monit correctly detects that the port can't be accessed, but on Solaris (only) the error doesn't result in a restart.

  7. Scott Halstead reporter

    The problem we are trying to address is that when a machine crashes and reboots, the old pid files are still there at startup, and a pid may have been reused by another task. When monit does the pid check, it succeeds because that task is alive, but it is not the task we want (e.g. not logchipper but some other task). We don't have a means (via monit) to confirm whether the executable associated with the pid is logchipper. Therefore we are using the port health check, which fails as expected but does not trigger the restart.

    It might be useful to augment monit with a check for both a pid and a task name, either as a one-time check during startup or run rarely (say once an hour).
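
The pid-plus-task-name check suggested above could look roughly like the following Python sketch, run outside monit. It is not a monit feature; `ps_command_name` and `pidfile_is_stale` are hypothetical helpers, and the sketch assumes that `ps -p PID -o comm=` reports the command name (possibly with a path) on the target system.

```python
import subprocess

def ps_command_name(pid):
    """Return the command name reported by `ps -p PID -o comm=`, or None."""
    proc = subprocess.run(
        ["ps", "-p", str(pid), "-o", "comm="],
        capture_output=True, text=True)
    name = proc.stdout.strip()
    return name or None  # empty output means no such process

def pidfile_is_stale(pidfile_text, expected_name, lookup=ps_command_name):
    """True if the pid is unparsable, dead, or belongs to another program."""
    try:
        pid = int(pidfile_text.strip())
    except ValueError:
        return True
    name = lookup(pid)
    if name is None:
        return True
    # Compare on the basename, since ps may print a full path.
    return name.rsplit("/", 1)[-1] != expected_name

if __name__ == "__main__":
    with open("/opt/inf/var/run/logchipper/logchipper.pid") as fh:
        print("stale" if pidfile_is_stale(fh.read(), "logchipper") else "ok")
```

A start script could run such a check once at boot and remove a stale pid file before monit's first cycle. The `lookup` parameter exists only so the comparison logic can be exercised without a live process.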

  8. Tildeslash repo owner

    We're unable to reproduce the problem. Is it possible to get remote access to the host that has this problem, or can you provide a VMware/VirtualBox image of the test system where it fails?
