if failed ping4 hangs up monit

Issue #226 resolved
Anonymous created an issue

Symptom: The "Data collected" counter freezes for every monitored item. The HTTP daemon still works, but no new checks are performed. New commands are still accepted (via HTTP or the command line) but are not executed. After monit start, the summary shows "initializing" forever.

Cause: A check host router with address "192.168.0.x" and an "if failed ping4 then alert" rule. The moment 192.168.0.x went offline, the next status change hung. Neither monit reload nor service monit restart helps. Hangs on router initializing...

Workaround: Change address "192.168.0.x" to a hostname. It looks like monit does not like ICMP "destination host unreachable" responses from the router. Changing to a hostname gives an "unknown host" message, which works.

Environment: Android, armhf CPU, chrooted Debian, with IPv6. Got this problem after moving from v5.12 to v5.13.

Comments (31)

  1. Tildeslash repo owner

    We're unable to replicate the issue, using the following configuration:

    check host nonexistenthost with address 192.168.1.100
        if failed ping4 then alert
    

    Debug output:

    Ping response for 192.168.1.100 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.100 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.100 3/3 timed out -- no response within 5 seconds
    'nonexistenthost' ping test failed
    

    ... and the monitoring continues normally after the error is reported

    Please can you run monit in verbose mode (using "-vI" options) and provide output? Please also collect a network trace of ICMP messages between monit host and the target machine.

  2. Andy Miller

    Sorry for the delay. I installed on a new box (different Android version, different hardware) and moved to v5.14. The problem is still there and can be reproduced by unplugging the Ethernet cable from the target box. The next time monit runs the test: bang. I ran tcpdump -ni wlan0 'dst 192.168.0.40 and icmp', but it only shows that monit got stuck after the second of the three (default) pings to the target. Like:

    ...726247 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 0, length 72
    ....730562 IP 192.168.0.30 > 192.168.0.40: ICMP echo request, id 9231, seq 1, length 72

    I still have the test box handy. If you need a full packet trace, could you please tell me which tcpdump parameters you want? Thank you very much for your help.

  3. Tildeslash repo owner

    Thanks for update.

    We're still unable to replicate the problem.

    One machine pings another box (link up on both sides):

    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.008851s
    'raspberry' ping test succeeded [response time 0.009s]
    
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.002507s
    'raspberry' ping test succeeded [response time 0.003s]
    
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.023783s
    'raspberry' ping test succeeded [response time 0.024s]
    

    Cable unplugged from the target box; monit was left to repeat the test three times (each time with 3 retries and a 5-second timeout per retry):

    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    
    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    
    Ping response for 192.168.1.12 1/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 2/3 timed out -- no response within 5 seconds
    Ping response for 192.168.1.12 3/3 timed out -- no response within 5 seconds
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    

    connected the cable back to the target box, ping succeeded:

    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=2.353867s
    'raspberry' ping test succeeded [response time 2.354s]
    
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.005074s
    'raspberry' ping test succeeded [response time 0.005s]
    
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.004035s
    'raspberry' ping test succeeded [response time 0.004s]
    
    Ping response for 192.168.1.12 1/3 succeeded -- received id=1144 sequence=0 response_time=0.097907s
    'raspberry' ping test succeeded [response time 0.098s]
    

    Turned off the link on the source box (the errors are immediate this time: monit doesn't wait for the ping timeout and returns an error right away, as there is no link):

    Ping request for 192.168.1.12 1/3 failed -- No route to host
    Ping request for 192.168.1.12 2/3 failed -- No route to host
    Ping request for 192.168.1.12 3/3 failed -- No route to host
    'raspberry' ping test failed
    'raspberry' icmp ping failed, skipping any port connection tests
    

    We'll continue testing. If we can't find the root cause, we'll prepare a debug version for you.

  4. Andy Miller

    Thank you very much for your effort. Maybe it's really my environment. The target I ping does not matter. I've tried an Intel IA64 Debian box, an embedded Linux running in a router, and even local services like nmbd (if failed host localhost port 137 type UDP retry 3 then restart). Everything except TCP gets stuck. The source (monit host) is always a chrooted jessie (Debian) running on different hardware, such as an HP Touchpad (kernel v3.0.101, CyanogenMod) or an HP Slate7 (kernel v3.0.8, ICS). What they have in common: jessie (armhf) and IPv6. I want to mention again that monit v5.12 worked without any problem. But rather than going back, I moved all IP addresses to hostnames and replaced all UDP tests with TCP where possible, or commented them out. So I'm still very happy with this workaround. Monit is such a great product. I cannot live without it!

  5. Tildeslash repo owner

    Thanks for the update. If the problem goes away when UDP is not used, it is more likely related to the UDP communication.

    Please can you gather the following data?:

    1.) attach strace to monit:

    sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace
    

    2.) trigger the problem

    3.) once the problem has started, keep strace running for about 5 more minutes, then stop it (^C) and send the /tmp/monit.strace file to support@mmonit.com

  6. Andy Miller

    Thanks for the test version. Got it compiled. monit -V says version 5.14.p1. Unfortunately the problem is not fixed yet. I'm rebooting, just in case, and will do another strace. Thanks for reading.

    ...strace sent.

  7. Tildeslash repo owner

    Thanks for the update and testing. The problem is that poll() is called with a crazy timeout value:

    poll([{fd=5, events=POLLIN}], 1, 1452059047)    # note: the last argument should be the timeout in ms, at most 5000 ms by default
    

    One possible root cause was the system time jumping back (fixed in the provided 5.14-p1), but since that didn't fix the problem, it looks like either memory corruption or an overflow. Both are strange. The value "1452059047" looks like a Unix timestamp, so it seems that the timeout argument was somehow corrupted. We will continue with the analysis tomorrow.
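
    The timestamp suspicion is easy to check. The following is a minimal Python sketch (not monit code) that decodes the value from the strace line above and shows how long poll() would block if the same number were taken as a timeout in milliseconds:

```python
from datetime import datetime, timezone

timeout_arg = 1452059047  # third argument of poll() in the strace line

# Read as seconds since the Unix epoch, the value is a plausible wall-clock time.
as_time = datetime.fromtimestamp(timeout_arg, tz=timezone.utc)
print(as_time.isoformat())  # 2016-01-06T05:44:07+00:00

# Read as a poll() timeout in milliseconds, it is more than two weeks.
as_days = timeout_arg / 1000 / 86400
print(round(as_days, 1))  # 16.8
```

    A wait of roughly 16.8 days is indistinguishable from a hang, which is consistent with the reported symptom.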

  8. Tildeslash repo owner

    I meant a data type overflow (not system memory). Your machine is most probably fine, so no more tests are needed for now. I'm trying to reproduce the problem on ARM, as it could be specific to that environment. I think the corruption could be caused by the way the local variables are declared combined with the ARM compiler, which probably reused the space, but that is just speculation. I'll update as soon as I have more details.

  9. Andy Miller

    First impression: something changed!!! Second impression: ALL GOOD. Working!!! Congratulations. Ping tests are working beautifully again. Thank you very much, Sir!

  10. monituser

    I re-tested and found that the bug from #251/#252 only occurs if you have the ping count set to 1.

    If the ping count isn't defined (i.e. the default of 3 is used), or another value is set, the patch above works as expected.

  11. Tildeslash repo owner

    I'm not able to reproduce the issue, using the following configuration:

    set daemon 5
    set httpd port 2812 allow 127.0.0.1
    
    check host foobar with address 192.168.1.11
        if failed ping count 1 then alert
    

    Output:

    Ping response for 192.168.1.11 1/1 timed out -- no response within 5 seconds
    'foobar' ping test failed
    'foobar' icmp ping failed, skipping any port connection tests
    

    Please can you send the following data?:

    1. monit log
    2. monit configuration
    3. full output of monit in debug mode: monit -vI
    4. strace output ... keep strace running until monit freezes, then keep it running for about 5 more minutes and stop it with ^C (send the /tmp/monit.strace file to support@mmonit.com)

      sudo strace -p <monit's PID> -s 256 -o /tmp/monit.strace

  12. Tildeslash repo owner

    Update for the ping with "count 1": the problem really is different from issue #226. I tested it more: when the ping count was 1, monit wasn't happy with the ping response because the sequence ID sanity check failed (due to a bug), so it waited up to the timeout (5 s by default) for a better response, which never came. Monit thus didn't hang, but every ping test was stalled for the full timeout. If you have 50 hosts, as mentioned, test progress and service initialization will be very slow.

    I have prepared a patched 5.14, which includes the fixes for issues #226 and #251. @monituser, can you please test it? https://mmonit.com/tmp/monit-5.14-p4.tar.gz
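
    As a rough estimate of that slowdown, here is a small Python sketch. It assumes the 50 ping tests run one after another within a cycle and that each one waits out the full default timeout. The sequential execution is an assumption made for illustration, not something confirmed in this thread.

```python
hosts = 50      # number of monitored hosts mentioned above
count = 1       # ping count that triggers the sanity-check bug
timeout_s = 5   # default ping timeout in seconds

# Every ping waits for the full timeout, so the delays add up per cycle.
worst_case_s = hosts * count * timeout_s
print(worst_case_s)  # 250
```

    Under those assumptions a single cycle spends more than four minutes just waiting, even though every host answers.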

  13. Steven Christensen

    Hello, I am using Monit 5.15 (not beta) on a Raspberry Pi and I appear to be having this very problem. It exhibits the same symptoms - monitoring does not go past the ping test if the host is not present.

    I have linked to the relevant debug files - can someone see if this is the same problem? I don't grok strace...

    monitrc - http://pastebin.com/nTcNw4CU

    monit debug output (stdout) - http://pastebin.com/i8EzVRSs

    monit log output - http://pastebin.com/iQxCPQvX

    monit strace output - http://pastebin.com/FAby8L9Z

  14. Steven Christensen

    I think I see suspicious poll timeouts in the end of the strace data

    poll([{fd=6, events=POLLIN}], 1, 4000)  = 1 ([{fd=6, revents=POLLIN}])
    gettimeofday({1447360948, 803242}, NULL) = 0
    recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 120
    poll([{fd=6, events=POLLIN}], 1, -2993190) = 1 ([{fd=6, revents=POLLIN}])
    gettimeofday({1447360972, 133159}, NULL) = 0
    recvfrom(6, "...data...", 1500, 0, {sa_family=AF_INET, sin_port=htons(0), sin_addr=inet_addr("192.168.0.1")}, [16]) = 76
    poll([{fd=6, events=POLLIN}], 1, -26323107 <unfinished ...>
    

    Of the three poll activities, only the first one looks like it has a reasonable timeout (4000, which matches the 4 seconds specified in the monitrc file). The other two look suspicious / wrong...
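
    One more observation from the same excerpt, as a hedged Python sketch: the two negative timeouts differ by exactly the number of microseconds that passed between the two gettimeofday calls. That suggests, but does not prove, that the remaining time is tracked in microseconds and is passed to poll() without being limited. On Linux a negative poll() timeout means "wait forever". The clamp_timeout_ms helper is hypothetical and only illustrates the kind of guard that would avoid this; it is not monit's actual fix.

```python
# Values copied from the strace excerpt above.
t1_us = 1447360948 * 1_000_000 + 803242   # first gettimeofday
t2_us = 1447360972 * 1_000_000 + 133159   # second gettimeofday
second_timeout = -2993190                 # timeout of the second poll()
third_timeout = -26323107                 # timeout of the third poll()

elapsed_us = t2_us - t1_us
drift = second_timeout - third_timeout
print(elapsed_us, drift)  # 23329917 23329917


def clamp_timeout_ms(remaining_ms, limit_ms=4000):
    """Hypothetical guard: keep a poll() timeout within [0, limit_ms]."""
    return max(0, min(remaining_ms, limit_ms))


print(clamp_timeout_ms(third_timeout))  # 0, i.e. return immediately
```

    With such a guard, an expired deadline gives a timeout of 0 and poll() returns at once instead of blocking indefinitely.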

  15. Steven Christensen

    I downloaded monit 5.15_p1 and installed it on my Raspberry Pi, and it appears to be functioning correctly. It got past the failure to ping correctly. Thank you for your quick response - I would call this fixed.
