monit stuck in "uninterruptible sleep" state (state D in top). kill -9 doesn't work.

Issue #317 closed
Arkadiy Kulev created an issue

Version 5.14. The setup is very simple, and I don't know where to look for the problem. Monit works for some time and then just gets stuck. OS: CentOS release 6.7 (Final).

Comments (16)

  1. Tildeslash repo owner

    Please can you send your monit log?

    It can be enabled with a "set logfile" statement, writing either to a specific file:

    set logfile <path>
    

    or syslog:

    set logfile syslog
    
  2. Arkadiy Kulev reporter

    I noticed that it was stuck on Feb 2. Its last lines were:

    [MSK Jan 28 05:39:30] info     : Reinitializing monit daemon
    [MSK Jan 28 06:06:15] info     : Starting Monit 5.14 daemon with http interface at [127.0.0.1]:2812
    [MSK Jan 28 06:06:15] info     : Starting Monit HTTP server at [127.0.0.1]:2812
    [MSK Jan 28 06:06:15] info     : Monit HTTP server started
    [MSK Jan 28 06:06:15] info     : 'mariadb-02.local' Monit 5.14 started
    [MSK Jan 28 06:06:16] error    : 'memcached' process is not running
    [MSK Jan 28 06:06:16] info     : 'memcached' trying to restart
    [MSK Jan 28 06:06:16] info     : 'memcached' start: /etc/init.d/memcached
    [MSK Jan 28 06:07:17] info     : 'memcached' process is running with pid 1029
    

    ps aux shows:

    root       987  0.0  0.0 120768  1948 ?        D    Jan28   0:06 monit
    
  3. Tildeslash repo owner

    Can you please take a backtrace?

    gdb <path to monit binary> <monit's PID>
    (gdb) thread apply all backtrace full
    
  4. Arkadiy Kulev reporter

    I restarted the LXC it was running on. Will let you know once I see this problem again.

  5. Tildeslash repo owner

    If you use monit in LXC, we recommend upgrading to the upcoming monit 5.16 release (it should be available today). It comes with a fix related to LXC: if you had a process check with a connection test, the connection test was skipped, because it was not possible to collect part of the data inside an LXC container.

  6. Arkadiy Kulev reporter

    I had another one stuck in another LXC. I am not using any connection tests (only plain .pid checks).

    GDB is stuck too. It won't even let me enter the "thread apply all backtrace full" command. The last lines are:

    Reading symbols from /usr/bin/monit...(no debugging symbols found)...done.
    Attaching to program: /usr/bin/monit, process 509
    
  7. Tildeslash repo owner

    It seems that it could be an LXC/kernel bug - monit probably hangs in some system call.
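
    When gdb cannot attach, one way to see where a D-state process is blocked is to read its /proc entries instead. The following is only a minimal sketch, assuming a Linux /proc; `show_blocked_task` is a hypothetical helper name, and `/proc/<pid>/stack` usually requires root and kernel support, so inside a container it may be necessary to run this from the host:

```shell
# Hypothetical helper: summarize where a task is blocked, using only /proc.
# It does not ptrace-attach, so it should not normally hang the way gdb does.
show_blocked_task() {
    pid=$1
    [ -d "/proc/$pid" ] || { echo "no such process: $pid" >&2; return 1; }
    # Process state, e.g. "D (disk sleep)", from the status file.
    grep '^State:' "/proc/$pid/status"
    # Kernel function the task is sleeping in, if the kernel exposes it.
    if [ -r "/proc/$pid/wchan" ]; then
        printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan")"
    fi
    # Kernel stack; usually needs root, so a failure here is tolerated.
    cat "/proc/$pid/stack" 2>/dev/null || echo "stack: not available"
}
```

    For example, `show_blocked_task 509` would inspect the stuck monit process from the gdb attempt above.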

    There were fixes for signal handling in monit 5.15 and a fix for LXC uptime in 5.16. I recommend upgrading to monit 5.16, which was released today, to rule out problems that have already been fixed.

    Then start monit in debug mode using the "-v" option - monit will log more details about every operation it does, so we'll see what happened just before the hang.

  8. Tildeslash repo owner

    Thanks for the data. It seems that the problem is in the FUSE driver (not a monit bug): monit just performs a read, and the read seems to get stuck in the driver.

    The debug mode will help to trace which read triggers the issue.

  9. Tildeslash repo owner

    I think it should be possible to trigger the problem without involving monit. The following script collects data from the /proc filesystem every 5 seconds, similarly to monit. Can you try running it and see whether it gets stuck as well?

    # Read /proc data every 5 seconds, similarly to monit.
    while true; do
        cat /proc/meminfo /proc/stat /proc/[1-9]*/stat /proc/[1-9]*/status > /dev/null
        sleep 5
    done
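
    If the loop does hang, it does not show which file caused it. A possible variant, sketched here with the hypothetical helper names `probe_proc_once` and `probe_proc_loop`, reads each file separately and prints its path first, so the last path printed should be the read that blocked:

```shell
# Hypothetical helper: read each /proc file on its own and print the path
# before reading, so a blocked read leaves its path as the last line of output.
probe_proc_once() {
    for f in /proc/meminfo /proc/stat /proc/[1-9]*/stat /proc/[1-9]*/status; do
        # Processes may exit between glob expansion and the read; skip those.
        [ -r "$f" ] || continue
        printf 'reading %s\n' "$f"
        cat "$f" > /dev/null 2>&1 || true
    done
}

# Repeat every 5 seconds, as in the loop above (runs until interrupted).
probe_proc_loop() {
    while true; do
        probe_proc_once
        sleep 5
    done
}
```

    Running `probe_proc_loop` with its output redirected to a file keeps the record available even if the shell itself can no longer be interrupted.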
    