monit stuck in "uninterruptible sleep" state (state D in top). kill -9 doesn't work.
Version 5.14. The setup is very simple, so I don't know where to look for problems. Monit works for some time and then just gets stuck. OS: CentOS release 6.7 (Final)
Comments (16)
-
repo owner Please can you send your monit log?
It can be enabled with a "set logfile" statement, pointing either to a specific file:
set logfile <path>
or syslog:
set logfile syslog
-
reporter I noticed that it was stuck on Feb 2. Its last lines were:
[MSK Jan 28 05:39:30] info : Reinitializing monit daemon
[MSK Jan 28 06:06:15] info : Starting Monit 5.14 daemon with http interface at [127.0.0.1]:2812
[MSK Jan 28 06:06:15] info : Starting Monit HTTP server at [127.0.0.1]:2812
[MSK Jan 28 06:06:15] info : Monit HTTP server started
[MSK Jan 28 06:06:15] info : 'mariadb-02.local' Monit 5.14 started
[MSK Jan 28 06:06:16] error : 'memcached' process is not running
[MSK Jan 28 06:06:16] info : 'memcached' trying to restart
[MSK Jan 28 06:06:16] info : 'memcached' start: /etc/init.d/memcached
[MSK Jan 28 06:07:17] info : 'memcached' process is running with pid 1029
ps aux shows:
root 987 0.0 0.0 120768 1948 ? D Jan28 0:06 monit
-
repo owner Can you please take a backtrace?
gdb <path to monit binary> <monit's PID>
(gdb) thread apply all backtrace full
-
reporter I restarted the LXC it was running on. Will let you know once I see this problem again.
-
repo owner If you run monit in LXC, we recommend upgrading to the upcoming monit 5.16 release (it should be available today). It includes an LXC-related fix: for a process check with a connection test, the connection test was skipped because part of the data could not be collected inside an LXC container.
-
repo owner - changed status to on hold
Waiting for data (a stack trace when the problem occurs again).
-
reporter I had another one stuck in another LXC. I am not using any connection tests (only plain .pid checks).
GDB is stuck too. It won't even let me enter the "thread apply all backtrace full" command. The last lines are:
Reading symbols from /usr/bin/monit...(no debugging symbols found)...done.
Attaching to program: /usr/bin/monit, process 509
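When gdb cannot attach to a process in state D, some kernel-side detail is usually still readable from /proc. The following is only a minimal sketch, assuming a Linux /proc layout; `blocked_info` is a hypothetical helper name, and /proc/<pid>/stack typically requires root and may be missing on some kernels.

```shell
#!/bin/sh
# Print kernel-side details for one process without attaching a debugger.
# blocked_info is a hypothetical helper name, not part of monit or gdb.
blocked_info() {
    pid="$1"
    # Process state as the kernel reports it, e.g. "State: D (disk sleep)".
    grep '^State:' "/proc/$pid/status"
    # Kernel symbol the task is sleeping in, if the kernel exposes it.
    printf 'wchan: %s\n' "$(cat "/proc/$pid/wchan" 2>/dev/null)"
    # Kernel stack; typically needs root and may be absent on some kernels.
    cat "/proc/$pid/stack" 2>/dev/null || echo "stack: not readable"
}

# Inspect the current shell by default; pass monit's PID to inspect monit.
blocked_info "${1:-$$}"
```

Run as root with monit's PID as the first argument; the wchan and stack output may indicate which kernel function the process is blocked in.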
-
repo owner - changed status to open
-
repo owner It looks like this could be an LXC/kernel bug: monit probably hangs in some system call.
There were fixes for signal handling in monit 5.15 and a fix for LXC uptime in 5.16. I recommend upgrading to monit 5.16, which was released today, to rule out problems that have already been fixed.
Then start monit in debug mode using the "-v" option. Monit will log more details about every operation it performs, so we'll see what happened just before the hang.
-
repo owner Thanks for the data. The problem appears to be in the FUSE driver (not a monit bug): monit just performs a read, and the read seems to get stuck in the driver.
Debug mode will help to trace which read triggers the issue.
-
repo owner I think it should be possible to trigger the problem without involving monit. The following script collects data from the /proc filesystem every 5 seconds, similarly to monit. Can you try running it and see if it gets stuck as well?
# Read the same kind of /proc files monit polls, every 5 seconds.
while true; do
    cat /proc/meminfo /proc/stat /proc/[1-9]*/stat /proc/[1-9]*/status > /dev/null
    sleep 5
done
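To narrow down which read hangs, one option is to record each file name just before reading it, so that the last recorded name points at the read that never returned. This is only a sketch of that idea, not something monit does; `trace_pass` and the marker path /tmp/proc-read-marker are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Marker file holding the name of the /proc file currently being read.
# The default path is an assumption; override MARKER to change it.
MARKER="${MARKER:-/tmp/proc-read-marker}"

# One pass over the same /proc files as the script above, one file at a time.
trace_pass() {
    for f in /proc/meminfo /proc/stat /proc/[1-9]*/stat /proc/[1-9]*/status; do
        # Overwrite the marker before the read, so it survives a hang in cat.
        echo "$(date '+%Y-%m-%d %H:%M:%S') reading $f" > "$MARKER"
        # Processes may exit between glob expansion and the read; ignore errors.
        cat "$f" > /dev/null 2>&1
    done
}

# To repeat every 5 seconds, as in the original script:
#   while true; do trace_pass; sleep 5; done
trace_pass
```

If the loop gets stuck, the marker file should name the file whose read did not complete.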
-
repo owner - changed status to closed
Not a monit bug; this is a FUSE driver issue.
The problem could be related to the following FUSE CVE: http://people.canonical.com/~ubuntu-security/cve/2015/CVE-2015-8785.html
-
repo owner - removed version
Removing version: 5.14 (automated comment)