monit stuck in "uninterruptible sleep" state (state D in top). kill -9 doesn't work.

Issue #317 closed

Arkadiy Kulev created an issue 2016-02-03

Monit version 5.14 on CentOS release 6.7 (Final). The setup is very simple, so I don't know where to look for the problem. Monit works for some time and then just gets stuck.

Comments (16)

Arkadiy Kulev reporter
- edited description
- 2016-02-03T02:53:27+00:00
Arkadiy Kulev reporter
- edited description
- 2016-02-03T02:53:45+00:00
Tildeslash repo owner
Please can you send your monit log?

It can be enabled with a "set logfile" statement ... either to specific file:
```
set logfile <path>
```
or syslog:
```
set logfile syslog
```
- 2016-02-03T08:58:16+00:00

Arkadiy Kulev reporter

I noticed that it was stuck on Feb 2. Its last lines were:

[MSK Jan 28 05:39:30] info     : Reinitializing monit daemon
[MSK Jan 28 06:06:15] info     : Starting Monit 5.14 daemon with http interface at [127.0.0.1]:2812
[MSK Jan 28 06:06:15] info     : Starting Monit HTTP server at [127.0.0.1]:2812
[MSK Jan 28 06:06:15] info     : Monit HTTP server started
[MSK Jan 28 06:06:15] info     : 'mariadb-02.local' Monit 5.14 started
[MSK Jan 28 06:06:16] error    : 'memcached' process is not running
[MSK Jan 28 06:06:16] info     : 'memcached' trying to restart
[MSK Jan 28 06:06:16] info     : 'memcached' start: /etc/init.d/memcached
[MSK Jan 28 06:07:17] info     : 'memcached' process is running with pid 1029

ps aux shows:

root       987  0.0  0.0 120768  1948 ?        D    Jan28   0:06 monit

2016-02-03T09:00:51+00:00

Tildeslash repo owner

Please can you take a backtrace?:

gdb <path to monit binary> <monit's PID>
(gdb) thread apply all backtrace full

2016-02-03T09:07:28+00:00

Arkadiy Kulev reporter
I restarted the LXC it was running on. Will let you know once I see this problem again.
- 2016-02-03T09:08:52+00:00
Tildeslash repo owner
If you use monit in LXC, we recommend upgrading to the upcoming monit 5.16 release (should be available today). It comes with a fix related to LXC: if you had a process check with a connection test, the connection test was skipped, because part of the data could not be collected inside an LXC container.
- 2016-02-03T09:17:53+00:00
Tildeslash repo owner
- changed status to on hold
Waiting for data (a stack trace when the problem occurs again).
- 2016-02-03T09:24:32+00:00
Arkadiy Kulev reporter
I had another one stuck in another LXC. I am not using any connection tests (only plain .pid checks).

GDB is stuck too. It won't even let me enter the "thread apply all backtrace full" command. The last lines are:
```
Reading symbols from /usr/bin/monit...(no debugging symbols found)...done.
Attaching to program: /usr/bin/monit, process 509
```
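
When a debugger cannot attach, the kernel still exposes some information about a sleeping task through procfs. The sketch below reads `wchan`, `syscall` and `stack` for a given PID; these files exist on Linux, but `stack` is normally readable only by root and only on kernels built with stack trace support, so whether any of them is available in this environment is an assumption. The function name `show_kernel_wait` and its optional root-directory argument are hypothetical.

```shell
#!/bin/sh
# Print what the kernel exposes about a task without attaching to it.
# Usage: show_kernel_wait <pid> [proc_root]
# proc_root is a hypothetical override for testing; it defaults to /proc.
show_kernel_wait() {
    pid="$1"
    proc_root="${2:-/proc}"
    # wchan:   kernel symbol the task is sleeping in.
    # syscall: number and arguments of the system call in progress.
    # stack:   kernel stack trace (usually root only).
    for name in wchan syscall stack; do
        file="$proc_root/$pid/$name"
        printf '== %s ==\n' "$name"
        if [ -r "$file" ]; then
            cat "$file" 2>/dev/null
            # wchan lacks a trailing newline, so always end the section.
            echo
        else
            echo "(not readable)"
        fi
    done
}
```

For the process above this would be run as `show_kernel_wait 509`.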
- 2016-02-03T18:57:11+00:00
Tildeslash repo owner
- changed status to open
- 2016-02-04T11:03:26+00:00
Tildeslash repo owner
It seems that it could be an LXC/kernel bug: monit probably hangs in some system call.

There were fixes for signal handling in monit 5.15 and a fix for LXC uptime in 5.16. I recommend upgrading to monit 5.16, which was released today, to rule out problems that have already been fixed.

Then start monit in debug mode using the "-v" option. Monit will log more details about every operation it performs, so we'll see what happened just before the hang.
- 2016-02-04T11:12:30+00:00
Tildeslash repo owner
- assigned issue to Tildeslash
- 2016-02-05T17:40:10+00:00
Tildeslash repo owner
Thanks for the data. It seems that the problem is in the FUSE driver (not a monit bug): monit just performs a read, and the read appears to get stuck in the driver.

The debug mode will help to trace which read triggers the issue.
- 2016-02-08T10:04:53+00:00
Tildeslash repo owner
I think it should be possible to trigger the problem without involving monit. The following script collects data from the /proc filesystem every 5 seconds, similarly to monit. Can you try running it and see whether it gets stuck as well?
```
# Read the same /proc files monit polls, every 5 seconds.
while true; do
    cat /proc/meminfo /proc/stat /proc/[1-9]*/stat /proc/[1-9]*/status > /dev/null
    sleep 5
done
```
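
If the loop above does hang, it will not show which file caused it. A possible variant, sketched here under assumptions, reads the files one at a time and records each path just before the read, so that after a hang the log holds the path that blocked. The function name `probe_proc_once`, its arguments and the default log location are all hypothetical.

```shell
#!/bin/sh
# Read the same /proc files one at a time, recording each path just
# before the read, so the last logged path is the one that blocked.
# proc_root and log_file are hypothetical parameters for this sketch.
probe_proc_once() {
    proc_root="${1:-/proc}"
    log_file="${2:-/tmp/proc-probe.log}"
    for f in "$proc_root"/meminfo "$proc_root"/stat \
             "$proc_root"/[1-9]*/stat "$proc_root"/[1-9]*/status; do
        # Skip unmatched globs and processes that have already exited.
        [ -e "$f" ] || continue
        # Overwrite the log so it always holds only the current path.
        printf '%s reading %s\n' "$(date '+%F %T')" "$f" > "$log_file"
        cat "$f" > /dev/null 2>&1
    done
    printf '%s pass complete\n' "$(date '+%F %T')" > "$log_file"
}
```

It can be driven the same way as the original loop: `while true; do probe_proc_once; sleep 5; done`.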
- 2016-02-12T09:31:40+00:00
Tildeslash repo owner
- changed status to closed
Not a monit bug; this is a FUSE driver issue.

The problem could be related to the following FUSE CVE: http://people.canonical.com/~ubuntu-security/cve/2015/CVE-2015-8785.html
- 2016-02-23T08:22:02+00:00
Tildeslash repo owner
- removed version
Removing version: 5.14 (automated comment)
- 2016-06-19T18:47:48+00:00

Assignee: Tildeslash

Type: bug

Priority: blocker

Status: closed

Component: Monit

Version: –

Votes: 0

Watchers: 1