Monit (5.27.1) fails to test the connection to the clamd socket (possibly because the process uptime is missing) inside a Debian 10 LXD container
I am running monit (5.27.1) on Debian 10 in an unprivileged LXD container.
After several days of operation, the clamav-daemon socket communication test reported an error:
failed protocol test [CLAMAV] at /var/run/clamav/clamd.ctl -- CLAMAV: PONG read error -- Resource temporarily unavailable
In this situation monit restarted clamav-daemon, but it still did not register the restoration of communication, even though the socket responded correctly.
Monit status shows (today's example):
Process 'clamav-daemon'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
pid 2441018
parent pid 1
uid 106
effective uid 106
gid 109
uptime -
threads 2
children 0
cpu 0.0%
cpu total 0.0%
memory 19.0% [1.1 GB]
memory total 19.0% [1.1 GB]
security attribute unconfined
filedescriptors 9 [0.9% of 1024 limit]
total filedescriptors 9
read bytes 0.6 B/s [1.2 GB total]
disk read bytes 0 B/s [127.9 MB total]
disk read operations 0.0 reads/s [79728 reads total]
write bytes 2.7 B/s [1.1 MB total]
disk write bytes 34.1 B/s [3.9 MB total]
disk write operations 0.0 writes/s [2731 writes total]
unix socket response time -
data collected Wed, 24 Mar 2021 17:01:18
Notice the missing values for the process uptime and the unix socket response time.
When running monit with the "-vvI" parameters, it logs that the connection test is paused while waiting for the process to start:
'clamav-daemon' process is running with pid 2441018
'clamav-daemon' zombie check succeeded
'clamav-daemon' connection test paused for 30 s while the process is starting
This message repeats every 120 seconds (the configured daemon poll interval).
It looks as if monit could not determine how much time had passed since the service was started, and kept postponing the socket connection test.
I can get the system uptime from /proc/uptime:
# cat /proc/uptime
12279771.42 12157565.71
The process stats are also shown correctly:
# cat /proc/2441018/stat
2441018 (clamd) S 1 2441018 2441018 0 -1 4194368 904957 0 10 0 2672 155 0 0 20 0 3 0 1254872419 1453367296 277755 18446744073709551615 94010525196288 94010525304089 140724476752176 0 0 0 2146542132 0 22531 0 0 0 17 1 0 0 0 0 0 94010525374832 94010525401136 94010546954240 140724476753893 140724476753963 140724476753963 140724476755944 0
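For context on how such an uptime could be derived: field 22 of /proc/&lt;pid&gt;/stat ("starttime") is the process start time in clock ticks since boot, and the first field of /proc/uptime is the system uptime in seconds. A minimal sketch using the exact values quoted above (and assuming the common CLK_TCK of 100) shows the computed process uptime comes out negative, i.e. the kernel-reported start time lies in the container's "future" — which would explain why monit keeps treating the process as still starting:

```python
# Sketch of a monit-style process uptime calculation, using the values
# from this report. CLK_TCK = 100 is an assumption (see sysconf(_SC_CLK_TCK)).

CLK_TCK = 100

system_uptime = 12279771.42   # first field of the container's /proc/uptime
starttime_ticks = 1254872419  # field 22 of /proc/2441018/stat

process_uptime = system_uptime - starttime_ticks / CLK_TCK
print(round(process_uptime))  # -268953, i.e. about -3.1 days
```

The negative result is presumably the symptom: starttime is counted in ticks since the host kernel booted, while /proc/uptime inside the container is virtualized by lxcfs, so the two values are not on the same clock.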
Unfortunately, this situation recurs from time to time, and each time the message about the restored socket connection appears about 5 days after the problem was detected.
Other processes configured in monit show proper uptime values.
Check configuration:
check process clamav-daemon with pidfile /var/run/clamav/clamd.pid
start program = "/etc/init.d/clamav-daemon start"
stop program = "/etc/init.d/clamav-daemon stop"
if failed unixsocket /var/run/clamav/clamd.ctl protocol clamav then restart
if 3 restarts for 3 cycles then alert
alert admin@example.com
Comments (4)
reporter:
Hello,
I have attached monit.log.
I have noticed that there is a difference between the uptime shown by monit status and by the uptime command:
System 'smtp-server'
status OK
monitoring status Monitored
monitoring mode active
on reboot start
load average [0.96] [1.62] [1.95]
cpu 0.1%usr 0.0%sys 0.0%nice 0.0%iowait 0.0%hardirq 0.0%softirq 0.0%steal 0.0%guest 0.0%guestnice
memory usage 1.7 GB [30.9%]
swap usage 0 B [0.0%]
uptime 148d 14h 14m
boot time Wed, 28 Oct 2020 17:33:16
filedescriptors 81088 [0.0% of 9223372036854775807 limit]
data collected Fri, 26 Mar 2021 07:47:52

root@smtp-server:~# uptime
07:47:56 up 143 days, 17:47, 1 user, load average: 0,89, 1,60, 1,94
There is a 5-day difference, which may be why, after a socket connection test failure, the message about connection test success almost always appears 5 days later.
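The gap can be checked with a quick calculation (a sketch using only the timestamps quoted above): monit's reported boot time plus the interval to the data-collection timestamp reproduces monit's "148d 14h 14m" uptime exactly, while the boot time implied by the uptime command is almost 5 days later:

```python
from datetime import datetime, timedelta

# monit's view: "boot time" and "data collected" from monit status
monit_boot = datetime(2020, 10, 28, 17, 33, 16)
collected = datetime(2021, 3, 26, 7, 47, 52)
print(collected - monit_boot)  # 148 days, 14:14:36 -> monit's "148d 14h 14m"

# the container's view: `uptime` said "up 143 days, 17:47" at 07:47:56
now = datetime(2021, 3, 26, 7, 47, 56)
container_boot = now - timedelta(days=143, hours=17, minutes=47)
print(container_boot - monit_boot)  # monit's boot time is almost 5 days too early
```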
Additional information:
- We do not use snap for LXD. We build the LXD package from sources. Currently we use LXD 4.0.4-3.
- We also build the lxcfs package from the original sources (without any changes). The LXCFS version we currently use is 4.0.6-2.
reporter:
attached monit.log
repo owner (changed status to resolved):
Thanks for the data. It is indeed related to the boot time; the problem was already fixed in monit 5.27.2. Snippet from the changelog:
Fixed: LXC container: Monit may ignore the "start delay" option of the "set daemon" statement when the container was rebooted, while the host was not rebooted (the LXC container's boot time is not virtualized - it is inherited from the host).
repo owner:
Hello Andrzej,
please can you attach a monit log? (You can enable it via the "set logfile" statement if it is not present already.)
Checking the source code, it is possible this is an LXD bug: it seems it may report a boot time in the future compared to the current time (hence the uptime is still in the initializing state).