CPU usage at 100% and Rspamd 'I/O' timeout
Does anyone know what would be causing this?
We had an issue today where the queue seemed to get stuck and mail struggled to deliver for 3 hours. CPU usage was at 100% with about 30 mails in the queue (some were still delivering, but some seemed to be stuck). I restarted the container twice with only a small effect; pretty soon it jammed up again. Then I cleared about 5 of the emails, one by one (all large, with 20 MB+ attachments), and the rest started to deliver; CPU usage went back to normal within 10 minutes. Any ideas? How can I make sure this doesn't happen again?
Comments (24)
-
reporter Port 25 and IPv4 on the diagnostics page were also intermittently showing as red, with a warning something like "port 25 connected but no message text received" (or similar).
-
Some kind of attack against your server? What do the Poste.io logs show during that timeframe... lots of invalid connections?
-
reporter The auth log has many entries like this in …
Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:06:18 mail dovecot: lmtp(15254): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18851): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(13850): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18360): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18469): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18689): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19899): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19900): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19902): Connect from 127.0.0.1
Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:13:27 mail dovecot: lmtp: Error:
Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:13:27 mail dovecot: lmtp(<<email address>>): Error: Sendmail process terminated abnormally, exit status 75
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): aCoZO4tE5Fu4RwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 133 secs)
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: YGUXLXRP5Fs1fAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile$
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: msgid=<<MSGIDHERE>>: failed to redirect message to <<email address>>…
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile /data/dom$
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was overridden (kept it 102 secs)
Nov 8 15:01:01 mail dovecot: lmtp(29840): Disconnect from 127.0.0.1: Connection closed (in DATA finished)
Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 15:01:01 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 285 secs)
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 272 secs)
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): mLeeOZ1T5FuLBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'
-
The graphs might give you an indication. If not, check the log file sizes; that might hint at whether you're seeing more inbound traffic than normal.
-
Which version are you on? The latest is v2.0.19.
-
reporter Latest Pro
-
reporter Today was an 'average' day server-wise; nothing out of the ordinary in the stats.
-
repo owner Do you have the emails which caused this? From your description it seems that the email is internally processed with sieve and then the SMTP communication fails.
Logs which might help:
- log/s6/haraka-smtp/* - there might be an indication of exceptional SMTP communication around that time; I am almost sure the primary cause will be logged there
- log/delivery/del/* - if there is a "rotating" email in the queue, it should generate a large log - it is easy to find the largest file with something like
find / -type f -printf "%s\t%p\n" | sort -n | tail -10
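A hedged variant of the command above: scanning the whole filesystem from / is slow, so restricting the search to one directory is usually enough. LOGDIR below is an assumption - point it at the Poste.io data mount (e.g. the log/delivery directory) on your host.

```shell
# List the 10 largest files under a single directory instead of the whole
# filesystem. Requires GNU find (for -printf), as used in the command above.
LOGDIR="${LOGDIR:-/var/log}"
find "$LOGDIR" -type f -printf "%s\t%p\n" 2>/dev/null | sort -n | tail -10
```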
-
Is this still an issue? Do you have the logs SH suggested?
-
reporter Thank you both for the info above. It seems to be OK today, but I had a very busy day and haven't had time to check the suggestions above yet. I will take a more in-depth look over the weekend and update.
-
reporter I will close this pending the outcome of my investigation. It looks like a client (Outlook 2016) was stuck in a forwarding loop for a 23 MB email.
-
reporter Thank you both kindly for your help.
-
reporter - changed status to resolved
-
reporter - changed status to open
OK - this looks like it's the Redis DB that is causing this. The same issue has just happened again. The server was delivering multiple emails to people again, and the following error was coming up in the Rspamd error log -
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Redis folder inside /data/redis/
Restarting the server (uptime was 6 days) seemed to fix this, but the first login to the admin area gave a 500 error for a couple of minutes (didn't get a screenshot) saying the same kind of thing as the Rspamd error above. After a couple of minutes it then seemed to log in and work.
After this it seems to work normally, but now gives an error saying:
Is this a resource issue? The server has 3.5 GB of memory (900 MB currently free; 650 MB at the time of the errors, before rebooting) and is a dedicated (not burstable) VM on Azure (DSv1).
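For context (my explanation, not from the thread): the MISCONF error means Redis's background RDB save (BGSAVE) failed, and with stop-writes-on-bgsave-error enabled (the default) Redis then refuses all writes until a save succeeds. A hedged diagnostic sketch, assuming redis-cli is available inside the container; if it is not on the PATH, the script falls back to a sample of what the relevant fields look like so the grep still illustrates the idea:

```shell
# Query Redis persistence state to see why writes are being refused.
info="$(redis-cli info persistence 2>/dev/null || true)"
# Fallback sample (an assumption for illustration) when redis-cli is unavailable.
[ -n "$info" ] || info="rdb_last_bgsave_status:err
rdb_changes_since_last_save:10000"
printf '%s\n' "$info" | grep -E 'rdb_last_bgsave_status|rdb_changes_since_last_save'
```

A value of rdb_last_bgsave_status:err confirms the failed snapshot; the Redis log (here under /data/redis/) should say why it failed.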
-
reporter Redis log files
Could this be fixed - and more importantly, should it be fixed - using: https://stackoverflow.com/questions/44800633/how-to-disable-transparent-huge-pages-thp-in-ubuntu-16-04lts
Are there too many connected users?
Could this be the problem? https://www.techandme.se/performance-tips-for-redis-cache-server/
-
reporter S H - in response to your earlier comment, I checked the logs in log/s6/haraka-smtp/* and there was nothing out of the ordinary.
-
reporter Open rc.local (the file may not exist on Ubuntu 18)
nano /etc/rc.local
Edit the file to include this, or copy it all and make a new file (Ubuntu 18):
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
sysctl -w net.core.somaxconn=65535
exit 0
Set permissions
chmod +x /etc/rc.local
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then run:
sudo sysctl -p /etc/sysctl.conf
Reboot
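To confirm the tweaks took effect after the reboot, a hedged check (Linux-only; it reads the same kernel knobs the rc.local script and sysctl.conf entry write):

```shell
# Show the current THP mode - the active value is shown in [brackets], so
# after the rc.local change it should read "[never]". The file may not exist
# on every kernel, hence the guard.
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
  cat /sys/kernel/mm/transparent_hugepage/enabled
fi
# Show the overcommit mode - should print 1 after the sysctl.conf change.
cat /proc/sys/vm/overcommit_memory
```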
-
reporter Apparently Redis can appear to need far more memory than it actually uses under higher loads and for larger databases: the background save forks the process, and without vm.overcommit_memory=1 the kernel may refuse the fork even though copy-on-write means little extra memory is really touched. This means the save can fail even if there is a lot of memory remaining free.
This has stopped all the errors in the log file, and improved server performance.
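A hedged alternative some admins use (an assumption on my part, not something verified in this thread) is to tell Redis to keep accepting writes even when a background save fails, via redis.conf; the trade-off is that failing snapshots then go unnoticed and data written since the last good save can be lost on a crash:

```
# redis.conf (its location inside the Poste.io image is an assumption)
# The default is "yes", which is what produces the MISCONF write refusal above.
stop-writes-on-bgsave-error no
```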
-
repo owner Please update to the latest - 2.0.20 is already in the PRO repository; the FREE image is currently building.
I am afraid (and also hopeful) that you've hit the same issue as I did recently - when internal LMTP delivery fails (for whatever reason) it can cause Haraka to loop over the queued email and scan it forever. The log line
10000 changes in 60 seconds
might suggest that there is some looping happening.
-
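For context (general Redis behaviour, not specific to this thread): "10000 changes in 60 seconds" matches the last of Redis's default RDB snapshot rules, so seeing it fire constantly means something is rewriting at least 10000 keys per minute - consistent with a looping delivery being rescanned over and over:

```
# Default redis.conf snapshot rules: "save <seconds> <changes>".
# The log line "10000 changes in 60 seconds. Saving..." is the third rule firing.
save 900 1
save 300 10
save 60 10000
```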
reporter Thank you SH, I will update this evening. You can close this for now if you like.
-
reporter This looks like it's worked - server performance is even better. Thank you.
-
reporter - changed status to resolved
-
You can see where we had the issues: