CPU usage at 100% and Rspamd 'I/O' timeout
Does anyone know what would be causing this?
We had an issue today where the queue seemed to get stuck and mail struggled to deliver for 3 hours. CPU usage was at 100% with about 30 mails in the queue (some were still delivering, but some seemed to be stuck). I restarted the container twice with only a small effect; pretty soon it jammed up again. Then I cleared about 5 of the emails, one by one (all large, with 20 MB+ attachments), and the rest started to deliver; CPU usage went back to normal within 10 minutes. Any ideas? How can I make sure this doesn't happen again?
Comments (24)
-
reporter Port 25 and IPv4 on the diagnostics page were also intermittently showing as red, with a warning something like "port 25 connected but no message text received" (or similar).
-
Some kind of attack against your server? What do the Poste.io logs show during that timeframe... lots of invalid connections?
-
reporter The auth log has many entries like this in …
Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:06:18 mail dovecot: lmtp(15254): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18851): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(13850): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18360): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18469): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(18689): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19899): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19900): Connect from 127.0.0.1
Nov 8 14:06:18 mail dovecot: lmtp(19902): Connect from 127.0.0.1
Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:13:27 mail dovecot: lmtp: Error:
Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 14:13:27 mail dovecot: lmtp(<<email address>>): Error: Sendmail process terminated abnormally, exit status 75
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): aCoZO4tE5Fu4RwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 133 secs)
Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: YGUXLXRP5Fs1fAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile$
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: msgid=<<MSGIDHERE>>: failed to redirect message to <<email address>>…
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile /data/dom$
Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was overridden (kept it 102 secs)
Nov 8 15:01:01 mail dovecot: lmtp(29840): Disconnect from 127.0.0.1: Connection closed (in DATA finished)
Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)
Nov 8 15:01:01 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 285 secs)
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 272 secs)
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory
Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): mLeeOZ1T5FuLBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'
-
The graphs might give you an indication. If not, check the log file sizes; that might hint at whether you're seeing more inbound traffic than normal.
-
Which version are you on? The latest is v2.0.19.
-
reporter Latest Pro
-
reporter Today was an 'average' day server-wise; nothing out of the ordinary in the stats.
-
repo owner Do you have the emails which caused this? From your description it seems that the email is internally processed with sieve and then the SMTP communication fails.
Logs which might help:
- log/s6/haraka-smtp/* - there might be an indication of exceptional SMTP communication around that time; I am almost sure the primary cause will be logged there
- log/delivery/del/* - if there is a "rotating" email in the queue, it should generate a large log - it is easy to find the largest file with something like
find / -type f -printf "%s\t%p\n" | sort -n | tail -10
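A hedged variant of the command above: scanning the whole filesystem from / is slow, so restricting the search to one directory is usually enough. LOGDIR below is an assumption - point it at the Poste.io data mount (e.g. the log/delivery directory) on your host.

```shell
# List the 10 largest files under a single directory instead of the whole
# filesystem. Requires GNU find (for -printf), as used in the command above.
LOGDIR="${LOGDIR:-/var/log}"
find "$LOGDIR" -type f -printf "%s\t%p\n" 2>/dev/null | sort -n | tail -10
```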
-
Is this still an issue? Do you have the logs SH suggested?
-
reporter Thank you both for the info above. It seems to be OK today, but I had a very busy day and haven't had time to check the suggestions above yet. I will take a more in-depth look over the weekend and update.
-
reporter I will close this pending the outcome of my investigation. It looks like a client (Outlook 2016) was stuck in a forwarding loop for a 23 MB email.
-
reporter Thank you both kindly for your help.
-
reporter - changed status to resolved
-
reporter - changed status to open
OK - this looks like it's the Redis DB that is causing this. The same issue has just happened again. The server was delivering multiple emails to people again, and the following error was coming up in the Rspamd error log -
MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
Redis folder inside /data/redis/
Restarting the server (uptime was 6 days) seemed to fix this, but the first login to the admin area gave a 500 error for a couple of minutes (didn't get a screenshot) saying the same kind of thing as the Rspamd error above. After a couple of minutes it then seemed to log in and work.
After this it seems to work normally, but now gives an error saying:
Is this a resource issue? The server has 3.5 GB of memory (900 MB currently free; 650 MB at the time of the errors, before rebooting) and is a dedicated (not burstable) VM on Azure (DSv1).
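For context (my explanation, not from the thread): the MISCONF error means Redis's background RDB save (BGSAVE) failed, and with stop-writes-on-bgsave-error enabled (the default) Redis then refuses all writes until a save succeeds. A hedged diagnostic sketch, assuming redis-cli is available inside the container; if it is not on the PATH, the script falls back to a sample of what the relevant fields look like so the grep still illustrates the idea:

```shell
# Query Redis persistence state to see why writes are being refused.
info="$(redis-cli info persistence 2>/dev/null || true)"
# Fallback sample (an assumption for illustration) when redis-cli is unavailable.
[ -n "$info" ] || info="rdb_last_bgsave_status:err
rdb_changes_since_last_save:10000"
printf '%s\n' "$info" | grep -E 'rdb_last_bgsave_status|rdb_changes_since_last_save'
```

A value of rdb_last_bgsave_status:err confirms the failed snapshot; the Redis log (here under /data/redis/) should say why it failed.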
-
reporter Redis log files
Could this be fixed - and more importantly, should it be fixed - using: https://stackoverflow.com/questions/44800633/how-to-disable-transparent-huge-pages-thp-in-ubuntu-16-04lts
Are there too many connected users?
Could this be the problem? https://www.techandme.se/performance-tips-for-redis-cache-server/
-
reporter S H - in response to your earlier comment, I checked the logs in log/s6/haraka-smtp/* and there was nothing out of the ordinary.
-
reporter Open rc.local (the file may not exist on Ubuntu 18)
nano /etc/rc.local
Edit the file to include this, or copy it all and make a new file (Ubuntu 18):
#!/bin/sh -e
#
# rc.local
#
# This script is executed at the end of each multiuser runlevel.
# Make sure that the script will "exit 0" on success or any other
# value on error.
#
# In order to enable or disable this script just change the execution
# bits.
#
# By default this script does nothing.
if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
  echo never > /sys/kernel/mm/transparent_hugepage/enabled
fi
sysctl -w net.core.somaxconn=65535
exit 0
Set permissions
chmod +x /etc/rc.local
Modify /etc/sysctl.conf and add:
vm.overcommit_memory=1
Then run:
sudo sysctl -p /etc/sysctl.conf
Reboot
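To confirm the tweaks took effect after the reboot, a hedged check (Linux-only; it reads the same kernel knobs the rc.local script and sysctl.conf entry write):

```shell
# Show the current THP mode - the active value is shown in [brackets], so
# after the rc.local change it should read "[never]". The file may not exist
# on every kernel, hence the guard.
if [ -f /sys/kernel/mm/transparent_hugepage/enabled ]; then
  cat /sys/kernel/mm/transparent_hugepage/enabled
fi
# Show the overcommit mode - should print 1 after the sysctl.conf change.
cat /proc/sys/vm/overcommit_memory
```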
-
reporter Apparently Redis can appear to need far more memory than it actually uses under higher loads and for larger databases: the background save forks the process, and without vm.overcommit_memory=1 the kernel may refuse the fork even though copy-on-write means little extra memory is really touched. This means the save can fail even if there is a lot of memory remaining free.
This has stopped all the errors in the log file, and improved server performance.
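A hedged alternative some admins use (an assumption on my part, not something verified in this thread) is to tell Redis to keep accepting writes even when a background save fails, via redis.conf; the trade-off is that failing snapshots then go unnoticed and data written since the last good save can be lost on a crash:

```
# redis.conf (its location inside the Poste.io image is an assumption)
# The default is "yes", which is what produces the MISCONF write refusal above.
stop-writes-on-bgsave-error no
```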
-
repo owner Please update to the latest - 2.0.20 is already in the PRO repository; the FREE image is currently building.
I am afraid (and also hopeful) that you've hit the same issue as I did recently - when internal LMTP delivery fails (for whatever reason) it can cause Haraka to loop over the queued email and scan it forever. The log line
10000 changes in 60 seconds
might suggest that there is some looping happening.
-
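For context (general Redis behaviour, not specific to this thread): "10000 changes in 60 seconds" matches the last of Redis's default RDB snapshot rules, so seeing it fire constantly means something is rewriting at least 10000 keys per minute - consistent with a looping delivery being rescanned over and over:

```
# Default redis.conf snapshot rules: "save <seconds> <changes>".
# The log line "10000 changes in 60 seconds. Saving..." is the third rule firing.
save 900 1
save 300 10
save 60 10000
```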
reporter Thank you SH, I will update this evening. You can close this for now if you like.
-
reporter This looks like it's worked - server performance is even better. Thank you.
-
reporter - changed status to resolved
-
You can see where we had the issues: