CPU usage 100% and RSPAMD 'I/O' timeout

Issue #502 resolved
SB created an issue

Does anyone know what would be causing this?

We had an issue today where the queue seemed to get stuck and mail struggled to deliver for about 3 hours. CPU usage was at 100% with about 30 mails in the queue (some were still delivering, but some seemed to be stuck). I restarted the container twice with only a small effect; pretty soon it jammed up again. Then I cleared about five of the emails, one by one (all large, with 20 MB+ attachments), and the rest started to deliver; CPU usage went back to normal within 10 minutes. Any ideas? How can I make sure this doesn't happen again?
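
A quick way to spot oversized stuck messages like these is to list the largest files under the container's data directory (the /data path is an assumption; adjust it to your install):

    # List the ten largest files under the poste.io data directory;
    # a jammed 20 MB+ message should stand out near the bottom
    find /data -type f -printf "%s\t%p\n" | sort -n | tail -10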

[Attachment: lookup timeout image.PNG]

Comments (24)

  1. SB reporter

    Port 25 and IPv4 on the diagnostics page were also intermittently showing as red, with a warning along the lines of 'port 25 connected but no message text received' (or something similar).

  2. Scott MacDonald

    Some kind of attack against your server? What do Poste.io logs show during the timeframe... lots of invalid connections?

  3. SB reporter

    The auth log has many entries like the following …


    Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused

    Nov 8 14:03:14 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)


    Nov 8 14:06:18 mail dovecot: lmtp(15254): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(18851): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(13850): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(18360): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(18469): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(18689): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(19899): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(19900): Connect from 127.0.0.1

    Nov 8 14:06:18 mail dovecot: lmtp(19902): Connect from 127.0.0.1


    Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused

    Nov 8 14:13:27 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused

    Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)

    Nov 8 14:13:27 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)

    Nov 8 14:13:27 mail dovecot: lmtp(<<email address>>): Error: Sendmail process terminated abnormally, exit status 75


    Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): aCoZO4tE5Fu4RwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$

    Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory

    Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 133 secs)

    Nov 8 14:15:59 mail dovecot: lmtp(<<email address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory


    Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: YGUXLXRP5Fs1fAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile$

    Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: msgid=<<MSGIDHERE>>: failed to redirect message to <<email address>>…

    Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Error: aNT3NDZP5FuQdAAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve was aborted due to temporary failure (user logfile /data/dom$

    Nov 8 15:01:01 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was overridden (kept it 102 secs)

    Nov 8 15:01:01 mail dovecot: lmtp(29840): Disconnect from 127.0.0.1: Connection closed (in DATA finished)

    Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: cannot connect to 127.0.0.1, port 25: Connection refused

    Nov 8 15:01:01 mail dovecot: lmtp: Error: sendmail: could not send mail (account default from /etc/msmtprc)

    Nov 8 15:01:01 mail dovecot: lmtp: Error: cannot connect to 127.0.0.1, port 25: Connection refused


    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 285 secs)

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): SEa6OZ1T5FuMBwAASs3S/g: sieve: Execution of script /data/domains/<<domain>>/<<address>>/.dovecot.sieve failed, but implicit keep was successful (user logfile /data/domain$

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: rename(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock, /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Warning: Our dotlock file /data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes.lock was deleted (kept it 272 secs)

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): Error: file_dotlock_replace(/data/domains/<<domain>>/<<address>>/.dovecot.lda-dupes) failed: No such file or directory

    Nov 8 15:22:35 mail dovecot: lmtp(<<address>>): mLeeOZ1T5FuLBwAASs3S/g: sieve: msgid=<<MSGIDHERE>>: stored mail into mailbox 'INBOX'

  4. Scott MacDonald

    The graphs might give you an indication. If not, check the log file sizes; that might hint at whether you're seeing more inbound traffic than normal.
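
    Something like this, run from the Poste.io data directory, shows which logs are growing fastest (the relative log/ path follows the layout referenced in the next comment):

    # Per-directory log sizes, sorted with the largest last
    du -sh log/* 2>/dev/null | sort -h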

  5. SH repo owner

    Do you have the emails which caused this? From your description it seems that the email is internally processed with sieve and then the SMTP communication fails.

    Logs which might help:

    • log/s6/haraka-smtp/* - there might be an indication of exceptional SMTP communication around that time; I am almost sure the primary cause will be logged there
    • log/delivery/del/* - if there is a "rotating" email in the queue, it should generate a large log; it is easy to find the largest files with something like find / -type f -printf "%s\t%p\n" | sort -n | tail -10 (see the snippet below)
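
    The same one-liner works scoped to the log directory rather than the whole filesystem (the /data/log prefix is an assumption based on the relative paths above):

    # Ten largest log files; a "rotating" message stuck in the queue
    # usually shows up as one unusually large delivery log
    find /data/log -type f -printf "%s\t%p\n" | sort -n | tail -10
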
  6. SB reporter

    Thank you both for the info above. It seems to be OK today, but I had a very busy day and have not had time to check the suggestions above yet. I will take a more in-depth look over the weekend and update.

  7. SB reporter

    I will close this pending the outcome of my investigation. It looks like a client (Outlook 2016) was stuck in a forwarding loop for a 23 MB email.

  8. SB reporter
    • changed status to open

    OK - this looks like it's the Redis DB which is causing this. The same issue has just happened again. The server was delivering multiple emails to people again, and the following error was coming up in the Rspamd error log:


    MISCONF Redis is configured to save RDB snapshots, but is currently not able to persist on disk. Commands that may modify the data set are disabled. Please check Redis logs for details about the error.
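
    If it happens again, the state of the failing background save can be inspected from inside the container with redis-cli (assuming Redis listens on its default port; a diagnostic sketch, not a fix):

    # Did the last RDB background save succeed, and how many writes are pending?
    redis-cli INFO persistence | grep -E 'rdb_last_bgsave_status|rdb_changes_since_last_save'

    # Temporarily accept writes again while investigating (masks the problem)
    redis-cli CONFIG SET stop-writes-on-bgsave-error no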


    Redis folder inside /data/redis/

    [Attachment: redis.PNG]


    Restarting the server (uptime was 6 days) seemed to fix this, but the first login to the admin area gave a 500 error for a couple of minutes (didn't get a screenshot) saying the same kind of thing as the following Rspamd error. After a couple of minutes it let me log in and worked.

    [Attachment: rspand errors.PNG]

    After this it seems to work normally, but it now gives an error saying:

    [Attachment: rspamd error 2.PNG]


    Is this a resource issue? The server has 3.5 GB of memory (900 MB currently free; 650 MB at the time of the errors, before rebooting) and is a dedicated (not burstable) VM on Azure (DSv1).
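
    For reference, the two host-side values most relevant to that question can be read directly:

    free -m                              # actual free/available memory
    cat /proc/sys/vm/overcommit_memory   # 0 = heuristic (default), 1 = always allow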

  9. SB reporter

    S H - in response to your earlier comment, I checked the logs here: log/s6/haraka-smtp/* and there was nothing out of the ordinary.

  10. SB reporter

    Open rc.local (the file may not exist on Ubuntu 18):

    nano /etc/rc.local
    

    Edit the file to include this, or copy it all and create a new file (Ubuntu 18):

    #!/bin/sh -e
    #
    # rc.local
    #
    # This script is executed at the end of each multiuser runlevel.
    # Make sure that the script will "exit 0" on success or any other
    # value on error.
    #
    # In order to enable or disable this script just change the execution
    # bits.
    #
    # By default this script does nothing.
    # Disable transparent huge pages - Redis warns that THP causes
    # latency and memory issues (it inflates fork/copy-on-write cost)
    if test -f /sys/kernel/mm/transparent_hugepage/enabled; then
       echo never > /sys/kernel/mm/transparent_hugepage/enabled
    fi
    # Raise the TCP listen backlog limit - Redis warns when somaxconn is low
    sysctl -w net.core.somaxconn=65535
    exit 0
    

    Set permissions

    chmod +x /etc/rc.local
    

    Modify /etc/sysctl.conf and add:

    # Allow Redis to fork for background saves even when memory looks tight
    vm.overcommit_memory=1
    

    Then run:

    sudo sysctl -p /etc/sysctl.conf
    

    Reboot
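
    After the reboot, it is worth verifying that the settings took effect:

    cat /sys/kernel/mm/transparent_hugepage/enabled   # should print [never]
    sysctl net.core.somaxconn vm.overcommit_memory    # expect 65535 and 1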

  11. SB reporter

    Apparently it's not that Redis needs that much extra memory: when it forks a child process to save an RDB snapshot, the default Linux overcommit heuristic assumes the child may need as much memory as the parent, so the fork (and with it the save) can fail even when a lot of memory is still free. vm.overcommit_memory=1 tells the kernel to allow the fork anyway.

    This has stopped all the errors in the log file, and improved server performance.
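
    To see how much memory Redis is actually holding (and therefore how large its fork looks to the kernel under the default heuristic), something like this can be run inside the container:

    # Logical vs. resident memory used by Redis; under heuristic
    # overcommit the fork's apparent requirement is roughly the RSS
    redis-cli INFO memory | grep -E 'used_memory_human|used_memory_rss_human'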

  12. SH repo owner

    Please update to the latest version - 2.0.20 is already in the PRO repository; the FREE image is currently building.

    I am afraid (and hopeful also) that you've hit the same issue as me lately: when internal LMTP delivery fails (for whatever reason), it can cause Haraka to loop over a queued email and scan it forever. The log sentence "10000 changes in 60 seconds" might suggest that there is some looping happening.
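
    (That sentence matches one of Redis's stock RDB snapshot triggers, save 60 10000, i.e. save when 10,000 keys change within 60 seconds; a looping message being rescanned would keep firing it. The configured triggers can be listed with:)

    # Show the RDB snapshot trigger rules, e.g. "900 1 300 10 60 10000"
    redis-cli CONFIG GET save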
