piler-smtp segmentation fault

Issue #1067 resolved
Greg Wurst created an issue

New install of Piler 1.3.8 on Ubuntu 18.04 from last week. Everything was working fine, but I checked the system today and there has been no new mail since the 20th. I restarted the server and fortunately Google sent the queued mail to the server, so it doesn’t look like we missed any. Here’s what I found in the logs:

piler -V

piler 1.3.8, build 998, Janos SUTO <sj@acts.hu>

Build Date: Wed Apr 15 23:47:58 UTC 2020
ldd version: ldd (Ubuntu GLIBC 2.27-3ubuntu1) 2.27
gcc version: gcc version 7.5.0 (Ubuntu 7.5.0-3ubuntu1~18.04)
OS: Linux piler 4.15.0-96-generic #97-Ubuntu SMP Wed Apr 1 03:25:46 UTC 2020 x86_64 x86_64 x86_64 GNU/Linux
Configure command: ./configure --localstatedir=/var --with-database=mysql --enable-tcpwrappers
MySQL client library version: 5.7.29
Extractors: /usr/bin/pdftotext /usr/bin/catdoc /usr/bin/catppt /usr/bin/xls2csv /usr/local/bin/ppthtml /usr/bin/unrtf /usr/bin/tnef libzip

mail.log and syslog both had the same entries:

Apr 20 18:03:26 piler piler-smtp[1682]: connected from 209.85.216.70:41034 on fd=6 (active connections: 1)
Apr 20 18:04:28 piler piler-smtp[1682]: received: RKI3205CTJFFKI3X, from=delivery@mailer.njbiz.com, size=45829, client=209.85.216.70, fd=6, fsync=60625140
Apr 20 18:04:28 piler piler-smtp[1682]: client 209.85.216.70 timeout, lasttime: 1587405807
Apr 20 18:04:28 piler piler-smtp[1682]: disconnected from 209.85.216.70 on fd=6 (0 active connections)
Apr 20 18:04:28 piler piler-smtp[1682]: got signal: 11, Segmentation fault
Apr 20 18:04:28 piler piler-smtp[1682]: piler-smtp has been terminated
Apr 20 18:04:29 piler piler[1686]: 0/RKI3205CTJFFKI3X: 400000005e9de43701a5361400f3515ad198, size=45829/9080, attachments=0, reference=, message-id=<XYCXCVCBX$UYBAXTYBTWVVW@mail-out.via.act-on.net>, retention=2557, folder=0, delay=0.0278, status=stored

kern.log (only thing near that time, probably not relevant)

Apr 21 18:14:03 piler kernel: [496875.374008] audit: type=1400 audit(1587492843.098:29): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/core/9066/usr/lib/snapd/snap-confine" pid=2424 comm="apparmor_parser"
Apr 21 18:14:03 piler kernel: [496875.378979] audit: type=1400 audit(1587492843.102:30): apparmor="STATUS" operation="profile_load" profile="unconfined" name="/snap/core/9066/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=2424 comm="apparmor_parser"
Apr 21 18:14:03 piler kernel: [496875.877573] audit: type=1400 audit(1587492843.602:31): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=2573 comm="apparmor_parser"
Apr 21 18:14:03 piler kernel: [496875.897973] audit: type=1400 audit(1587492843.622:32): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.core" pid=2627 comm="apparmor_parser"
Apr 21 18:14:07 piler kernel: [496879.803702] audit: type=1400 audit(1587492847.526:33): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/9066/usr/lib/snapd/snap-confine" pid=2744 comm="apparmor_parser"
Apr 21 18:14:07 piler kernel: [496879.821022] audit: type=1400 audit(1587492847.546:34): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="/snap/core/9066/usr/lib/snapd/snap-confine//mount-namespace-capture-helper" pid=2744 comm="apparmor_parser"
Apr 21 18:14:07 piler kernel: [496880.074414] audit: type=1400 audit(1587492847.798:35): apparmor="STATUS" operation="profile_replace" profile="unconfined" name="snap.core.hook.configure" pid=2746 comm="apparmor_parser"
Apr 21 18:14:07 piler kernel: [496880.084867] audit: type=1400 audit(1587492847.806:36): apparmor="STATUS" operation="profile_replace" info="same as current profile, skipping" profile="unconfined" name="snap-update-ns.core" pid=2748 comm="apparmor_parser"

I’m going to check the system for new messages every couple of days for a while to see if this happens again. Anything I should look into?

Comments (25)

  1. Janos SUTO repo owner

    Please describe the environment, e.g. do you run it on a physical host, in the cloud, in a virtual machine, in a docker container, etc.?

  2. Greg Wurst reporter

    It’s a VM running on vSphere 5.5. The old Piler OVA ran on that system for several years; its DB crashed, so I built a new, updated replacement.

    I’m hoping it’s just a fluke, but even if it does occasionally crash, it would be fine if I could find a way to have the system restart the piler service automatically when it crashes, or at least send a notification email.

  3. Greg Wurst reporter

    FYI - it just crashed again. Restarting the piler service got it working again.

    Apr 24 10:56:26 piler piler-smtp[1828]: client 209.85.222.72 timeout, lasttime: 1587740126
    Apr 24 10:56:26 piler piler-smtp[1828]: disconnected from 209.85.222.72 on fd=9 (0 active connections)
    Apr 24 10:56:26 piler piler-smtp[1828]: got signal: 11, Segmentation fault
    Apr 24 10:56:26 piler piler-smtp[1828]: piler-smtp has been terminated
    Apr 24 10:56:26 piler piler[1843]: 0/4ROEIKA0I82CCMH5: 400000005ea2fde82f39590c00d18bf7ec7e, size=71796/15232, attachments=0, reference=, message-id=<922730018.-1994384669.1587673091270.JavaMail.mktmail@sjmas02.marketo.org>, retention=2557, folder=0, delay=59.3220, status=stored
    Apr 24 10:56:27 piler piler[1843]: 0/Z1SI5XOLR73E502C: 400000005ea2fe2506e5fbcc00c06be42a72, size=45046/8096, attachments=1, reference=<a6756c414194ac669a772aa1723fb97f@mail.gmail.com>, message-id=<CAKmqpu5QeGMJGmYO1FrOYJS6xRNJE37Ke7dmr_nimMUiQV9xCg@mail.gmail.com>, retention=2557, folder=0, delay=0.0487, status=stored
    

  4. Janos SUTO repo owner

    Signal 11 is SIGSEGV, indicating a segmentation fault when the process attempts an invalid memory reference. Please run piler-smtp with verbosity=5 to get more information on what’s going on.

  5. Greg Wurst reporter

    I assume that’s set in /usr/local/etc/piler/piler.conf? I increased verbosity to 5 and restarted Piler. I’ll let you know what I find when/if it crashes again.
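
    For reference, the only line I changed (assuming this is the right setting):

    # /usr/local/etc/piler/piler.conf
    verbosity=5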

    Thanks,

  6. Greg Wurst reporter

    Naturally, once I increased the logging it decided to work flawlessly. At the moment I have nothing new to report. I’ll keep the debug logging high, and if it crashes again hopefully I’ll have something useful to provide you.

  7. Greg Wurst reporter

    Looks like it crashed at 9:01 today.

    mail.log:

    May 19 09:01:01 piler piler-smtp[1766]: received: VYSH0RYC9JBOPPTM, from=tpetrosino@marketvisionresearch.com, size=28837, client=209.85.167.72, fd=6, fsync=62971318
    May 19 09:01:01 piler piler-smtp[1766]: sent on fd=6: 250 OK <VYSH0RYC9JBOPPTM>
    May 19 09:01:01 piler piler-smtp[1766]: client 209.85.167.72 timeout, lasttime: 1589893197
    May 19 09:01:01 piler piler-smtp[1766]: disconnected from 209.85.167.72 on fd=6 (0 active connections)
    May 19 09:01:01 piler piler-smtp[1766]: got signal: 11, Segmentation fault
    May 19 09:01:01 piler piler-smtp[1766]: piler-smtp has been terminated
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: hdr_len: 2255, offset: 3
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: touch 44c5daa997499ff3ba9f81c03154b539740c92caac96505661d700a935290681 OK (<ebd1caf169f0a2a3c8d3807dec910b95@mail.gmail.com>)
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored '400000005ec3d8970b85f18c00f93dff4d1f.m' 28837/7024 bytes
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: updated meta reference for '<MWHPR04MB0559C5644A409FE814F0484B89BA0@MWHPR04MB0559.namprd04.prod.outlook.com>', rc=0
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored 1 recipients, rc=0
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored metadata, rc=0
    May 19 09:01:01 piler piler[31873]: 0/VYSH0RYC9JBOPPTM: 400000005ec3d8970b85f18c00f93dff4d1f, size=28837/7024, attachments=0, reference=<MWHPR04MB0559C5644A409FE814F0484B89BA0@MWHPR04MB0559.namprd04.prod.outlook.com>, message-id=<ebd1caf169f0a2a3c8d3807dec910b95@mail.gmail.com>, retention=2557, folder=0, delay=0.1262, status=stored
    

    syslog:

    May 19 09:01:01 piler CRON[5517]: (piler) CMD (/usr/bin/find /var/piler/error -type f|wc -l > /var/piler/stat/error)
    May 19 09:01:01 piler CRON[5516]: (piler) CMD (/usr/bin/find /var/piler/www/tmp -type f -name i.\* -exec rm -f {} \;)
    May 19 09:01:01 piler CRON[5518]: (piler) CMD (/usr/bin/indexer --quiet note1 --rotate --config /usr/local/etc/piler/sphinx.conf)
    May 19 09:01:01 piler CRON[5519]: (piler) CMD (/usr/bin/indexer --quiet tag1 --rotate --config /usr/local/etc/piler/sphinx.conf)
    May 19 09:01:01 piler piler-smtp[1766]: received: VYSH0RYC9JBOPPTM, from=tpetrosino@marketvisionresearch.com, size=28837, client=209.85.167.72, fd=6, fsync=62971318
    May 19 09:01:01 piler piler-smtp[1766]: sent on fd=6: 250 OK <VYSH0RYC9JBOPPTM>
    May 19 09:01:01 piler piler-smtp[1766]: client 209.85.167.72 timeout, lasttime: 1589893197
    May 19 09:01:01 piler piler-smtp[1766]: disconnected from 209.85.167.72 on fd=6 (0 active connections)
    May 19 09:01:01 piler piler-smtp[1766]: got signal: 11, Segmentation fault
    May 19 09:01:01 piler piler-smtp[1766]: piler-smtp has been terminated
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: hdr_len: 2255, offset: 3
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: touch 44c5daa997499ff3ba9f81c03154b539740c92caac96505661d700a935290681 OK (<ebd1caf169f0a2a3c8d3807dec910b95@mail.gmail.com>)
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored '400000005ec3d8970b85f18c00f93dff4d1f.m' 28837/7024 bytes
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: updated meta reference for '<MWHPR04MB0559C5644A409FE814F0484B89BA0@MWHPR04MB0559.namprd04.prod.outlook.com>', rc=0
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored 1 recipients, rc=0
    May 19 09:01:01 piler piler[31873]: 400000005ec3d8970b85f18c00f93dff4d1f: stored metadata, rc=0
    May 19 09:01:01 piler piler[31873]: 0/VYSH0RYC9JBOPPTM: 400000005ec3d8970b85f18c00f93dff4d1f, size=28837/7024, attachments=0, reference=<MWHPR04MB0559C5644A409FE814F0484B89BA0@MWHPR04MB0559.namprd04.prod.outlook.com>, message-id=<ebd1caf169f0a2a3c8d3807dec910b95@mail.gmail.com>, retention=2557, folder=0, delay=0.1262, status=stored
    

    kern.log was empty.

  8. Greg Wurst reporter

    The Piler SMTP service still crashes occasionally, maybe every couple of weeks or so. The logs haven’t had anything enlightening in them, so I’ve taken to checking the Piler status page once a day and checking the SMTP status. If it’s not OK, I just restart the piler service. As long as it isn’t down for 4 days straight, Google will still send the copy of our email.

    Still no idea what causes the occasional crash. The only thing that would be handy is some way for the system to notice that the piler SMTP service isn’t running and automatically restart the piler service itself. As-is, I’m fine with just checking manually; it’s probably good to make sure things are working every day anyway.

  9. Janos SUTO repo owner

    Perhaps systemd could restart it in that case. Check out the systemd folder in the piler source dir. You may need to adjust the service file for paths, etc.
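
    For example, a drop-in like the following would have systemd restart the daemon after a crash (a sketch only — the unit name and paths are assumptions, so match them to whatever service file you actually install):

    # /etc/systemd/system/piler-smtp.service.d/restart.conf  (hypothetical drop-in)
    [Service]
    # restart piler-smtp automatically if it exits abnormally, e.g. on SIGSEGV
    Restart=on-failure
    RestartSec=10

    After adding it, run systemctl daemon-reload and restart the service so the drop-in takes effect.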

  10. Greg Wurst reporter

    I’ve updated that server; it’s currently on 1.3.9 build 998. The problem still exists, but before I potentially break something I want to make sure you recommend trying this patch for 1.3.9 as well. At the moment I have that server configured to reboot nightly in case it crashes; I’ll turn that off and keep an eye on it if I install this patch.

    Thanks again,

  11. Greg Wurst reporter

    Ok, I’ve applied the patch and disabled the nightly reboots. I’ll keep an eye on it and let you know.

    Thanks for your help,

  12. Greg Wurst reporter

    No luck, I’m afraid.

    Feb 11 10:46:00 piler piler-smtp[2026]: received: 0YOMB0ULBQ06CN61, from=mbarford@marketvisionresearch.com, size=635913, client=209.85.221.71, fd=6, fsync=62988875
    Feb 11 10:46:00 piler piler-smtp[2026]: sent on fd=6: 250 OK <0YOMB0ULBQ06CN61>
    Feb 11 10:46:00 piler piler-smtp[2026]: client 209.85.221.71 timeout, lasttime: 1613058297
    Feb 11 10:46:00 piler piler-smtp[2026]: disconnected from 209.85.221.71 on fd=6, slot=0 (1 active connections)
    Feb 11 10:46:00 piler piler-smtp[2026]: client 209.85.128.70 timeout, lasttime: 1613058297
    Feb 11 10:46:00 piler piler-smtp[2026]: disconnected from 209.85.128.70 on fd=7, slot=1 (0 active connections)
    Feb 11 10:46:00 piler piler-smtp[2026]: Removing R7IDFOCM8ZAWLR79
    Feb 11 10:46:01 piler piler-smtp[2026]: got signal: 11, Segmentation fault
    Feb 11 10:46:01 piler piler-smtp[2026]: piler-smtp has been terminated
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:1, name=image001.png, type: image/png, size: 6204, int.name: 4000000060255143128595b4003ce94a13a3.a1, digest: dfa1d38bf02600ff1ac694bfb23dbf724f57a148b25677173ce226890b1257f4
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:2, name=image002.png, type: image/png, size: 6332, int.name: 4000000060255143128595b4003ce94a13a3.a2, digest: 265ac1b19d2f0595d2e27e7aff43be4237619ba0dfd6be03b5391bd4d0f9e46c
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:3, name=image003.png, type: image/png, size: 6332, int.name: 4000000060255143128595b4003ce94a13a3.a3, digest: abfc6e6a97e238733059743448504adae318dcc6e15feabd33cb08d13f2f3a00
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:4, name=image004.png, type: image/png, size: 6340, int.name: 4000000060255143128595b4003ce94a13a3.a4, digest: 968cadbc605616d8c2544a5f3e7197d7bb9b226f32809c9ab2f165b3ea966ef1
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:5, name=image005.png, type: image/png, size: 6328, int.name: 4000000060255143128595b4003ce94a13a3.a5, digest: 576f9d422c45f46064b99666583772ecc4ed7347a8f40a58874349d07e107ba5
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:6, name=image006.png, type: image/png, size: 6332, int.name: 4000000060255143128595b4003ce94a13a3.a6, digest: 25a97fcf086ece38d5e0f3276c805545a9d478c9dab7e1d24277d3499a66703c
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: attachment list: i:7, name=image007.jpg, type: image/jpeg, size: 73892, int.name: 4000000060255143128595b4003ce94a13a3.a7, digest: deb6dda019e5af23346ee36f99ec3829075c846b0125e0035b1197f5c5121b5c
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: hdr_len: 3440, offset: 3
    Feb 11 10:46:01 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: touch a15c129a966ca3dbff589f8324bb54539fa437529387b8c97338db35738c3a31 OK (25c82f43da8731cebe555c27411a95c9@mail.gmail.com)
    Feb 11 10:46:05 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: stored '4000000060255143128595b4003ce94a13a3.m' 524629/62352 bytes
    Feb 11 10:46:05 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: updated meta reference for 'fbad2d28f6d1482d911f7dbac77e65d0@CRCexch01.crcvan.ca', rc=0
    Feb 11 10:46:05 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: stored 2 recipients, rc=0
    Feb 11 10:46:05 piler piler[2035]: 4000000060255143128595b4003ce94a13a3: stored metadata, rc=0
    Feb 11 10:46:05 piler piler[2035]: 0/0YOMB0ULBQ06CN61: 4000000060255143128595b4003ce94a13a3, size=635913/62352, attachments=7, reference=fbad2d28f6d1482d911f7dbac77e65d0@CRCexch01.crcvan.ca, message-id=25c82f43da8731cebe555c27411a95c9@mail.gmail.com, retention=2557, folder=0, delay=4.3905, status=stored

  13. Janos SUTO repo owner

    Well, then we need to dig deeper to understand what’s going on. I’d like you to get the smtpsegv branch (https://bitbucket.org/jsuto/piler/get/smtpsegv.tar.gz) and recompile piler. Then back up the current piler-smtp and replace it with the binary compiled from the smtpsegv branch. It will syslog much more detail, which will hopefully help me find out what the issue is.
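
    Something along these lines should do it (a sketch only — adjust the configure flags to match your original build, and the paths to wherever your piler-smtp is installed):

    # download and build the smtpsegv branch
    wget https://bitbucket.org/jsuto/piler/get/smtpsegv.tar.gz
    tar xzf smtpsegv.tar.gz
    cd jsuto-piler-*    # the extracted directory name contains a commit hash
    ./configure --localstatedir=/var --with-database=mysql --enable-tcpwrappers
    make
    # back up the installed binary, then replace it with the freshly built one
    # (check where yours lives, e.g. with "which piler-smtp")
    cp /usr/local/sbin/piler-smtp /usr/local/sbin/piler-smtp.bak
    cp src/piler-smtp /usr/local/sbin/piler-smtp
    # restart the piler service afterwards so the new binary is picked up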

    Btw, I tried to reproduce the issue in both docker and libvirt/kvm without success, that is, piler-smtp survived pausing the VM and the container and went on just fine. Another user also mentioned the same issue on the mailing list. Hopefully the inputs from both of you will help me understand the problem.

  14. Janos SUTO repo owner

    Today I refactored the timeout check for the epoll loop and updated the smtpsegv branch. I’d like you to download the source code for this branch, recompile piler-smtp, and let’s give it another shot.

  15. Greg Wurst reporter

    Ok, just to make sure I don’t hose everything here, let me ask a question.

    1. My assumption is that compiling the smtpsegv branch just overwrites all the current binaries and such? Do I need to back up anything in particular, or just recompile and let it overwrite? I have the stable release, so I guess I could just recompile that and go back if I ran into any trouble?

  16. Janos SUTO repo owner

    I suggest overwriting only the piler-smtp binary, nothing else. It’s a standalone daemon, and it doesn’t need the libpiler.so library.

  17. Greg Wurst reporter

    Ok, I’ve recompiled and copied over just the new piler-smtp executable. We’ll see how it goes.

    Thanks again,

  18. Janos SUTO repo owner

    Two other users confirmed the fix on the mailing list, so I’ve merged it to the master branch. If the problem still persists, please reopen this issue. Hopefully, it won’t.
