piler-smtp.c: random SIGSEGV

Issue #1033 closed
Rolf Weber created an issue

Hi!

I’m experiencing random(?) intermittent segfaults on one of my piler machines.
I had almost daily crashes a while ago but never got around to debugging this. It wasn’t a big deal, and the segfaults vanished on their own (strange…).
For the last couple of days they have been back, unfortunately (see attached log excerpts).

The 'received: ' entries struck me as looking suspicious, so I turned on core dumps etc. However, since there is a signal handler installed ( https://bitbucket.org/jsuto/piler/src/538019632510868ffb9dc9324dc8fe1020a83baa/src/piler-smtp.c#lines-194 ), this was not very useful. So next I ran piler-smtp in gdb and waited for the next crash (see attached gdb output).

session->ttmpfile actually looks mangled to me; AFAICT there is no way make_random_string() could generate that kind of filename.
session->mailfrom looked strange to me at first glance as well, until I looked at extractEmail(). Now its contents make more sense to me and appear to be OK.
The segfault actually happens in SSL_write(); however, session->ttmpfile was already mangled when it was copied for send_smtp_response() (cf. frame #2 in the trace).
readbuf in main()'s stack frame appears to contain a perfectly valid email, terminated by a section divider followed by “\r\n.\r\n” (as it should be).

Sorry, but I can’t provide the complete core file, as it contains customers' data. What I can do, however, is provide excerpts.
Can you provide any pointers as to where I should look next? I’m not that familiar with the code and its structure yet.

I also filed two other bugs (#1031 and #1032) that I stumbled upon while researching this; they should be unrelated, however.

Cheers,
Rolf Weber


Version info:
root@piler:~# /usr/local/sbin/piler-smtp -V
1.3.6 build 998
root@piler:~# /usr/local/sbin/piler -V
piler 1.3.6, build 998, Janos SUTO sj@acts.hu

Build Date: Wed Oct 30 08:57:27 CET 2019
ldd version: ldd (Debian GLIBC 2.24-11+deb9u3) 2.24
gcc version: gcc version 6.3.0 20170516 (Debian 6.3.0-18+deb9u1)
OS: Linux piler 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
Configure command: ./configure --localstatedir=/var --with-database=mysql --enable-starttls --enable-tcpwrappers
MySQL client library version: 10.1.41-MariaDB
Extractors: /usr/bin/pdftotext /usr/bin/catdoc /usr/bin/catppt /usr/bin/xls2csv /usr/bin/unrtf /usr/bin/tnef

Comments (7)

  1. Rolf Weber reporter

    Hi again!

    After taking another look at this, I believe I have found a possible cause for the SIGSEGV: a race between check_for_client_timeout() (the SIGALRM handler) and the SMTP processing.
    Assume piler-smtp.c+287, session = get_session_by_socket(sessions, cfg.max_connections, events[i].data.fd);, returns a valid session and processing starts. Furthermore, assume smtp.c+267, void process_command_period(struct smtp_session *session), is reached. The call to fsync() there may block for quite some time. If SIGALRM fires during this window, check_for_client_timeout() will kill the connection (and invalidate the associated session structs). The subsequent send_smtp_response(session, buf); call will then reach SSL_write(), which segfaults when called with an invalid session context.
    I’m not quite sure whether this explains the mangled session->ttmpfile string, but it might if the timing is right.
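
    To illustrate the pattern I suspect, here is a minimal self-contained sketch (hypothetical names, not the actual piler code): a SIGALRM handler tears down the session the main path is still working on, so the code running after the slow call continues with a dangling pointer.

    #include <signal.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>

    /* Hypothetical sketch of the suspected race, not piler code: the SIGALRM
     * handler frees the session that the main path is still processing. */
    struct session {
       char ttmpfile[64];
    };

    static struct session *active;          /* session currently being handled */

    static void timeout_handler(int sig)
    {
       (void) sig;
       free(active);                        /* "kill the connection": session memory is gone */
       active = NULL;
    }

    int main(void)
    {
       struct session *session = calloc(1, sizeof(*session));

       snprintf(session->ttmpfile, sizeof(session->ttmpfile), "placeholder-filename");
       active = session;

       signal(SIGALRM, timeout_handler);
       alarm(1);
       sleep(2);                            /* stands in for fsync() blocking on slow storage */

       /* 'session' is now dangling; in piler this would correspond to
        * send_smtp_response()/SSL_write() using the stale session. */
       printf("ttmpfile: %s\n", session->ttmpfile);
       return 0;
    }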

    In the logs I noticed client timeouts with matching fds just before the segfault. The timestamps also seem to correlate with high I/O load (and therefore higher-than-normal I/O latency) on the backend storage, which could make fsync() take quite some time.

    Cheers,
    Rolf Weber

  2. Janos SUTO repo owner

    Hmm, this actually makes sense. The piler-smtp daemon checks for dead connections every 20 seconds, and the default SMTP timeout is 60 seconds, which should usually be fine.

    Assuming the high I/O load is not something you can easily fix, I think the fix in smtp.c could be to protect fsync() and the subsequent close() from the alarm signal, e.g.

    +   sig_block(SIGALRM);
        fsync(session->fd);
        close(session->fd);
    +   sig_unblock(SIGALRM);
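
    Here sig_block()/sig_unblock() are meant as thin wrappers around sigprocmask(); roughly the following sketch (the actual helpers in the tree may differ slightly):

    #include <signal.h>

    /* Sketch only: block/unblock a single signal via the process signal mask;
     * the real helpers may look slightly different. */
    static void sig_block(int sig)
    {
       sigset_t set;

       sigemptyset(&set);
       sigaddset(&set, sig);
       sigprocmask(SIG_BLOCK, &set, NULL);
    }

    static void sig_unblock(int sig)
    {
       sigset_t set;

       sigemptyset(&set);
       sigaddset(&set, sig);
       sigprocmask(SIG_UNBLOCK, &set, NULL);
    }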
    

    It may be worth checking how long fsync() and close() take, so I added a logging line like

    syslog(LOG_PRIORITY, "fsync()+close() took %ld [us]", tvdiff(tv2, tv1));
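
    where tv1 and tv2 would be taken right around the two calls, roughly like this sketch (flush_and_close() is just an illustrative name, and tvdiff() is assumed to return the elapsed time in microseconds; the actual code may differ):

    #include <sys/time.h>
    #include <syslog.h>
    #include <unistd.h>

    /* Sketch: time fsync()+close() on a descriptor; tvdiff() is assumed to
     * return the elapsed time in microseconds. */
    static long tvdiff(struct timeval tv2, struct timeval tv1)
    {
       return (tv2.tv_sec - tv1.tv_sec) * 1000000L + (tv2.tv_usec - tv1.tv_usec);
    }

    static void flush_and_close(int fd)
    {
       struct timeval tv1, tv2;

       gettimeofday(&tv1, NULL);
       fsync(fd);
       close(fd);
       gettimeofday(&tv2, NULL);

       syslog(LOG_INFO, "fsync()+close() took %ld [us]", tvdiff(tv2, tv1));
    }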
    

    So I’d like you to fetch and compile the master branch and try the fix mentioned above. I’m also interested in your numbers for how long fsync() and close() on the descriptor actually take.

  3. Rolf Weber reporter

    If the timeout handler does in fact kill the SSL socket, this will not resolve the issue. IIRC sig_block()/sigprocmask() just defers delivery of the signal (it stays pending), and the unblock will deliver the pending signal.
    IMHO the signal would have to stay blocked until after the SSL_write(), so that process_command_period() as a whole becomes uninterruptible by the timeout handler. I will have to think a bit more about what to do.
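
    A quick self-contained way to see that defer-then-deliver behaviour (a sketch, not piler code):

    #include <signal.h>
    #include <stdio.h>

    static volatile sig_atomic_t got_alarm;

    static void on_alarm(int sig)
    {
       (void) sig;
       got_alarm = 1;
    }

    int main(void)
    {
       sigset_t set;

       signal(SIGALRM, on_alarm);
       sigemptyset(&set);
       sigaddset(&set, SIGALRM);

       sigprocmask(SIG_BLOCK, &set, NULL);
       raise(SIGALRM);                           /* now pending, handler not yet run */
       printf("while blocked: got_alarm=%d\n", (int) got_alarm);   /* prints 0 */

       sigprocmask(SIG_UNBLOCK, &set, NULL);     /* pending signal delivered here */
       printf("after unblock: got_alarm=%d\n", (int) got_alarm);   /* prints 1 */

       return 0;
    }
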
    Anyhow, measuring the time it takes to write out the file might be helpful for now; it may also be getting stuck somewhere else.

  4. Janos SUTO repo owner

    I’ve also been thinking about it. It makes sense to protect the whole process_command_period(); however, it is called from process_data() in smtp.c, where we either write data to the file descriptor or deal with the trailing period and send an SMTP response back to the client. So I wonder whether deferring the alarm signal in session.c around process_data() would be more appropriate. I have something like the following in mind, see around https://bitbucket.org/jsuto/piler/src/master/src/session.c#lines-210 (not committed yet):

            if(session->protocol_state == SMTP_STATE_DATA && (rc == OK || puf[0] != '.')){
                sig_block(SIGALRM);
                process_data(session, puf, puflen);
                sig_unblock(SIGALRM);
            }
    

  5. Janos SUTO repo owner

    I’ve made a few commits today, and I believe this issue is fixed on the master branch. Let me know if you still experience it.
