Piler stops accepting connections after ~15 minutes

Issue #973 closed
Tom Collins created an issue

So this is a strange one. I am currently running a POC using Piler, with a throughput of around 80,000 messages per day.

I'd had it running absolutely fine for a week, but a couple of days ago I logged in and it showed that no messages had been received for 6 hours, along with the "SMTP Status: piler: ERROR" message in the GUI.

So, I went ahead and tried to telnet to it on port 25. It just sat at the "Trying..." message and never actually connected.

I checked the maillog, and I could see that the messages had been coming in normally; then the number of active connections just decreased over a period of an hour, until eventually the only entries related to piler's internal tasks, with zero connections.
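That ramp-down can be put into numbers by deriving a running connection count from the maillog itself. A minimal sketch (the "connection from" / "disconnected" patterns are assumptions about the log phrasing; adjust them to whatever your syslog actually prints):

```shell
# Print each maillog line's timestamp plus a running count of open
# piler-smtp connections. The two patterns are assumed; match them
# to your actual log format.
awk '
  /connection from/ { n++ }
  /disconnected/    { if (n > 0) n-- }
  { print $1, $2, $3, n }      # syslog month, day, time + open count
' /var/log/maillog
```

Piping the output to a file and graphing it makes the decline from ~650 down to 0 easy to see at a glance.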

So! I restarted the piler service, and it started accepting mail again. However, after about 15 minutes, the number of connections once again dropped to 0 and I could no longer telnet to it. Netstat shows the port is still listening, and another restart of the piler service makes it work again for another 15 minutes.

I can't find any kind of error in any of the Linux logs. There are no other SMTP applications on the server. I'm not entirely sure where to look. I was actually days away from standing this up in production, but I hate not being able to see the root cause of an issue in the logs.

Is this anything anyone has experienced before? Any ideas where I could look?

Comments (13)

  1. Tom Collins reporter

    So, an update (sorry if these messages are long!)

    I modified the max_connections attribute in piler.conf, and the number_of_worker_processes setting.

    What I found was that when Exchange Online has a queue of journal messages, it will establish around 650 connections (I experimented with max_connections up to 1000!). Whatever number I chose, the count maxes out first and then just starts decreasing, one at a time, down to 0.

    I've today stood up a second piler instance, and actually set the max connections to 32. This, thus far, has been working for an hour, with obviously quite a lot of 'too many connections' errors cropping up. I expect these to disappear eventually, as I find our environment hovers around 20 connections in normal conditions.
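    For reference, both knobs live in piler.conf; a fragment mirroring the setup above (the config file path varies by install, and the worker-process value is illustrative, not a recommendation):

    ```ini
    # piler.conf fragment
    # cap on simultaneous SMTP connections; the default is 64
    max_connections=32
    # worker process count; the value here is illustrative only
    number_of_worker_processes=4
    ```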

    I guess I'm just after a couple of things:

    • Where could I find information about why it keeps choking? (I've checked every log I can think of.)
    • Why do you think this might be happening? Is it some kind of keep-alive setting or something related to the ports etc?
    • What performance tweaking do you recommend? Are there limitations with the max_connections setting? I have been unable to get my server utilization above 8%, even with 500 connections and a 100 Mbit circuit (on a 4-core server with 8 GB RAM), so that would imply the hardware is not the issue

    Any tips/advice would be greatly appreciated!

  2. Janos SUTO repo owner

    What piler version do you have? 80k messages per day should be no problem. 650 open connections from Exchange Online is crazy, and if it's under your administration, then it should have some sort of flow control.

    Btw. I don't think there's a (built-in) limitation for the max_connections setting. The default of 64 is usually sufficient. Anyway, I'll put together some sort of crash test that opens more connections than $max_connections and see if piler recovers later when the load drops.

  3. Tom Collins reporter

    Thank you for the reply, Janos! :) I'm running 1.3.4.

    We use Office 365, so we have no control over journalling flow. During normal conditions it's fine, but if piler is down and the journal queues up, it seems to go crazy. From looking at the maillog, it looks like they distribute the outbound connections over thousands of different hosts (each connection is from a different host) in this situation.

    With max_connections set to 32, it has been working fine all night. The soft-bounce queue from Exchange has emptied, and we're now sitting at around 3-10 simultaneous connections and 4,500 messages per hour.

    My concern is that it was working fine with the default setting of 64 for around a week, and then it just stopped. I don't believe there was a large amount of mail traffic coming into it during this period, and my real concern is that I could not find anything in the logs pertaining to a failure.

    What I might do in production is sit a couple of Postfix boxes in front of piler, load balance the inbound connections between them, set a large soft-bounce queue on each of them (a week), and then limit their outbound connections to match piler.
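    That front-end idea maps onto a handful of standard Postfix main.cf parameters; a sketch (the relay host name is a placeholder):

    ```ini
    # /etc/postfix/main.cf excerpt: queueing relay in front of piler
    relayhost = [piler.example.internal]:25
    # keep undeliverable mail queued for a week
    maximal_queue_lifetime = 7d
    bounce_queue_lifetime = 7d
    # cap simultaneous deliveries toward piler (match max_connections)
    default_destination_concurrency_limit = 32
    # retry roughly every five minutes
    minimal_backoff_time = 300s
    ```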

    I'm sure none of this is necessary. Maybe I did something wrong? Hopefully I'll stumble upon the problem. Fantastic product, by the way. Our company cannot wait to start using Piler in production! :) If you need more information on my sandbox environment, please don't hesitate to ask!

  4. Janos SUTO repo owner

    I think you did everything right. It's just odd that 32 is fine for max_connections but 64 is not. Anyway, regardless of the exact value, piler-smtp should handle it properly when all connections are occupied, and it should recover once the spike in mail volume goes back to normal.

  5. Tom Collins reporter

    Indeed - I think we can probably close this issue off for now.

    What I did yesterday afternoon was stand up the Postfix box I was talking about and set its maximum outbound connections to piler to 32 (matching my piler configuration). I then killed the piler process last night and allowed Postfix to retain the emails in a soft-bounce state. I came back in this morning, re-enabled the piler process, and Postfix sent across 40,000 emails over the course of the morning (it's set to 5-minute re-deliveries).

    Absolutely no issues. TPS on the piler server shot up to 900, and CPU remained below 8%. No SMTP errors; everything is working well. I am going to put it down to O365 flooding my server. As I say, this is a new box I stood up, so maybe there was a misconfiguration on my old one. I will now leave this running for a month and re-open this incident if it happens again (maybe you can remote on and see it first hand, if it does).

    Before we close, can I quickly ask you a couple of unrelated questions, if that's okay?

    • I estimate that my archive will be 30TB after 3 years (3 years will be our maximum retention period). Do you have any specific MariaDB settings you would implement to support this size and traffic, or are the defaults adequate?

    • My plan is to have a piler server in Asia Pacific and one in the UK. I will then journal to both of them simultaneously. I was going to set up an autonomous process to perform a pilerexport each day (from each server), copy the files over to the partner server, and then perform a pilerimport. Will piler simply ignore duplicates if I do this, and only import the delta emails (if, say, we lost the circuit connection in one of the colos for a day)? This is my (I hope) clever way of maintaining a matching set of data in case of failure.
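    A rough sketch of that daily export/import cycle (all paths and the partner host name are placeholders; check `pilerexport -h` and `pilerimport -h` for the exact options your version supports before relying on this):

    ```shell
    # Hypothetical daily sync job; run from cron on each site.
    set -eu
    DAY=$(date +%F)
    EXPORT_DIR="/var/piler/export/$DAY"        # placeholder path
    mkdir -p "$EXPORT_DIR"
    cd "$EXPORT_DIR"

    # Export the day's mail; add date-range options here so you don't
    # re-export the whole archive every night.
    pilerexport

    # Ship the exported EML files to the partner site.
    rsync -a "$EXPORT_DIR/" "partner-site:/var/piler/import/$DAY/"

    # Then, on the partner server, import the directory. pilerimport
    # discards messages already in the archive, so re-imports are safe:
    #   pilerimport -d "/var/piler/import/$DAY"
    ```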

    Once again, thank you so much for your help! I am writing a medium and linkedin article about the benefits of Piler, in relation to cost savings against Mimecast!

  6. Janos SUTO repo owner

    Well, I'm not sure about the exact MySQL variable settings you need for a 30 TB archive, but here are some starting values for a small archive:

    innodb_buffer_pool_size = 256M
    innodb_flush_log_at_trx_commit = 1
    innodb_log_buffer_size = 64M
    innodb_log_file_size = 64M
    innodb_read_io_threads = 4
    innodb_write_io_threads = 4
    innodb_log_files_in_group = 2
    innodb_file_per_table
    

    I suggest checking the MariaDB docs on how to increase these values (in a meaningful way). I also recommend keeping tabs on the MySQL load: when MySQL has too few resources available, the sphinx delta index time increases considerably. I also suggest starting a new main index (main2, later main3, ...) when the main1.* files grow large. (I know 'large' is not that specific in terms of GBs, but it's difficult to be precise without knowing your hardware specs and the actual performance of your archive.)
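    One way to put numbers on "MySQL has too few resources" (a suggestion of mine, not from the thread) is to compare InnoDB's logical reads with the reads that actually went to disk:

    ```shell
    # Innodb_buffer_pool_read_requests = logical page reads;
    # Innodb_buffer_pool_reads = reads that missed the buffer pool and
    # hit disk. A persistently high miss share suggests raising
    # innodb_buffer_pool_size. Credential/socket options are omitted.
    mysql -e "SHOW GLOBAL STATUS LIKE 'Innodb_buffer_pool_read%'"
    ```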

    Your disaster recovery plan looks fine, and pilerimport will discard already-archived emails, no problem. However, if you send the given email to both locations (not sure if I understood you correctly), then there's no need for the export/import approach.

    When you are done with the LinkedIn article, be sure to send it to me (either here or on LinkedIn); I'd like to read it.

  7. Tom Collins reporter

    Hi Janos - Unfortunately the new system came to a grinding halt again at the weekend. Worryingly, this was a fresh build on a completely different server.

    Luckily, my postfix server managed to cache the entire weekend's mail, so I haven't lost anything.

    The behaviour this time seems slightly different: rather than a telnet on port 25 timing out, this time the connection established, but it would not take any commands (including QUIT).

    Here are some screenshots and a timeline, so you can get an idea of what's going on.

    When I came in this morning, I saw the "Piler: ERROR" message. I checked my Postfix server, and it had stacked up all my weekend messages with the error "timed out while receiving the initial server greeting".

    I SSH'd onto the piler server, checked that everything was running, and checked the logs. The maillog (just like previously) looked fine until the messages simply stopped: 1.png

    I tried telnetting to it on port 25 to see what was going on; it opened the connection but would not respond to any commands (I had to close my SSH session): 2.png

    Restarting the daemons didn't do anything. I had to restart the VM, and then it came back up.

    Once again, I can't find any error in any log... the only indication that something went wrong is the maillog (as it simply stops processing mail).

    At this point, I'm at a loss. Do you have any ideas?

  8. Janos SUTO repo owner

    It needs further investigation. I'd like you to turn on verbose logging (verbose=5), and when it freezes, run lsof to figure out what sockets, descriptors and handles piler is using. However, it's unusual that killing all piler processes and then starting piler again doesn't help at all.
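    A sketch of capturing that state at freeze time (the process names are the usual piler defaults; adjust them to your install):

    ```shell
    # Snapshot descriptors/sockets held by the piler processes; run as
    # root while the hang is happening, before restarting anything.
    for p in $(pidof piler piler-smtp); do
        lsof -p "$p" > "/tmp/lsof.$p.txt"
    done
    # The kernel's view of port 25 connections, to compare against lsof.
    ss -tan 'sport = :25' > /tmp/ss.25.txt
    ```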

    If it's a new installation, and if it's possible to see what's going on, I'd like to get SSH access to the host. If you agree, let's discuss the details on a different channel: email or Skype (janos.suto).

  9. Tom Collins reporter

    Thank you, Janos. I have enabled verbose logging, and I will contact you when/if it happens again.
