High CPU load on piler server

Issue #408 closed
Michal Mistina created an issue

Hello.

The customer is experiencing high CPU load on the piler server. The piler process consumes 100% of the CPU. Only restarting that service helps. The messages are processed and are visible inside WebUI however the MTA device which sends the messages to the piler builds very high queue (last time it ended up on 4000 e-mails). It looks like as the piler does not process those e-mails fast enough.

The hardware should be OK: 2xCPU, 2GB RAM

The customer says, it has most current version of the piler. I am waiting for the output of the piler -v command.

Where else should we look?

Comments (27)

  1. Janos SUTO repo owner

    I think perhaps an attachment causing trouble. If piler -v doesn't have the --with-plugin-timeout=xx option, then recompile piler adding this options as well.

  2. Michal Mistina reporter

    Janos, where can I check the latest build number of the piler on the bit bucket web page? The customer sent the output of the piler -v command:

    piler -v 
    piler 1.1.0, build 884, Janos SUTO 
    
    Build Date: Mon Sep 8 14:34:48 CEST 2014 
    ldd version: ldd (GNU libc) 2.12 
    gcc version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-4) (GCC) 
    Configure command: ./configure --localstatedir=/var --with-database=mysql --enable-starttls --enable-tcpwrappers --enable-memcached
    

    Can you tell me if this is the very last build or is it too old? Should we try to recompile a newer version?

    Thank you for suggestion. We will definitely try that.

    Michal

  3. Janos SUTO repo owner

    The bitbucket site doesn't tell the build number. Perhaps some tagging would help - maybe in the future. However you can compile a newer version if you want, it's a matter of configure ....; make ; make install. The point is to add -with-plugin-timeout=20 to the configure options. Also be sure to monitor the memory usage.

  4. Michal Mistina reporter

    With option with-plugin-timeout doesn't piler start to drop some of the attachements?

  5. Michal Mistina reporter

    Before configure; make; make install; is it necessary also to do "make uninstall"? I am not familiar with the answer for this question because I do make uninstall all the time being.

  6. Janos SUTO repo owner

    The with-plugin-timeout is a timeout value for the external helpers, they have so many seconds to finish extracting text. It's purpose is to prevent the helpers to stuck in a loop in case of a damaged attachment. If the specified seconds are not enough, the attachment is not lost, it's still archived, just its content is not indexed.

    Don't do a make uninstall, just stop piler, then do a make install, finally start it. However it's always a good idea to backup your piler related configs, including piler.conf and piler.key.

  7. Michal Mistina reporter
    • changed status to open

    Hi Janos. The customer followed the steps with recompilation but it didn't help. The CPU is still at the top. I have this news from yesterday.

  8. Janos SUTO repo owner

    If it's centos, then add the following to config-site.php:

    $config['CPU_USAGE_COMMAND'] = "LC_ALL=C mpstat | tail -1 | awk '{print $12}'";
    
  9. Michal Mistina reporter

    Yes, it's a CentOS. We will try. What does this change do? Where it can be observed after applied?

  10. Janos SUTO repo owner

    This command is executed to get the actual system load. You may see it on the health page.

  11. Michal Mistina reporter

    The issue is not the displaying part of the piler, but CPU usage itself. It consumes from 50 to 100% CPU.

    ps aux | grep piler
    piler 1668 0.0 0.0 18208 668 ? S Aug20 0:00 searchd
    piler 1669 0.0 11.4 372060 339888 ? Sl Aug20 35:22 searchd
    piler 3432 0.0 0.0 20208 884 ? Ss Sep11 0:06 /usr/local/sbin/piler -d
    piler 3433 50.2 0.1 21476 3144 ? R Sep11 4409:53 /usr/local/sbin/piler -d
    piler 3435 47.2 0.1 21444 3124 ? R Sep11 4143:42 /usr/local/sbin/piler -d
    piler 3437 47.1 0.1 21644 3348 ? R Sep11 4135:29 /usr/local/sbin/piler -d
    piler 3438 50.1 0.1 21584 3252 ? R Sep11 4399:22 /usr/local/sbin/piler -d
    piler 8616 0.2 0.1 21496 3208 ? S 14:54 1:04 /usr/local/sbin/piler -d
    piler 8997 0.2 0.1 21548 3148 ? S 15:03 1:00 /usr/local/sbin/piler -d
    piler 9006 0.2 0.1 21356 2988 ? S 15:03 0:52 /usr/local/sbin/piler -d
    piler 9306 0.2 0.1 21824 3528 ? S 15:11 0:49 /usr/local/sbin/piler -d
    piler 9709 0.2 0.1 21760 3432 ? S 15:20 0:54 /usr/local/sbin/piler -d
    piler 10101 0.2 0.0 21368 2944 ? S 15:31 0:53 /usr/local/sbin/piler -d
    

    This outcomes to filling the queue on the smtp device which sends the mails to piler. Only restart of the piler service helps to keep it down for a while. I don't have the information when after restart of the service it starts adding up CPU consumption again.

  12. Janos SUTO repo owner

    Ah, sorry, I overlooked the problem. Have they applied the extra configure option I suggested above? When the cpu load is high, are the piler children processes processing emails? Check out the maillog with tail -f. Also set verbosity=5, then restart piler, thus you'll get a detailed info what's happening.

  13. Michal Mistina reporter

    Yes, the recompilation solution with parameter -with-plugin-timeout=20 was applied, but didn't help. Maillog didn't show anything particular except "..duplicate message.." messages, but we will try to use verbose=5 option as you suggested. How can I determine if the messages are processed by other children processes? Where can be found the information which messages are processed by which children processes?

  14. Janos SUTO repo owner

    Each child process is assigned its own unique process id. A typical piler log entry looks like <date> piler[12345] ...., where 12345 is the process id (pid). Also make sure the number of piler worker (child) processes match the number of parallel deliveries your MTA tries. Eg. if the MTA tries to open 20 parallel connection to piler, and piler has only 10 worker processes, then that's a bottleneck, and the MTA is likely to develop a queue of deferred emails.

  15. Michal Mistina reporter

    Please leave this case open longer than usual. The customer responds really slow on this :-/

    Thank you

  16. tinyema

    I had the same problem (Piler 1.1.0 build 884 OVA)... pdftotext stuck and I had high cpu usage & piler smtp error on the health's page. I have followed your instructions and I have recompiled with the option --with-plugin-timeout=20 Now an hour has passed and the problem has not occurred more :-) Thanks!

  17. Michal Mistina reporter

    Don't know the status from the customer...last 2 times I tried to contacted him, he told me, that he did a workaround by restarting piler process once in a day by cron job. Not nice, but currently it is resolved.

  18. Michal Mistina reporter
    • changed status to open

    Hello again. Finally I have a some log excerpts. The first excerpt shows the log in the time when the CPU load started to rise.

    Nov  4 05:16:09 mailarchive piler[31141]: connection from X.X.X.X
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: sent: 220 mailarchive.domain.sk ESMTP#015
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: got: EHLO mailserver.domain.sk
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: sent: 250-mailarchive.domain.sk#015#012250-PIPELINING#015#012250-SIZE#015#012250 8BITMIME#015
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: got: MAIL FROM:<antispam@domain.sk> SIZE=37522
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: got: RCPT TO:<archiver@mailarchive.domain.sk>
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: got: DATA
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: sent: 250 Ok#015#012250 Ok#015#012354 Send mail data; end it with <CRLF>.<CRLF>#015
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: period found
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: got: (.)
    Nov  4 05:16:09 mailarchive piler[31141]: 40000000545853133580ffa4004291a7d5cc: attachment list: i:1, name=*V000223.zip*, type: *application/octet-stream*, size: 34082, int.name: 40000000545853133580ffa4004291a7d5cc.a1, digest: 5f3e2ad1b7eb9e0556698d3ec753b704ab7bae4b4550b604d08a97fa748a5c92
    Nov  4 05:35:01 mailarchive piler: INDEXER INFO: indexing delta1 started
    Nov  4 05:35:02 mailarchive piler: INDEXER INFO: indexing delta1 finished
    Nov  4 05:35:07 mailarchive piler: INDEXER INFO: merging delta to dailydelta started
    Nov  4 05:35:08 mailarchive piler: INDEXER INFO: merging delta to dailydelta finished
    Nov  4 06:05:02 mailarchive piler: INDEXER INFO: indexing delta1 started
    Nov  4 06:05:02 mailarchive piler: INDEXER INFO: indexing delta1 finished
    Nov  4 06:05:07 mailarchive piler: INDEXER INFO: merging delta to dailydelta started
    Nov  4 06:05:08 mailarchive piler: INDEXER INFO: merging delta to dailydelta finished
    Nov  4 06:35:01 mailarchive piler: INDEXER INFO: indexing delta1 started
    Nov  4 06:35:01 mailarchive piler: INDEXER INFO: indexing delta1 finished
    

    Right after the piler service is restarted it shows some error messages inside the second log excerpt:

    Nov  5 00:32:52 mailarchive piler[22661]: connection from X.X.X.X
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: sent: 220 mailarchive.domain.sk ESMTP#015
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: got: EHLO mailserver.domain.sk
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: sent: 250-mailarchive.domain.sk#015#012250-PIPELINING#015#012250-SIZE#015#012250 8BITMIME#015
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: got: MAIL FROM:<antispam@domain.sk> SIZE=1180
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: got: RCPT TO:<archiver@mailarchive.domain.sk>
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: got: DATA
    Nov  5 00:32:52 mailarchive piler[22661]: ERR: opening a tempfile: 400000005459622e37abda9c00a1a7053aed
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: sent: 250 Ok#015#012250 Ok#015#012451 Error in processing, try again later#015
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: got: RSET
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e37abda9c00a1a7053aed: removed
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e38c6437c004bd04c39bc: got: QUIT
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e38c6437c004bd04c39bc: removed
    Nov  5 00:32:52 mailarchive piler[22661]: 400000005459622e38c6437c004bd04c39bc: sent: 250 Ok#015#012221 mailarchive.domain.sk Goodbye#015
    Nov  5 00:32:52 mailarchive piler[22661]: processed 0 messages
    Nov  5 00:32:53 mailarchive piler[22662]: connection from X.X.X.X
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: sent: 220 mailarchive.domain.sk ESMTP#015
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: got: EHLO mailserver.domain.sk
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: sent: 250-mailarchive.domain.sk#015#012250-PIPELINING#015#012250-SIZE#015#012250 8BITMIME#015
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: got: MAIL FROM:<antispam@domain.sk> SIZE=1213
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: got: RCPT TO:<archiver@mailarchive.domain.sk>
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: got: DATA
    Nov  5 00:32:53 mailarchive piler[22662]: ERR: opening a tempfile: 400000005459622f00b8beec00208ecba87b
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: sent: 250 Ok#015#012250 Ok#015#012451 Error in processing, try again later#015
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: got: RSET
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f00b8beec00208ecba87b: removed
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f021e984c001e4b7bc3db: got: QUIT
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f021e984c001e4b7bc3db: removed
    Nov  5 00:32:53 mailarchive piler[22662]: 400000005459622f021e984c001e4b7bc3db: sent: 250 Ok#015#012221 mailarchive.domain.sk Goodbye#015
    Nov  5 00:32:53 mailarchive piler[22662]: processed 0 messages
    
  19. Janos SUTO repo owner

    The first log snippet shows that piler has a hard time processing the attachment. The second snippet indicates a serious issue: piler can't write a temp file. Please check if there's enough space under /var/piler/tmp. Perhaps a complete reboot may be a good idea.

  20. Michal Mistina reporter

    You are correct. Only reboot helped to get rid of the error message. This is not a solution, but I will try to find out what's wrong with the tmp directory and we will see then if it helps completely.

  21. Michal Mistina reporter

    Janos, I am closing this issue. Customer resolved it by his workaround of periodically restarting piler process with cron and I have no other response from him. Thank you for your help. If there is something new about it I just reopen the issue.

  22. Log in to comment