Not able to find mails in piler webui

Issue #950 closed
Rajesh created an issue

Hi,

We have installed piler 1.3.4 and sphinxsearch 2.2.9 on Ubuntu 16.04.

We are facing the following issue: we are receiving mails on the piler server from the SMTP gateway and can see them in the mail logs, but some mails are missing in the webui.

When I run a mysql query against the metadata table, I cannot find particular mails that do appear in the mail.log of the piler server.

tailf /var/log/mail.log|grep glraj

Dec 5 11:23:14 piler piler-smtp[16596]: received: 8BK9DWBULJCOB7CN, from=glraj@bel.co.in, size=2479, client=172.20.50.9

Kindly guide us for further troubleshooting.

Regards, Rajesh G L

Comments (27)

  1. Janos SUTO repo owner

    For starters check if 8BK9DWBULJCOB7CN was indeed processed by piler daemon. If yes, then verify that its status is stored.
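
    A minimal sketch of that check, assuming the default mail log location and the stock piler metadata schema (the database name, user and the piler_id column are assumptions, verify them against your installation):

    # did the piler daemon log a status=stored line for this message?
    grep 8BK9DWBULJCOB7CN /var/log/mail.log

    # the stored line contains the internal id (the 400000... value);
    # look that up in the metadata table
    mysql -u piler -p piler -e "SELECT id, piler_id FROM metadata WHERE piler_id='<internal id from the stored line>';"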

  2. Rajesh reporter
    1. piler-smtp is receiving and processing all the messages.

    2. The piler daemon is processing and storing a few of the messages, as shown below:

    Dec 11 15:22:58 piler piler-smtp[6212]: received: 1UW2OI1XYQLIDXB7, from=biswabhusan@bel.co.in, size=110446, client=172.20.50.9
    Dec 11 15:22:58 piler piler[6094]: 49/1UW2OI1XYQLIDXB7: 400000005c0f890423baf32400b323fa2c15, size=110446/68376, attachments=1, reference=576815f257327c57a30cd21ff14ccd97@bel.co.in, message-id=9c9cadaba2f78a4a768c16694198e733@bel.co.in, retention=2557, folder=0, delay=0.0458, status=stored

    3. Most of the messages are not processed by the piler daemon; the output for one such mail is shown below:

    Dec 11 14:27:27 piler piler-smtp[32700]: processing command: MAIL FROM:mailsupport@bel.co.in SIZE=1653#015#012
    Dec 11 14:27:27 piler piler-smtp[32700]: received: AN911ZNGOV13QLGI, from=mailsupport@bel.co.in, size=1650, client=172.20.50.9

  3. Janos SUTO repo owner

    I'm confused. In which dir (somewhere under /var/piler) is the AN911ZNGOV13QLGI file? If piler-smtp received it, then it should put it in a subdir under /var/piler/tmp. If there was a processing error, then it should put it in /var/piler/error. However, in all cases the piler daemon syslogs whatever happened. Btw. please use the Bitbucket formatting before submitting any logs, because it's unreadable otherwise.
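
    To locate that file, something like this should do (the spool file is named after the SMTP transaction id shown in the received: log line):

    # search the spool and error trees for the received message
    find /var/piler/tmp /var/piler/error -name '*AN911ZNGOV13QLGI*' 2>/dev/null

    # and see what the piler daemon itself logged about it
    grep AN911ZNGOV13QLGI /var/log/mail.log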

  4. Rajesh reporter

    The issue was with the mysql version; we were using mysql 5.7, which was causing problems with storing mails.

    We removed mysql 5.7 and installed mariadb version 10.03. Now it is working fine without issues.

    Thanks for the support.

  5. Rajesh reporter
    • changed status to open

    Hi,

    The issue reappeared after working for 3 months without any problem. Mails are being received, but they are not getting stored in the mysql metadata. Once we restart piler it starts storing for a few minutes, then it stops storing again.
    Mariadb version: 10.0.38, piler: 1.3.4

    kindly guide us for troubleshooting

    With Regards, Rajesh G L

  6. Janos SUTO repo owner

    I need the relevant logs: mail log, webserver error logs, etc. Make sure to format them.

  7. Rajesh reporter

    Hi Janos, thanks for the reply. I have attached the logs. The issue started with a network disruption on 26/Mar/2019. Yesterday we observed that around 400000 new mails (on average 35000 mails per day) are in the store directory but are not visible in the GUI. I have done reindexing as per the document. Only some of the mails are stored and viewable in the GUI. If I start rc.piler manually it stores some mails, then storing suddenly stops, although the rc.piler service is still active.

    Is there any other way to store the remaining 300000 mails from the store directory into the mysql metadata?

    Due to this issue, the new mails are also not visible in the GUI.

  8. Janos SUTO repo owner

    Based on the logs somebody stopped the mysql daemon. Make sure it's always available. Also check if the indexer job runs properly. Try to run it manually as user piler; it should complete without errors and in a timely fashion.
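
    A sketch of running it by hand, assuming the stock cron jobs and default install paths (the script location and the delta1 index name are assumptions and may differ on your setup):

    # run the delta indexer once as the piler user
    sudo -u piler /usr/local/libexec/piler/indexer.delta.sh

    # or invoke the sphinx indexer directly
    sudo -u piler indexer --config /usr/local/etc/piler/sphinx.conf delta1 --rotate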

  9. Rajesh reporter

    Hi Janos,

    Thanks for reply,

    The issue started with a network disruption. We observed around 800000 new mails (on average 35000 mails per day) in the store directory which were not visible in the GUI. We stopped the SMTP relay to mailpiler and restarted the piler, searchd and mysql services. Indexing started, but mails were stored for only about 5 minutes; after restarting the piler services it started again, so we added a cron job to restart the piler services every 5 minutes. It took almost 3 days to store the 800000 mails in the mysql metadata.

    After the mails were restored we started relaying mails again, and now the new mails are getting stored without any issues.

    We have tuned the mysql server as per the piler documentation; we increased innodb_buffer_pool_size to 5GB (for a RAM size of 32 GB) in the mariadb server configuration.

    What procedure should be followed to handle both the mails in the store directory and the new mails during a network disruption?

    Is there any mechanism to monitor whether all mails are getting stored in the mysql metadata?

  10. Janos SUTO repo owner

    I'm confused. Based on your numbers it took almost 3 weeks for you to discover that something was not right with the archive. I recommend you upgrade your monitoring.

    For starters check if piler is responding on port 25. Then check if all piler related processes are running, including the cron jobs, and even the sphinx search daemon and mysql as well. Then it's worth checking the mail log, too. If piler encounters any issue, it syslogs it. Search for "ERROR" case insensitively (perhaps I wasn't consistent enough to use upper case everywhere).

    Then recent builds of piler feature the error directory (/var/piler/error). If there's any problem during the processing of an email, piler syslogs it, and moves the problematic email to /var/piler/error for later analysis. Be sure to check this directory as well.

    Finally check the number of rows in the sph_index table (select count(*) from sph_index). If it only grows and never shrinks, then there's an indexing problem. For a hint, check how long the delta indexer runs. If it's unusually slow and takes a lot of time, there's a problem, probably insufficient mysql resources.

    And if piler worked for only 5 minutes, then I wouldn't consider a simple restart the solution (at least not too often); rather, you should dig into what's going on. Again, the logs are your primary target.
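
    Roughly, the checks above translate to something like this (the mysql credentials and database name are assumptions, adjust to your setup):

    # 1. is piler answering on port 25?
    echo QUIT | nc -w 5 localhost 25

    # 2. are all the daemons running?
    pgrep -x piler-smtp; pgrep -x piler; pgrep -x searchd; pgrep -x mysqld

    # 3. any processing errors in the mail log?
    grep -i error /var/log/mail.log | tail

    # 4. anything parked in the error directory?
    ls /var/piler/error | wc -l

    # 5. is the indexing queue draining, or does sph_index only grow?
    mysql -u piler -p piler -e "SELECT COUNT(*) FROM sph_index;"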

  11. Rajesh reporter
    • changed status to open

    Hi Janos,

    Sorry for the delayed reply. We are trying to build monitoring tools to watch the piler services and SMTP.

    If you could give us some suggestions for building these tools, it would help us a lot.

    The problem was repeating frequently. Our observation is that whenever piler receives a larger volume of mails, it does not store them in the mysql metadata.

    After running this command: /etc/init.d/rc.piler stop && ps -ef | grep piler | awk '{print $2}' | xargs kill -9; /etc/init.d/rc.piler start

    it starts storing again for an average of 2 minutes and then stops storing again, hence we are executing the above command every 3 minutes (through a cron job) whenever the issue is observed, to store all the mails that piler receives. By this method all the mails get stored.

    We checked all the logs but did not find any clue.

    We receive on average 60000 to 70000 mails per day.

    Kindly guide us to debug this issue in depth

    Regards,

    Rajesh G L

  12. Janos SUTO repo owner

    I wouldn’t build a monitoring system, instead I’d use an existing one, eg. nagios, zabbix, sensu, telegraf, prometheus, whatever your favourite one is.

    Mysql is a critical component. Check if it has enough resources.

    Then, if piler encounters an error, it syslogs the issue. So don't just restart piler every 2 minutes; rather, check the logs to see what's happening. Also, I'd like to know what “more mails” means. How many emails do you get at the same time when you face problems? What are your piler settings? Show me the piler related running processes from the ps axuw output, and the df -h output.

    Anyway, 80k messages/day should be a piece of cake for piler. If nothing helps, I'd like to see this host myself via ssh.

  13. Rajesh reporter

    Hi Janos,

    Please find the outputs

    /home/beladmin# df -h
    Filesystem Size Used Avail Use% Mounted on
    udev 16G 0 16G 0% /dev
    tmpfs 3.2G 338M 2.8G 11% /run
    /dev/sda2 246G 42G 193G 18% /
    tmpfs 16G 0 16G 0% /dev/shm
    tmpfs 5.0M 0 5.0M 0% /run/lock
    tmpfs 16G 0 16G 0% /sys/fs/cgroup
    /dev/mapper/mailbkp-belmail 6.0T 3.3T 2.4T 58% /var/belmail
    /dev/mapper/syncpiler-mailpiler 7.8T 678G 6.7T 9% /var/piler
    tmpfs 100K 0 100K 0% /run/lxcfs/controllers
    tmpfs 3.2G 0 3.2G 0% /run/user/1000
    tmpfs 3.2G 0 3.2G 0% /run/user/0

    ps axuw|grep piler

    piler 13175 0.1 0.0 28420 2660 ? Ss May15 1:27 /usr/local/sbin/piler-smtp -d
    piler 13177 0.0 0.0 37484 540 ? Ss May15 0:00 /usr/local/sbin/piler -d
    piler 13178 0.6 0.1 90792 55044 ? S May15 5:46 /usr/local/sbin/piler -d
    piler 13179 0.6 0.1 76836 41024 ? S May15 5:40 /usr/local/sbin/piler -d
    piler 27442 0.0 0.0 113288 1524 ? S May15 0:00 searchd --config /usr/local/etc/piler/sphinx.conf
    piler 27443 0.1 2.6 1096932 858640 ? Sl May15 1:28 searchd --config /usr/local/etc/piler/sphinx.conf
    root 29972 0.0 0.0 12944 960 pts/0 S+ 09:06 0:00 grep --color=auto piler

    root@piler:/home/beladmin# ps axuw|grep searchd
    piler 27442 0.0 0.0 113288 1524 ? S May15 0:00 searchd --config /usr/local/etc/piler/sphinx.conf
    piler 27443 0.1 2.6 1096932 858640 ? Sl May15 1:28 searchd --config /usr/local/etc/piler/sphinx.conf
    root 29989 0.0 0.0 12944 980 pts/0 S+ 09:06 0:00 grep --color=auto searchd

    ps axuw|grep mysql
    root 26954 0.0 0.0 19892 2924 ? S May15 0:00 /bin/bash /usr/bin/mysqld_safe
    mysql 27121 0.4 9.5 18615740 3134932 ? Sl May15 5:00 /usr/sbin/mysqld --basedir=/usr --datadir=/var/lib/mysql --plugin-dir=/usr/lib/mysql/plugin --user=mysql --log-error=/var/log/mysql/error.log --pid-file=/var/run/mysqld/mysqld.pid --socket=/var/run/mysqld/mysqld.sock --port=3306
    root 30065 0.0 0.0 12944 980 pts/0 S+ 09:07 0:00 grep --color=auto mysql

    These are the mysql settings in our my.cnf:

    bind-address = 127.0.0.1

    innodb_buffer_pool_size = 16G (50% of our total RAM size of 32GB)
    innodb_flush_log_at_trx_commit=1
    innodb_log_buffer_size=64M
    innodb_log_file_size=64M
    innodb_read_io_threads=4
    innodb_write_io_threads=4
    innodb_log_files_in_group=2

    innodb_file_per_table

    key_buffer_size = 16M
    max_allowed_packet = 16M
    thread_stack = 192K
    thread_cache_size = 8

    Can I share AnyDesk with you so you can log in to our piler server?

    Regards,

    Rajesh G L

  14. Rajesh reporter

    Hi Janos,

    When we were trying to move the mails from /var/piler/tmp/0.old (which was created by you in the previous remote session) to the /var/piler/tmp/0/ directory, most of the mails were processed, but suddenly piler started creating a large number of temporary files under the /var/piler/tmp/ directory. The sample files shown below are filling up the /var/piler/tmp directory (up to 3 TB of space) within a span of hours.

    400000005cf674ec13aba6f400545fd13f70.a1
    400000005cf674ec13aba6f400545fd13f70.a2
    400000005cf674ec13aba6f400545fd13f70.a3
    400000005cf674ec13aba6f400545fd13f70.a4
    400000005cf674ec13aba6f400545fd13f70.a5
    400000005cf674ec13aba6f400545fd13f70.a6.bin
    400000005cf674ec13aba6f400545fd13f70.m

    =====================================================================

    Looking at /var/log/syslog, the following errors were found; the piler process keeps forking new children, which segfault immediately.

    Jun 4 08:28:28 piler kernel: [5963303.163791] piler[8832]: segfault at 0 ip 00007f30bb63cad3 sp 00007ffe7a2b2938 error 4 in libc-2.23.so[7f30bb5b3000+1c0000]
    Jun 4 08:28:28 piler piler[8834]: forked a child (pid: 8834)
    Jun 4 08:28:28 piler piler[8834]: child (pid: 8834, serial: 0) started main() working on '0'
    Jun 4 08:28:28 piler kernel: [5963303.493789] piler[8834]: segfault at 0 ip 00007f30bb63cad3 sp 00007ffe7a2b2938 error 4 in libc-2.23.so[7f30bb5b3000+1c0000]
    Jun 4 08:28:29 piler piler[8836]: forked a child (pid: 8836)
    Jun 4 08:28:29 piler piler[8836]: child (pid: 8836, serial: 0) started main() working on '0'
    Jun 4 08:28:29 piler kernel: [5963303.816599] piler[8836]: segfault at 0 ip 00007f30bb63cad3 sp 00007ffe7a2b2938 error 4 in libc-2.23.so[7f30bb5b3000+1c0000]
    Jun 4 08:28:29 piler piler[8838]: forked a child (pid: 8838)
    Jun 4 08:28:29 piler piler[8838]: child (pid: 8838, serial: 0) started main() working on '0'
    Jun 4 08:28:29 piler kernel: [5963304.145533] piler[8838]: segfault at 0 ip 00007f30bb63cad3 sp 00007ffe7a2b2938 error 4 in libc-2.23.so[7f30bb5b3000+1c0000]
    Jun 4 08:28:29 piler piler[8840]: forked a child (pid: 8840)
    Jun 4 08:28:29 piler piler[8840]: child (pid: 8840, serial: 0) started main() working on '0'
    Jun 4 08:28:29 piler kernel: [5963304.471740] piler[8840]: segfault at 0 ip 00007f30bb63cad3 sp 00007ffe7a2b2938 error 4 in libc-2.23.so[7f30bb5b3000+1c0000]

    ===========================================================================================

    When we restart the piler processes the issue continues with the same errors; to overcome it we deleted the /var/piler/tmp directory and recreated it manually.

    After restarting the piler services, everything came back to normal.

    One more observation: whenever the piler service is stopped, the piler processes are still active; we have to kill all the piler processes manually, and only after multiple restarts does the piler service start working.

    Can I share AnyDesk with you so you can log in to our piler server whenever you are free?

    Regards,

    Rajesh

  15. Janos SUTO repo owner

    Kill all piler processes, make sure no piler nor piler-smtp processes are running. Then wait a minute, and start piler daemons. Also check the number of piler processes running. There should be 1 piler-smtp process, and n+1 piler processes where n is the max workers from piler.conf. Then let’s see if the queue clears.
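
    A quick way to verify those counts after the restart (pgrep -cx counts processes with that exact name):

    pgrep -cx piler-smtp   # should print 1
    pgrep -cx piler        # should print n+1, e.g. 3 with two workers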

  16. Rajesh reporter

    I killed all the piler processes and restarted the piler. It is working fine now.

    I changed the number of worker processes from 2 to 4 in piler.conf and restarted the piler processes.

    Do the worker processes have any impact on the storing of mails? Will this help with our issue?

  17. Janos SUTO repo owner

    Yes, the more workers you have, the greater the throughput of the system. Anyway, make sure that you don't attempt to start piler while it's already running. Perhaps I'd create a lock for the piler daemon as well to prevent such issues.
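
    One possible shape of such a guard, as a minimal sketch (this wrapper script is not part of piler itself, just an illustration of the idea):

    #!/bin/sh
    # start piler only if it is not already running
    if pgrep -x piler >/dev/null || pgrep -x piler-smtp >/dev/null; then
        echo "piler is already running, refusing to start a second instance" >&2
        exit 1
    fi
    exec /etc/init.d/rc.piler start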
