Piler Segfaulting

Issue #206 resolved
Travis Edgar created an issue

I can get piler running without issue for a couple of days or so. We send about 40,000 messages a day to it. Going through my logs, I am finding segfaults "sprinkled" through them. I believe these segfaults are caused by specific messages that piler cannot deal with for one reason or another. This is not the end of the world, but it does cause me a little concern. Example below...

Dec 12 20:11:43 awsemailarchive01 kernel: [3860058.901279] piler[30857]: segfault at 0 ip 00007fc70aca5566 sp 00007fff917e3228 error 4 in libc-2.15.so[7fc70ab73000+1b5000]
Dec 12 20:12:39 awsemailarchive01 kernel: [3860115.671485] piler[31874]: segfault at 0 ip 00007fc70aca5566 sp 00007fff917e3228 error 4 in libc-2.15.so[7fc70ab73000+1b5000]
Dec 12 20:17:23 awsemailarchive01 kernel: [3860399.613671] piler[31897]: segfault at 0 ip 00007fc70aca55c7 sp 00007fff917e3228 error 4 in libc-2.15.so[7fc70ab73000+1b5000]
Dec 12 20:43:38 awsemailarchive01 kernel: [3861974.688782] piler[32194]: segfault at 0 ip 00007fc70aca5566 sp 00007fff917e3228 error 4 in libc-2.15.so[7fc70ab73000+1b5000]
Dec 12 21:16:46 awsemailarchive01 kernel: [3863962.158223] piler[32438]: segfault at 316e6974616c ip 00007fc70aca5566 sp 00007fff917e3228 error 4 in libc-2.15.so[7fc70ab73000+1b5000]

Around the same time, the piler web GUI shows an error: "piler: ERROR".

While this error is showing, I can still manually open a telnet session to piler, and everything works. However, the deferred queues on my mail server (the one sending to piler, not piler itself) start to fill up. Fixing this requires restarting the piler daemons.

1.) Is there a way to set a custom location for piler logs?

2.) What could be the cause of the segfaults?
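
As a reading aid for the kernel lines quoted above, here is a minimal C sketch that decodes three of their fields. It assumes an x86-64 host, where the number after `error` is the page-fault error code. The function names are made up for illustration and are not part of piler.

```c
#include <stdint.h>
#include <stdio.h>

/* Low bits of the x86 page-fault error code printed as "error N". */
#define PF_PROT  0x1u  /* 0: page not present, 1: protection violation */
#define PF_WRITE 0x2u  /* 0: read access,      1: write access */
#define PF_USER  0x4u  /* 0: kernel mode,      1: user mode */

/* Describe the error code in words (hypothetical helper). */
void describe_fault(unsigned err, char *buf, size_t len)
{
    snprintf(buf, len, "%s-mode %s, %s",
             (err & PF_USER) ? "user" : "kernel",
             (err & PF_WRITE) ? "write" : "read",
             (err & PF_PROT) ? "protection violation" : "page not present");
}

/* Offset of the faulting instruction inside the mapped library:
 * "ip" minus the base address shown in brackets after the library name. */
uint64_t fault_offset(uint64_t ip, uint64_t map_base)
{
    return ip - map_base;
}

/* Read a fault address as little-endian ASCII bytes, stopping at the
 * first zero byte. Useful when the "address" looks like text. */
void addr_as_text(uint64_t addr, char out[9])
{
    int i = 0;
    for (; i < 8 && (addr & 0xff); i++, addr >>= 8)
        out[i] = (char)(addr & 0xff);
    out[i] = '\0';
}
```

With the values from the log, `error 4` decodes to a user-mode read of an unmapped page, and `segfault at 0` means the address read was NULL. The offset `ip - base` can be handed to a symbol lookup tool together with the matching libc build to find the function involved. The last line's address, 316e6974616c, reads as the text "latin1" when its bytes are taken in memory order, which may hint that string data was used as a pointer, though that is only a guess from the log.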

Comments (10)

  1. Janos SUTO repo owner

    Piler uses the syslog facility, so you may redirect it to a custom location by changing your syslog config. The segfault may be a programming error, not sure.

    If this is 0.1.24, then please try the latest master branch (https://bitbucket.org/jsuto/piler/get/master.tar.gz). The upgrade is nothing serious; just make sure to apply util/db-upgrade-0.1.24-vs-0.1.25.sql

    Let me know if it helps.
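
On the syslog point in the comment above: a syslog daemon routes messages by facility and severity, so the facility a program passes to `openlog()` decides which rule in the syslog config matches it. The sketch below is a generic illustration of that mechanism, not piler's code. `LOG_MAIL` is an assumption, and `log_example` is a hypothetical name.

```c
#include <syslog.h>

/* Log one line the way a syslog-based daemon would, and return the
 * combined facility|severity value that the syslog daemon matches on.
 * LOG_MAIL is assumed here for illustration only. */
int log_example(const char *ident, const char *msg)
{
    openlog(ident, LOG_PID, LOG_MAIL);  /* ident and PID prefix each line */
    syslog(LOG_INFO, "%s", msg);        /* "%s" avoids format-string bugs */
    closelog();
    return LOG_MAIL | LOG_INFO;
}
```

Whatever facility the daemon really uses, a rule for that facility in the syslog configuration is what sends the messages to a custom file.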

  2. Travis Edgar reporter

    Sorry for the long delay in responding.

    I have not upgraded to the latest master branch, since I have had this install working before. My previous install and this one were/are both installed in AWS on EC2 machines, configured by Puppet, so I am confused why this build does not work. I would be willing to bet that a package I am not directly controlling is giving me issues.

    Further investigation leads me to believe there is no issue with piler itself. When my archive machine goes down, I still have multiple piler processes running. However, I also have a varying number of pdftotext processes running. The pdftotext processes bind to port 25, which prevents the piler processes from binding to port 25, and that knocks out my machine.

    I ran strace on a random sample of the pdftotext processes, and they all seem to be in a waiting state, not really doing anything.

    Run times for these processes seem to be 20+ minutes.

    Any info you may have would be greatly appreciated.

    Cheers.

  3. Janos SUTO repo owner

    No problem. I believe that pdftotext is not a networked program, so I don't think it occupies port 25. It's more likely that some pdftotext processes hang and thus the calling piler processes as well. It might be a good idea to introduce some sort of watchdog / alarm thing to kill pdftotext after 5-10 seconds.
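
The watchdog idea from the comment above could look roughly like the following C sketch: start the helper as a child process, poll for it, and kill it once a deadline passes. This is an illustration under simplifying assumptions, not piler's implementation. It runs the helper directly instead of through `sh -c`, it does not capture the helper's output, and `run_with_timeout` is a hypothetical name.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv[0] with its arguments and kill it if it is still running
 * after timeout_secs (one-second granularity).
 * Returns the child's exit status, or -1 on timeout or error. */
int run_with_timeout(char *const argv[], int timeout_secs)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127);                      /* exec failed */
    }

    time_t deadline = time(NULL) + timeout_secs;
    for (;;) {
        int status;
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)                    /* child finished on its own */
            return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
        if (r < 0)
            return -1;
        if (time(NULL) >= deadline) {    /* too slow: kill and reap it */
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            return -1;
        }
        struct timespec pause = { 0, 100 * 1000 * 1000 };  /* 100 ms */
        nanosleep(&pause, NULL);
    }
}
```

A caller would pass something like `{"pdftotext", "-enc", "UTF-8", file, "-", NULL}` and treat a return of -1 as "no text extracted for this attachment".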

  4. Travis Edgar reporter

    You are correct, pdftotext is not a network program. The parent process (which I think is called by piler) is what takes over port 25:

    piler    11390  0.0  0.0   4404   592 ?        S    19:00   0:00 sh -c /usr/bin//pdftotext -enc UTF-8 4000000052d356b8189fd34c000ef4f1b306.a1.bin -
    piler    11391  3.0  0.8  58244 14068 ?        R    19:00   4:46 /usr/bin//pdftotext -enc UTF-8 4000000052d356b8189fd34c000ef4f1b306.a1.bin -
    

    Is there any way to control how many pdftotext processes piler creates?

    In your experience, how long should a pdftotext process take to complete the conversion?

    The watchdog idea is a good one; however, I am concerned that killing the pdftotext processes will make future searches of the content within those PDFs unreliable. What are your thoughts on this?

    Once again thanks.

  5. Janos SUTO repo owner

    Try the following. Stop the piler daemon, and kill all stale pdftotext processes as well.

    Then edit src/extract.c in the piler source directory, and locate the following line:

    if(strcmp(type, "pdf") == 0) snprintf(cmd, sizeof(cmd)-1, "%s -enc UTF-8 %s -", HAVE_PDFTOTEXT, filename);

    It's on line 225 for me. Then replace it with this line:

    if(strcmp(type, "pdf") == 0) snprintf(cmd, sizeof(cmd)-1, "timeout 10 %s -enc UTF-8 %s -", HAVE_PDFTOTEXT, filename);

    And finally, recompile the piler binaries, then start the piler daemon again, and let's see if it helps. The timeout 10 command makes sure that pdftotext is terminated after 10 seconds in case it hangs.
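
The one-line change above can also be written with the timeout as a parameter. The sketch below is a hypothetical helper, not code from src/extract.c. It builds the same command string and also reports truncation, which the original `snprintf` call does not check.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the shell command for a PDF attachment,
 * prefixed with the coreutils `timeout` utility when timeout_secs > 0.
 * Returns 0 on success, -1 if the command does not fit in cmd. */
int build_pdf_cmd(char *cmd, size_t len, const char *pdftotext,
                  const char *filename, int timeout_secs)
{
    int n;

    if (timeout_secs > 0)
        n = snprintf(cmd, len, "timeout %d %s -enc UTF-8 %s -",
                     timeout_secs, pdftotext, filename);
    else
        n = snprintf(cmd, len, "%s -enc UTF-8 %s -", pdftotext, filename);

    /* snprintf returns the length it wanted to write; a value >= len
     * means the output was truncated. */
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

When `timeout` has to stop the command, it exits with status 124, so a caller that checks the exit status can tell a hung conversion apart from a normal pdftotext failure.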

  6. Travis Edgar reporter

    This has fixed our issues. Thanks.

    I am now experiencing segfault issues with Sphinx. I assume I should work with those good people to get it sorted out?

    Cheers.

  7. Janos SUTO repo owner

    OK, then I'll push this into the source tree. You may specify --with-plugin-timeout=10 to give the external helper utility 10 seconds to finish.

  8. Janos SUTO repo owner

    It's a good idea to ask the Sphinx community for help. You may also consider upgrading Sphinx to 2.1.5, which was released ~2 weeks ago.
