Attachment search issue

Issue #24 resolved
Dimple Mehta created an issue

Searching in the attachment body fails with mailpiler. Any suggestion would be great.

Comments (31)

  1. Janos SUTO repo owner

    What version do you use (piler -v)? What kind of attachment do you have to index and search?

  2. Janos SUTO repo owner

    I meant, please show me the output of the "piler -v" command to see which version exactly you are running.

    Please also show me the contents of piler-config.h file.

  3. Dimple Mehta reporter

    Content of piler-config.h file:

    /*
     * config.h, SJ
     */

    #ifndef _CONFIG_H
    #define _CONFIG_H

    #include <syslog.h>
    #include "piler-config.h"
    #include "params.h"

    #define PROGNAME "piler"

    #define VERSION "0.1.21"

    #define BUILD 705

    #define HOSTID "mailarchiver"

    #define CONFIG_FILE CONFDIR "/piler.conf"
    #define WORK_DIR DATADIR "/piler/tmp"
    #define QUEUE_DIR DATADIR "/piler/store"

    #define CLAMD_SOCKET "/tmp/clamd"

    #define PIDFILE "/var/run/piler/piler.pid"
    #define QUARANTINELEN 255
    #define TIMEOUT 60
    #define TIMEOUT_USEC 500000
    #define SESSION_TIMEOUT 420
    #define MAXBUFSIZE 8192
    #define SMALLBUFSIZE 512
    #define BIGBUFSIZE 131072
    #define REALLYBIGBUFSIZE 524288
    #define TINYBUFSIZE 128
    #define MAXVAL 256
    #define RANDOM_POOL "/dev/urandom"
    #define RND_STR_LEN 36
    #define BUFLEN 32
    #define IPLEN 16+1
    #define KEYLEN 56

    #define CRLF "\n"

    #define MEMCACHED_CLAPF_PREFIX "_piler:"
    #define MAX_MEMCACHED_KEY_LEN 250

    #define MEMCACHED_SUCCESS 0
    #define MEMCACHED_FAILURE 1

    #define MEMCACHED_COUNTERS_LAST_UPDATE MEMCACHED_CLAPF_PREFIX "counters_last_update"
    #define MEMCACHED_MSGS_RCVD MEMCACHED_CLAPF_PREFIX "rcvd"
    #define MEMCACHED_MSGS_VIRUS MEMCACHED_CLAPF_PREFIX "virus"
    #define MEMCACHED_MSGS_DUPLICATE MEMCACHED_CLAPF_PREFIX "duplicate"
    #define MEMCACHED_MSGS_IGNORE MEMCACHED_CLAPF_PREFIX "ignore"
    #define MEMCACHED_MSGS_SIZE MEMCACHED_CLAPF_PREFIX "size"

    #define LOG_PRIORITY LOG_INFO

    #define _LOG_INFO 3
    #define _LOG_DEBUG 5

    #define MAX_RCPT_TO 128

    #define MIN_WORD_LEN 3
    #define MAX_WORD_LEN 25
    #define MAX_TOKEN_LEN 4*MAX_WORD_LEN
    #define DELIMITER ' '
    #define BOUNDARY_LEN 255
    #define MAX_ATTACHMENTS 16
    #define MAX_ZIP_RECURSION_LEVEL 2

    /* SQL stuff */

    #define SQL_SPHINX_TABLE "sph_index"
    #define SQL_METADATA_TABLE "metadata"
    #define SQL_ATTACHMENT_TABLE "attachment"
    #define SQL_FOLDER_TABLE "folder"
    #define SQL_RECIPIENT_TABLE "rcpt"
    #define SQL_ARCHIVING_RULE_TABLE "archiving_rule"
    #define SQL_RETENTION_RULE_TABLE "retention_rule"
    #define SQL_COUNTER_TABLE "counter"
    #define SQL_OPTION_TABLE "option"
    #define SQL_MESSAGES_VIEW "v_messages"
    #define SQL_ATTACHMENTS_VIEW "v_attachment"

    /* Error codes */

    #define OK 0
    #define ERR 1
    #define ERR_EXISTS 2

    #define AVIR_OK 0
    #define AVIR_VIRUS 1

    #define DIRECTION_INCOMING 0
    #define DIRECTION_INTERNAL 1
    #define DIRECTION_OUTGOING 2
    #define DIRECTION_INTERNAL_AND_OUTGOING 3

    #define WRITE_TO_STDOUT 0
    #define WRITE_TO_BUFFER 1

    #endif /* _CONFIG_H */

  4. Dimple Mehta reporter

    Output of the piler -V command:

    piler 0.1.21, build 705, Janos SUTO <sj@acts.hu>

    Build Date: Mon Sep 17 15:10:26 CEST 2012
    Configure command: ./configure --localstatedir=/var

  5. Janos SUTO repo owner

    OK, thanks for the version; however, you pasted $sourcedir/src/config.h and not $sourcedir/piler-config.h. In order to support attachment indexing, you need a few utilities, like pdftotext, catdoc, libzip, and unrtf. If you show me piler-config.h, then I can tell you whether piler knows about these or not.
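
The check described in this comment can be approximated with a short script. This is only a minimal sketch and not part of piler: the function name `helper_status` and the sample header text are hypothetical, and it assumes the generated piler-config.h marks each helper with either a `#define HAVE_...` line or an `#undef HAVE_...` line (possibly wrapped in a C comment).

```python
import re

# Matches '#define HAVE_X value', '#undef HAVE_X' and '/* #undef HAVE_X */'.
_MACRO_RE = re.compile(
    r"#\s*(define|undef)\s+(HAVE_\w+)(?:[ \t]+(.*?))?\s*(?:\*/)?\s*$"
)


def helper_status(header_text):
    """Map each HAVE_* macro to its defined value, or None if it is undefined."""
    status = {}
    for line in header_text.splitlines():
        match = _MACRO_RE.search(line)
        if not match:
            continue  # not a HAVE_* line (e.g. CONFDIR, DATADIR)
        directive, name, value = match.groups()
        if directive == "define":
            status[name] = (value or "").strip()
        else:
            status[name] = None
    return status


# Hypothetical sample input, shaped like a generated piler-config.h.
sample = '''
#define CONFDIR "/usr/local/etc"
#define HAVE_PDFTOTEXT "/usr/bin/pdftotext"
/* #undef HAVE_ZIP */
'''

for name, value in helper_status(sample).items():
    print(name, "->", value if value is not None else "not available")
```

Any macro reported as not available would suggest that configure did not find the corresponding utility or library when piler was built.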

  6. Dimple Mehta reporter

    #define CONFDIR "/usr/local/etc"
    #define DATADIR "/usr/local/var"

    #define KEYFILE CONFDIR "/piler.key"

    #define HAVE_DAEMON 1

    #undef HAVE_PDFTOTEXT
    #undef HAVE_CATDOC
    #undef HAVE_CATPPT
    #undef HAVE_XLS2CSV
    #undef HAVE_UNRTF
    #undef HAVE_ZIP

  7. Janos SUTO repo owner

    No, don't change it by hand, but rather install libzip, pdftotext, catdoc and unrar utilities, then run configure again, and recompile, then reinstall the binaries.

  8. Dimple Mehta reporter

    We installed libzip, pdftotext, catdoc and unrar. We downloaded the VMware image from your website.

  9. Janos SUTO repo owner

    OK, so you have installed these utilities, recompiled and reinstalled piler. Then please take an EML format message that has a PDF file attached, and run:

    ./src/pilertest the-message.eml

    and it should display the contents of the pdf file, too.

  10. Dimple Mehta reporter

    Yes, it displays the contents of the PDF file. Then I imported that mail and reindexed it. But when I search for something like "attchment:pdf,abody:GB2127264", it pulls all the mails. Can you please tell me where I went wrong?

  11. Ronaldo Teixeira

    Hello.

    I have problems searching inside attachments.

    My version: 1.2.0 build 952

    piler-config.h contents:

    #define HAVE_PDFTOTEXT "/usr/bin/pdftotext"
    #define HAVE_CATDOC "/usr/bin/catdoc"
    #define HAVE_CATPPT "/usr/bin/catppt"
    #define HAVE_XLS2CSV "/usr/bin/xls2csv"
    #define HAVE_PPTHTML "/usr/bin/ppthtml"
    #define HAVE_UNRTF "/usr/bin/unrtf"
    #define HAVE_TNEF "/usr/bin/tnef"
    #define HAVE_ZIP 1
    #define HAVE_LIBWRAP 1

    When I run pilertest against an EML file containing a PDF, I can see the text inside the PDF file. But when I search in the web UI (abody:sometext), no result is displayed.

  12. Ronaldo Teixeira

    The text exists. I can see it when I run pilertest against the eml message file.

    The e-mail was archived 7 days ago and I can find it in a search using "from" or "to".

  13. Ronaldo Teixeira

    pilerget [piler_id of the PDF attachment]

    Shows me entire message, like I see in the eml file.

    The PDF file contains the string "P1911701". My search is for that string.

    abody:P1911701

    or just

    P1911701

    Note:

    I have two attachments in message: PNG and PDF files.

    Executing pilerget separately:

    pilerget 400000005880cf730beac60c0055af6e0e7e 1

    zpipe: invalid or incomplete deflate data

    pilerget 400000005880cf730beac60c0055af6e0e7e 2

    zpipe: invalid or incomplete deflate data

  14. Janos SUTO repo owner

    "Shows me entire message, like I see in the eml file."

    Yes, that's what I need. Btw. why do you keep using 'abody'? It's an invalid keyword.

  15. Ronaldo Teixeira

    I saw "abody" here, on this page.

    "Dimple Mehta: Yes, it displays the contents of the PDF file. Then I imported that mail and reindexed it. But when I search for something like "attchment:pdf,abody:GB2127264", it pulls all the mails. Can you please tell me where I went wrong? - 2012-09-24"

    OK, abody is invalid. Understood. But even if I do not use "abody" and search only for the string, the results do not include the message and its attachments.

    When I run pilerget for the attachments separately, I get an error:

    pilerget 400000005880cf730beac60c0055af6e0e7e 1

    zpipe: invalid or incomplete deflate data

    pilerget 400000005880cf730beac60c0055af6e0e7e 2

    zpipe: invalid or incomplete deflate data

    Is that a problem?

  16. Janos SUTO repo owner

    Yes, that's a problem, because pilerget is used to get the attachment data separately. Anyway, getting the attachment returns the base64 encoded stuff only. What I need is, however, let me quote myself, "I need to see the pilerget output and your search query". I can't proceed without it.

  17. virusbrain

    I have the same problem.

    pilerget Output:

    locale: de_DE.UTF-8
    build: 952
    parsing...
    post parsing...
    message-id: <1461147002.12.1485514214565@my.ser.ver.com> / e75c5333a49e4170b6bf84afc7b25291cfe7e515a375bbf93398ac3c31769d31
    from: *test2 betauser test2@domain.com test2 domain com  (domain.com)*
    to: *test2 betauser test2@domain.com test2 domain com  (domain.com )*
    reference: **
    subject: *Mail with pdf*
    body: *here is a nice text
    Textfrompdf
    
    
    *
    sent: 1485514214, delivered-date: 0
    hdr len: 771
    body digest: d88cf0733106a22bd94af6ec54af5d43474576b314391b1fe7a9657e9b75b2cc
    rules check: (null)
    folder: 0
    retention period: 1706445385
    i:1, name=*test.pdf*, type: *application/pdf*, size: 15834, int.name: test.eml.a1, digest: c91e9c448f4b8412af378136d9fdecf0d973e8ed2745f3fe274cae41274e98d7
    attachments:pdf,
    direction: 0
    spam: 0
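
One way to confirm that the attachment text made it into the parsed body is to scan output like the above mechanically. This is a minimal sketch, assuming the field layout shown in this comment (a `body: *...*` field that may span several lines, and an `attachments:` line listing the types); the function names are hypothetical and not part of piler.

```python
import re


def parsed_body(output):
    """Return the text between 'body: *' and the '*' that closes the field.

    Assumes the closing '*' is the last character on its line, as in the
    output above; a body line ending in '*' would cut the result short.
    """
    match = re.search(r"^\s*body: \*(.*?)\*\s*$", output, re.DOTALL | re.MULTILINE)
    return match.group(1) if match else ""


def attachment_types(output):
    """Return the types listed on the 'attachments:' line, e.g. ['pdf']."""
    match = re.search(r"^\s*attachments:(.*)$", output, re.MULTILINE)
    if not match:
        return []
    return [t.strip() for t in match.group(1).split(",") if t.strip()]


def term_in_body(output, term):
    """Case-insensitive check for a search term in the parsed body text."""
    return term.lower() in parsed_body(output).lower()


# Shortened sample, shaped like the output above.
sample = (
    "subject: *Mail with pdf*\n"
    "body: *here is a nice text\n"
    "Textfrompdf\n"
    "\n"
    "*\n"
    "body digest: d88cf073\n"
    "attachments:pdf,\n"
)

print(term_in_body(sample, "Textfrompdf"), attachment_types(sample))
```

If the term shows up in the parsed body while the search still returns no hits, text extraction itself appears to work, so under these assumptions the next place to look would be the index or the query.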
    

    and my search query:

    sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH('@(subject,body)  Textfrompdf') ORDER BY `sent` DESC LIMIT 0,20 OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found
    

    And yes, I have created a PDF with the text "Textfrompdf" ;-)
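
To read a logged query such as the one above more easily, it can be split into its parts. The following is a rough sketch that assumes the log line layout shown in this comment (other piler versions may log differently); `summarize_query` is a hypothetical helper, not a piler or Sphinx function.

```python
import re

# Pattern for the log line layout quoted above; this layout is an assumption.
_LOG_RE = re.compile(
    r"FROM\s+(?P<indexes>\S+)\s+WHERE\s+MATCH\('@\((?P<fields>[^)]*)\)\s+(?P<term>.*?)'\)"
    r".*?(?P<hits>\d+) hits, (?P<total>\d+) total found",
    re.DOTALL,
)


def summarize_query(log_line):
    """Extract indexes, searched fields, term and hit counts from a log line."""
    match = _LOG_RE.search(log_line)
    if match is None:
        return None
    return {
        "indexes": match.group("indexes").split(","),
        "fields": match.group("fields").split(","),
        "term": match.group("term").strip(),
        "hits": int(match.group("hits")),
        "total": int(match.group("total")),
    }


line = (
    "sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        "
    "MATCH('@(subject,body)  Textfrompdf') ORDER BY `sent` DESC LIMIT 0,20 "
    "OPTION max_matches=1000' in 0.00 s, 0 hits, 0 total found"
)
print(summarize_query(line))
```

Read this way, the query matches the term against the subject and body fields of three indexes and finds nothing in any of them.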

    And my piler-config.h

    /* piler-config.h.  Generated from piler-config.h.in by configure.  */
    /*
     * piler-config.h.in, SJ
     */
    
    #define CONFDIR "/etc"
    #define DATADIR "/var"
    #define DATAROOTDIR "/usr/local/share"
    
    #define KEYFILE CONFDIR "/piler/piler.key"
    #define LICENCE_SIGNATURE_FILE CONFDIR "/piler/piler.lic"
    
    #define MESSAGE_ID_DEDUP_FILE DATAROOTDIR "/piler/deduphelper"
    
    #define HAVE_DAEMON 1
    
    #define TIMEOUT_BINARY "/usr/bin/timeout"
    
    #define HAVE_PDFTOTEXT "/usr/bin/pdftotext"
    #define HAVE_CATDOC "/usr/bin/catdoc"
    #define HAVE_CATPPT "/usr/bin/catppt"
    #define HAVE_XLS2CSV "/usr/bin/xls2csv"
    /* #undef HAVE_PPTHTML */
    #define HAVE_UNRTF "/usr/bin/unrtf"
    #define HAVE_TNEF "/usr/bin/tnef"
    /* #undef HAVE_ZIP */
    
    #define HAVE_LIBWRAP 1
    
    /* #undef HAVE_TWEAK_SENT_TIME */
    
    /* #undef HAVE_SUPPORT_FOR_COMPAT_STORAGE_LAYOUT */
    
  18. Janos SUTO repo owner

    Dear Lord! Please help me by putting this into a file (click More -> Attach file), and remove the email from the comment itself. Anyway, thanks for the data; I'll try to reproduce your search.

  19. Ronaldo Teixeira

    The search string is "takaoka".

    sphinx query: 'SELECT id FROM main1,dailydelta1,delta1 WHERE        MATCH('@(subject,body)  takaoka') ORDER BY `sent` DESC LIMIT 0,100 OPTION max_matches=5000' in 0.00 s, 0 hits, 0 total found
    