pilerimport not finding message-id on some messages where message-id exists

Issue #129 resolved
datapharmer created an issue

When processing a large archive I realized that only about 25% of the messages were actually imported. I tried an individual message and it was identified as a duplicate, but the id printed with it was not anything I could find in the message (it did not seem to match a message ID or anything else). I checked the message and it does contain a message ID. I don't think pilerimport is finding the message ID, because setting archive_emails_not_having_message_id=1 allows the message to be imported.

Message ID of message not imported: Message-Id: BE1ED7B1A49A224597DEF3CD2F98E3CD48BDD03A@BN1PRD0612MB636.namprd06.prod.outlook.com

Message ID of message that is imported: Message-ID: A92FAC2970B7844AA2A0EDBEA1D1FE1E017DD4738AAC@abcex References: F50890CAED3CF84F9F69A99DAE873AA613485E48@BL2PRD0611MB435.namprd06.prod.outlook.com

I can provide additional header information privately if that is more helpful.

Comments (11)

  1. datapharmer reporter

    Ok, there is possibly more to this than just the message ID. Even after waiting for the index to update I cannot find this message either by subject or from address from an auditor account. Running import causes the processed emails number to go up by 1 and the duplicate, infected and ignored remain the same but I can't find any sign of this message by searching to save my life.

  2. datapharmer reporter

    Ok, I looked around a little more and the pst I imported was for June (GMT). There are fewer emails than I would expect for any given day, and there are none at all after the first 5 days. Searching for the entire date range shows 1000 emails but piler reports that 63477 have been processed. I'm not sure where the discrepancy is, but I am guessing there is some sort of limiting going on somewhere.

  3. Janos SUTO repo owner

    Please run the pilertest utility, it prints out some statistics about the email:

    piler-0.1.xx/src/pilertest /path/to/message.eml

    Please check whether pilertest can recognise the message-id. If it can't see it in both messages, then please send the (full) headers to my address privately.

    Note that piler (or sphinx) doesn't return all possible hits, just 1000 (this can be overridden). So 1000 actually means 1000+.

    Please also show me piler -V, I'm interested in the build number. If it's between 832-834, then you should pull the latest master branch, and update the piler binaries asap.

  4. datapharmer reporter

    I am in that range (though webgui is newer) but pulling the latest master I got configure: error: invalid variable name: ` --localstatedir' on configuring for install

  5. datapharmer reporter

    If I remove localstatedir I also get configure: error: invalid variable name: ` --with-database' but if I remove with-database I get please specify the used database with --with-database=...

  6. Janos SUTO repo owner

    for the master branch I usually use the following:

    ./configure --localstatedir=/var --enable-starttls --enable-tcpwrappers --with-database=mysql

    It executes properly on all my systems.

  7. datapharmer reporter

    well that worked for the build.... not sure what was different

    Anyhow... pilertest found the message id and on running pilerimport I got:

    root@abcmp1:/var/mail# pilerimport -e /var/mail/challenge.msg.mime
    processing: /var/mail/challenge.msg.mime
    duplicate: /var/mail/challenge.msg.mime (id: 4000000052165d9b251d1b3400f35fd202e5)

    If I run it again, I get a different ID:

    root@abcmp1:/var/mail# pilerimport -e /var/mail/challenge.msg.mime
    processing: /var/mail/challenge.msg.mime
    duplicate: /var/mail/challenge.msg.mime (id: 4000000052165ebd0fb7569c00a98090576d)

    I don't know if that is normal or not... The duplicated messages counter in the web admin now also goes up for this message, but if I try to search for it, it still doesn't appear. I've searched for the from address, the to address, a single word in the subject, and even just the date it was sent (I even checked a 3 day range, from 1 day before through 1 day after). I'm lost as to why it won't show up!
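A possible reading of the changing id, offered as an assumption and not confirmed anywhere in this thread: both ids start with 40000000 followed by eight hex digits, which looks like a TAI64-style label (2^62 plus a Unix timestamp). If so, the id would be generated fresh for each import attempt and would not be the id of the message already stored. The helper name `decode_id_prefix` below is hypothetical, and the sketch ignores the small TAI/UTC leap-second offset.

```python
from datetime import datetime, timezone

# Assumption: the first 16 hex digits of the printed id are a TAI64-style
# label, i.e. 2**62 plus seconds since the Unix epoch. Not confirmed by piler docs.
TAI64_BASE = 1 << 62


def decode_id_prefix(piler_id: str) -> datetime:
    """Interpret the leading 16 hex digits of an id as a UTC timestamp."""
    label = int(piler_id[:16], 16)
    seconds = label - TAI64_BASE
    return datetime.fromtimestamp(seconds, tz=timezone.utc)


if __name__ == "__main__":
    first = decode_id_prefix("4000000052165d9b251d1b3400f35fd202e5")
    second = decode_id_prefix("4000000052165ebd0fb7569c00a98090576d")
    print(first.isoformat())
    print(second.isoformat())
    print((second - first).total_seconds(), "seconds apart")
```

Under that assumption the two ids above decode to times on 22 August 2013 that are 290 seconds apart, which would fit two consecutive manual runs of pilerimport.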

  8. Janos SUTO repo owner

    Well, I'm a bit blind here, since I can't see what you can. However, try searching for only the subject.

    Prepend the "subject:" keyword, and write a few words from the subject. If you picked a message that is not one in a million, then it should appear in the search results.

    Also check ls -la /var/piler/sphinx for file sizes. On Debian / Ubuntu, if you enabled sphinx in the /etc/default/* files, then /etc/cron.d/sphinxsearch runs the indexer once a day, practically destroying any previous sphinx index data.

  9. datapharmer reporter

    Yes, I tried searching the subject and there are definitely numerous messages not appearing in the index. The output from ls is as follows:

    root@abcmp1:/var/mail# ls -la /var/piler/sphinx
    total 172476
    drwxr-xr-x 2 piler piler 4096 Aug 22 17:35 .
    drwxr-xr-x 7 piler piler 4096 Aug 19 11:54 ..
    -rw-r--r-- 1 piler piler 56 Aug 22 17:05 dailydelta1.spa
    -rw-r--r-- 1 piler piler 1 Aug 22 17:05 dailydelta1.spd
    -rw-r--r-- 1 piler piler 582 Aug 22 17:05 dailydelta1.sph
    -rw-r--r-- 1 piler piler 1 Aug 22 17:05 dailydelta1.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 17:05 dailydelta1.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 17:05 dailydelta1.spm
    -rw-r--r-- 1 piler piler 1 Aug 22 17:05 dailydelta1.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 17:05 dailydelta1.sps
    -rw-r--r-- 1 piler piler 0 Aug 22 17:35 delta1.new.spa
    -rw-r--r-- 1 piler piler 1 Aug 22 17:35 delta1.new.spd
    -rw-r--r-- 1 piler piler 582 Aug 22 17:35 delta1.new.sph
    -rw-r--r-- 1 piler piler 1 Aug 22 17:35 delta1.new.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 17:35 delta1.new.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 17:35 delta1.new.spm
    -rw-r--r-- 1 piler piler 1 Aug 22 17:35 delta1.new.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 17:35 delta1.new.sps
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 delta1.spa
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 delta1.spd
    -rw-r--r-- 1 piler piler 582 Aug 22 17:30 delta1.sph
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 delta1.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 delta1.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 delta1.spm
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 delta1.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 delta1.sps
    -rw-r--r-- 1 piler piler 191940 Aug 22 02:30 main1.spa
    -rw-r--r-- 1 piler piler 115478656 Aug 22 02:30 main1.spd
    -rw-r--r-- 1 piler piler 582 Aug 22 02:30 main1.sph
    -rw-r--r-- 1 piler piler 41230233 Aug 22 02:30 main1.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 02:30 main1.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 02:30 main1.spm
    -rw-r--r-- 1 piler piler 19522222 Aug 22 02:30 main1.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 02:30 main1.sps
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main2.spa
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main2.spd
    -rw-r--r-- 1 piler piler 582 Aug 19 14:06 main2.sph
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main2.spi
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main2.spk
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main2.spm
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main2.spp
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main2.sps
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main3.spa
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main3.spd
    -rw-r--r-- 1 piler piler 582 Aug 19 14:06 main3.sph
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main3.spi
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main3.spk
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main3.spm
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main3.spp
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main3.sps
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main4.spa
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main4.spd
    -rw-r--r-- 1 piler piler 582 Aug 19 14:06 main4.sph
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main4.spi
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main4.spk
    -rw-r--r-- 1 piler piler 0 Aug 19 14:06 main4.spm
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main4.spp
    -rw-r--r-- 1 piler piler 1 Aug 19 14:06 main4.sps
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 note1.spa
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 note1.spd
    -rw-r--r-- 1 piler piler 289 Aug 22 17:30 note1.sph
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 note1.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 note1.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 note1.spm
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 note1.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 note1.sps
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 tag1.spa
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 tag1.spd
    -rw-r--r-- 1 piler piler 288 Aug 22 17:30 tag1.sph
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 tag1.spi
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 tag1.spk
    -rw-r--r-- 1 piler piler 0 Aug 22 17:30 tag1.spm
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 tag1.spp
    -rw-r--r-- 1 piler piler 1 Aug 22 17:30 tag1.sps

    I'm going to go ahead and reindex everything overnight and will see what that gives me...
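The listing is easier to read when file sizes are summed per index. The following is a minimal sketch of such a summary, not part of piler: the function names and the 1024-byte threshold are hypothetical, chosen only because a header-only index in the listing above adds up to a few hundred bytes.

```python
from collections import defaultdict
from pathlib import Path


def index_sizes(directory):
    """Sum the sizes of the .sp* files in a directory, grouped by index name."""
    totals = defaultdict(int)
    for path in Path(directory).iterdir():
        # Sphinx index parts in the listing end in .spa, .spd, .sph, and so on.
        if path.is_file() and path.suffix.startswith(".sp"):
            totals[path.stem] += path.stat().st_size
    return dict(totals)


def nearly_empty(totals, threshold=1024):
    """Return the names of indexes whose files total less than threshold bytes."""
    return sorted(name for name, size in totals.items() if size < threshold)


if __name__ == "__main__":
    totals = index_sizes("/var/piler/sphinx")
    for name in sorted(totals):
        print(f"{name:15} {totals[name]:>12} bytes")
    print("nearly empty:", ", ".join(nearly_empty(totals)))
```

Applied to the listing above, only main1 would hold a substantial amount of data; the delta indexes would be reported as nearly empty.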

  10. datapharmer reporter

    After running reindex -a the messages are appearing. I'm going to import another pst and see whether the problem was happenstance or whether I've got some sort of indexing problem, but it appears to be sphinx related.

  11. Janos SUTO repo owner

    Please make sure to check /etc/default/sphinxsearch. It should have START="no". You may also want to disable /etc/cron.d/sphinxsearch, to make absolutely sure it cannot run and destroy the piler related index files.
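The START check can be scripted. This is a minimal sketch that assumes the sphinxsearch defaults file uses plain shell-style VAR=value lines; the function name `start_setting` is hypothetical, and the file path in the usage section is passed in by the caller, so adjust it to your system.

```python
import re
import sys

# Matches an uncommented shell-style assignment such as START="no" or START=yes.
START_LINE = re.compile(r"""^\s*START=["']?([^"'\s#]*)""")


def start_setting(text):
    """Return the value of the last uncommented START= line, or None if absent."""
    value = None
    for line in text.splitlines():
        match = START_LINE.match(line)
        if match:
            value = match.group(1).lower()
    return value


if __name__ == "__main__":
    # Usage: python check_start.py <path to the sphinxsearch defaults file>
    with open(sys.argv[1], encoding="utf-8") as handle:
        setting = start_setting(handle.read())
    if setting == "no":
        print("OK: the packaged sphinx service is disabled")
    else:
        print(f"WARNING: START is {setting!r}, expected 'no'")
```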
