A couple of questions

Issue #487 resolved
Craig Jackson created an issue

Hi, firstly, thank you ever so much for this excellent mail archive solution. I have been experimenting with it for the last few days, after installing it on a CentOS 6.4 VM. It seems to work very well; however, I have a couple of questions I wonder if you could help with.

I am not an expert Linux or DB admin, more novice / intermediate, so apologies if some of my issues are of a very basic nature!

There are two things I am a little confused by. Firstly, I don't quite understand how the search indexing works. I had thought that the regular indexer.delta.sh script would only need to index mails which had been added since the last time it ran (I assumed it used the email index IDs to know which mails it had not seen). However, I find today that even if I run indexer.delta.sh and then immediately run it again, it takes the same length of time (around 2.5 minutes). Is this normal?

I can see from 'select count(*) from sph_index' that newly added mails are there (e.g. if I add 100 mails then this value is 100) and that this value goes to 0 after running indexer.delta.sh.

Secondly, today I added a bunch of mail from an mbox file which is created by postfix on my mail server using always_bcc. Basically, I copy this file off to another location at the end of each day and date/timestamp it. I then keep these as 'original' versions of all email, which I can use to recreate my archive if required in the future (I was using a hypermail archive previously). So I added yesterday's mbox format 'bcc' file, which should have all of yesterday's mail in it, with pilerimport. That was fine. I then manually copied off the bcc mbox file for today so far and again used pilerimport to import it. I was surprised to see that out of 50 emails, there were 11 duplicates (I could see this in the output of pilerimport).

How can there be duplicate mails in this situation? Is this some form of deduplication relating to a conversation thread?

Many thanks in advance for your reply. Regards Craig

Comments (6)

  1. Craig Jackson reporter

    OK, so probably cancel question number 2. I started fresh and the first thing I did was import the mbox file that I had seen all those duplicates on. Even with an empty archive it happened again, so I guess it really is duplicate emails, maybe something I did wrong when I took a copy of the bcc mbox file.

    Question one about the searching still stands though! AND.... Can I add one more question - is there any way to reset the duplicate statistic? It is useful, however once I know about it, it would be nice to reset it, as it makes it look like the archive is full of duplicate mail (which I know it isn't of course, as piler didn't import the duplicate)

  2. Janos SUTO repo owner

    I'm not sure any of these is a minor bug; the usual place for such questions would be the mailing list.

    The index script has some sleep commands, and since the sphinx indexer is pretty fast, it's normal to see the same elapsed time. The deduplication you saw is based on the Message-id. If two or more messages share the same Message-id, then they are treated as duplicates, and after storing the first, the rest can be discarded.

    If you prefer, you may reset the deduplicated counter, see the counter table. Anyway it's pretty common that a single message is stored (on the mail server, not on piler) in several copies if it has multiple recipients.
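To see how many messages in an mbox file repeat a Message-id before importing it, something like the following rough sketch could help. It uses only Python's standard `mailbox` module; the function name `count_duplicate_message_ids` is made up for this illustration, and it only mimics the Message-id comparison described above, it is not piler's own code.

```python
import mailbox


def count_duplicate_message_ids(mbox_path):
    """Return (total messages, messages whose Message-ID was already seen)."""
    seen = set()
    total = 0
    duplicates = 0
    # create=False: fail instead of silently creating an empty mbox file
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            total += 1
            message_id = str(message.get("Message-ID", "")).strip()
            if not message_id:
                continue  # no header to compare, so never counted as duplicate
            if message_id in seen:
                duplicates += 1
            else:
                seen.add(message_id)
    finally:
        box.close()
    return total, duplicates
```

Calling `count_duplicate_message_ids("bcc.mbox")` (a hypothetical file name) returns the number of messages read and how many of them repeated an earlier Message-id, which can be compared with the duplicate count that pilerimport prints.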

  3. Craig Jackson reporter

    Thanks. Apologies for flagging this as a bug; I wasn't suggesting it was one, I just didn't know what to flag it as ;-)

    This was the only place I could see to discuss piler (it came up in hits when searching for more info). Is there an appropriate discussion forum then?

    So the apparent time that the indexer runs for (if run manually) does not really relate to the actual work it is doing? Am I right that the indexer only needs to index new mails (mails added since its last run) each time it runs?

    Thanks for the excellent product and quick help :-)

  4. Janos SUTO repo owner

    I suggest using the mailing list. Anyway, yes, you are right, the delta indexer just processes new emails. If you have tens of thousands of new emails, then the run time correlates better with the actual work. Usually it's not something to worry about.
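The "only new emails" behaviour can be pictured with a toy model, which is an assumption for illustration only and not how piler or sphinx is implemented. A hypothetical `pending` table stands in for the queue of not-yet-indexed mails (the role sph_index appears to play in this thread), and the made-up function `run_delta_index` handles each queued row once and then clears it, so an immediate second run finds nothing to do.

```python
import sqlite3


def run_delta_index(conn, index_message):
    """Index every queued row once, then remove it; return rows handled."""
    rows = conn.execute("SELECT id, body FROM pending ORDER BY id").fetchall()
    for row_id, body in rows:
        index_message(row_id, body)
    if rows:
        # Delete only what was just processed, not rows queued meanwhile
        conn.execute("DELETE FROM pending WHERE id <= ?", (rows[-1][0],))
        conn.commit()
    return len(rows)


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pending (id INTEGER PRIMARY KEY, body TEXT)")
conn.executemany(
    "INSERT INTO pending (body) VALUES (?)",
    [("first mail",), ("second mail",), ("third mail",)],
)

indexed = []
first = run_delta_index(conn, lambda row_id, body: indexed.append(row_id))
second = run_delta_index(conn, lambda row_id, body: indexed.append(row_id))
print(first, second)  # 3 0
```

In this model the second run does no indexing at all, so any wall-clock time it still takes comes from fixed overhead such as the sleep commands mentioned earlier in the thread.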
