Duplicate Mails after Import
Hi,
I have the problem that some imported mails are listed twice. The duplicates also have identical message IDs.
I have now run the following scripts manually, unfortunately without success:
indexer.main.sh
indexer.delta.sh
What could be the reason?
Piler Version: 1.3.11 build 1001
Sphinx: Sphinx 3.3.1 (commit b72d67b)
Comments (41)
-
repo owner -
repo owner Any update?
-
repo owner - changed status to closed
No news is good news.
-
reporter Sorry, I had no time to answer. Unfortunately, the error still exists after retesting.
-
repo owner Can you show me the arrived column as well for these messages?
Anyway, I was wrong. The table itself has no such constraint; instead, the piler code tries to prevent such duplication.
Also try setting mmap_dedup_test=1 in piler.conf, then restart the piler daemon.
-
repo owner - changed status to open
-
reporter Ok I will test the setting!
Here again with some more data:
-
repo owner OK, one more query please:
select id, arrived, piler_id, message_id from metadata where id>=2982 and id<=2987;
-
reporter
-
repo owner Thank you. I suggest enabling mmap_dedup_test=1 in piler.conf, then let’s see if you keep finding such duplicates from now on.
-
reporter Ok. I'll let you know!
-
reporter Now I'm confused. If I set mmap_dedup_test = 1, it no longer imports any email. If I set mmap_dedup_test = 0, everything is suddenly ok after a new import and no duplicates can be seen.
Why? :D
I cleaned up my installation before setting mmap_dedup_test=1.
-
repo owner Sorry, I was wrong. This feature is designed for the piler daemon. Now I understand that you are running pilerimport. However, I’m still confused. The pilerimport utility processes the emails sequentially, so I still don’t get how it fails to recognize already archived emails.
-
reporter Yes, I use pilerimport in a docker container.
I have now set everything up again, and unfortunately I have the same problem.
-
repo owner Assuming you have a few thousand emails to test with, let’s try one more thing. Import the emails one at a time, e.g.
for i in *.eml; do pilerimport -e "$i"; sleep 1; done
-
reporter Just for info, we import the mails via IMAP and pilerimport.
Example:
pilerimport -i imap.my-server.com -u imap-mail@my-server.com -p '<PASSWORD>' -P 993 -f <FOLDER_WITH_MAILS> -r
-
repo owner Pilerimport has an option to download only: -o
-
reporter Okay, when I only download the mails with pilerimport -o, I do not get .eml files but .txt files like "13303-imap-2033.txt".
-
repo owner They are fine.
-
reporter Yes but the command
pilerimport -e
works only with .eml files, or am I misunderstanding you?
-
repo owner Correct. However pilerimport writes eml files.
-
reporter Ok.
When the loop runs with the 1 second pause, there are no more duplications!
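The workaround confirmed here can be sketched as a small shell function. This is only an illustration: `IMPORT_CMD` and `import_one_by_one` are made-up names, the default command is the `pilerimport -e` call quoted earlier in this thread, and it assumes the files written by `pilerimport -o` can be fed to `-e` as the repo owner indicated.

```shell
#!/bin/sh
# Sketch: import already-downloaded messages one file at a time, pausing
# between files. IMPORT_CMD and import_one_by_one are illustrative names.
IMPORT_CMD=${IMPORT_CMD:-pilerimport -e}

import_one_by_one() {
    dir=$1
    delay=${2:-1}                      # pause in seconds, 1 as in the loop above
    count=0
    for f in "$dir"/*; do
        [ -f "$f" ] || continue        # skip subdirectories and an unmatched glob
        # $IMPORT_CMD is left unquoted on purpose so "pilerimport -e" splits into words
        $IMPORT_CMD "$f" || echo "import failed: $f" >&2
        count=$((count + 1))
        sleep "$delay"
    done
    echo "attempted: $count"
}
```

Calling `import_one_by_one /path/to/downloaded/mails 1` would mirror the one-second loop; the directory path is a placeholder.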
-
reporter @Janos SUTO Do you have any other ideas what we can do? Greetings :)
-
repo owner Well, not really, I’m afraid. I just can’t reproduce the issue. I have a similar test environment also in docker, running pilerimport, and it properly detects duplicates:
Cipher: TLS_AES_256_GCM_SHA384
List of IMAP folders:
=> '"INBOX" [\HasNoChildren]'
skipping => '"[Gmail]" [\HasChildren \Noselect]'
=> '"[Gmail]/All Mail" [\All \HasNoChildren]'
=> '"[Gmail]/Drafts" [\Drafts \HasNoChildren]'
=> '"[Gmail]/Important" [\HasNoChildren \Important]'
=> '"[Gmail]/Sent Mail" [\HasNoChildren \Sent]'
=> '"[Gmail]/Spam" [\HasNoChildren \Junk]'
=> '"[Gmail]/Starred" [\Flagged \HasNoChildren]'
=> '"[Gmail]/Trash" [\HasNoChildren \Trash]'
processing folder: "[Gmail]/Spam"... found 0 messages
processing folder: "[Gmail]/All Mail"... found 12 messages
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
processed: 12 [100%]
processing folder: "[Gmail]/Sent Mail"... found 0 messages
processing folder: "[Gmail]/Important"... found 9 messages
duplicate: 764-imap-13.txt (duplicate id: 2897)
duplicate: 764-imap-14.txt (duplicate id: 2898)
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't find trailer dictionary
Syntax Error: Couldn't read xref table
duplicate: 764-imap-15.txt (duplicate id: 2899)
duplicate: 764-imap-16.txt (duplicate id: 2902)
duplicate: 764-imap-17.txt (duplicate id: 2903)
duplicate: 764-imap-18.txt (duplicate id: 2905)
duplicate: 764-imap-19.txt (duplicate id: 2906)
duplicate: 764-imap-20.txt (duplicate id: 2907)
duplicate: 764-imap-21.txt (duplicate id: 2908)
-
repo owner Btw. I’m not sure if those 2-3000 emails are sensitive or not, but if it’s possible to give access to that mailbox, then I could test with it.
-
reporter OK. Unfortunately, giving access to our mailbox is not possible.
Then two other questions about the commands:
1. Is it possible to achieve throttling through the limit (-b, -s)?
2. Is there a debug mode we can run? Then we could send the output to you.
-
repo owner You may try tweaking the -b and -s options. You may also try setting verbosity=5, so pilerimport will log much more to syslog. Or perhaps I may add an option to pilerimport to wait a few milliseconds after importing each email.
-
reporter An option to always wait a few ms would of course be great. Maybe you can build it in. At least it would solve the problem.
-
repo owner OK, try this commit: https://bitbucket.org/jsuto/piler/commits/aab7b712d20c8885f66a17feb0a5aa4f9056d839
It introduces the -Z option to add a delay of <Z> milliseconds between importing each email. The value should be between 1 and 999. Let me know how it goes.
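As a rough illustration of how the new option might be used from a wrapper script, here is a minimal sketch that checks the delay against the 1 to 999 range stated above before calling pilerimport. The helper name `valid_delay_ms` is hypothetical and not part of piler.

```shell
#!/bin/sh
# Hypothetical pre-flight check: accept only whole numbers in 1..999,
# the range stated for -Z above.
valid_delay_ms() {
    case $1 in
        ''|*[!0-9]*) return 1 ;;       # empty, or contains a non-digit
    esac
    [ "$1" -ge 1 ] && [ "$1" -le 999 ]
}

# Assumed usage, reusing the IMAP flags from the earlier example in this thread:
#   valid_delay_ms 50 && pilerimport -i imap.my-server.com -u imap-mail@my-server.com \
#       -p '<PASSWORD>' -P 993 -f <FOLDER_WITH_MAILS> -r -Z 50
```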
-
reporter Thank you! It works correctly with the -Z parameter.
-
repo owner - changed status to resolved
Great. How many milliseconds of delay did the trick?
-
reporter I used 50ms but maybe you can set it lower.
-
reporter Sorry, I have to reopen this.
Now that I have set the -Z parameter, it suddenly imports only 1000 of 2000 mails. I tried again and also changed the -Z value, but unfortunately without success.
-
repo owner How would adding a small delay before importing each email cause half of the emails not to be imported?
-
reporter That's a good question... I'll import again today without the Z parameter and see if the 2000 mails are imported again.
-
reporter Hey, I ran it without the -Z parameter and only 1008 of 2058 mails were imported. My old Docker container was based on "piler-1.3.11.tar.gz"; with it the import worked and all 2058 mails were imported. Now I pull the source code directly in the Docker project. Has anything changed in the source?
-
repo owner Yes, since the project is developing, there are changes. However, I don’t think there is anything that would explain why only half of the emails are imported. I’m not sure if half of your emails are duplicates. Anyway, I can’t help you unless you provide me with a similar mailbox with roughly the same number of emails and some duplicates, so I can see the issue for myself. If you can’t or won’t create it for me, then good luck finding the solution yourself.
-
reporter We don't blame you and are happy and grateful that you are developing this project further!
Before we think about how we can grant you access to our e-mails in a privacy-compliant manner, we would still like to briefly explain what we have done now.
- In the MySQL database, all 2000 mails can be seen in the metadata table
- However, only the 1000 mails mentioned can be viewed in the Sphinx database
Do you have another idea how we can debug this? The version from the download "piler-1.3.11.tar.gz" works without problems, and it is strange that we are the only ones who have this problem. Compared to the old working version, we have changed nothing except the source code.
-
reporter We have now created another email account for you with one mail that is not imported. This mail is processed but in the end is not displayed in the Mailpiler client administration. Maybe you can see why this happens?
The mail has the message_id "<1593837997.519141283248938032.JavaMail.support@geotrust.com>".
We copy the mail with the webmailer to the folder "mailarchiv", and after this we run the import command.
<censored content>
-
repo owner I can’t see any email using this account. Btw. feel free to edit your previous message, and remove the password. You know, it’s a publicly accessible issue tracker.
-
reporter Hello, there is a mail in the inbox that is not displayed by Piler. And yes, that's OK, I will change the password of the account later. I created it only as a test for you.
-
How is that possible? The metadata table has a constraint that prevents the same message-id from appearing more than once.
So take a message-id that has multiple emails associated with it in the search results, and verify that this message-id is present in multiple rows of the metadata table.
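One way to run this check is sketched below in shell. It assumes the `mysql` client can reach the piler database; the helper name `dup_query`, the database name, and the credentials are placeholders, while the `metadata` table and `message_id` column are the ones used in the query earlier in this thread.

```shell
#!/bin/sh
# Hypothetical helper: prints a query that lists every message_id stored
# in more than one row of the metadata table.
dup_query() {
    printf '%s\n' "select message_id, count(*) as n from metadata group by message_id having count(*) > 1;"
}

# Assumed usage (database name and user are placeholders):
#   dup_query | mysql -u piler -p piler
```

An empty result would mean the duplicates seen in the search results are not present as separate rows in the metadata table.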