Disable deduplication based on Message-ID

Issue #602 on hold
ercpe created an issue

Hi,

We are migrating from another archiving solution to Piler. After importing around 3 million mails, I see that a good percentage of them are missing. From what I can see, the pilerimport command discards mails as duplicates based on the Message-ID. However, it seems that Thunderbird generates "wrong" Message-IDs (or at least not unique enough), so many mails get ignored as duplicates even when they are not related at all and only happen to have been sent around the same time.

Now I'm looking for a way to either disable deduplication altogether or base it on other criteria. I found lines 247-258 in message.c and wonder if I can just remove them?

Do you see any other way around this? Changing the Message-ID for all existing mails and re-importing them may be possible but this won't work for future mail...

Comments (14)

  1. Janos SUTO repo owner

    You are free to remove lines from the source code, however I believe it's much better if we identify the root cause of the problem. If Thunderbird assigns the same message-id to more than one message, then it's obviously flawed. So for starters let's verify that there are at least 2 messages with the very same message-id. Normally MUAs incorporate the timestamp to prevent such errors, and I think Thunderbird does the same.
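
    One way to run that check is sketched below. It is a minimal sketch, assuming the Message-ID values have already been extracted from the mail headers into an array of strings (the extraction itself is not shown). count_duplicated_ids is a hypothetical helper, not part of piler.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* qsort comparator for an array of C strings. */
static int cmp_str(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

/*
 * Count how many distinct Message-IDs occur more than once.
 * The array is sorted in place so equal IDs end up next to each other.
 * Hypothetical helper, not part of piler.
 */
static size_t count_duplicated_ids(const char **ids, size_t n)
{
    size_t dups = 0;
    size_t i = 0;

    assert(ids != NULL || n == 0);
    if (n > 0)
        qsort(ids, n, sizeof(ids[0]), cmp_str);

    while (i < n) {
        size_t run = 1;

        /* Measure the run of identical IDs starting at position i. */
        while (i + run < n && strcmp(ids[i], ids[i + run]) == 0)
            run++;
        if (run > 1)
            dups++;
        i += run;
    }
    return dups;
}
```

    A non-zero result only shows that some Message-ID is shared. Whether the mails behind it are true copies or unrelated messages still has to be checked by comparing them.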

  2. ercpe reporter

    This problem is only partially related to Thunderbird's Message-ID generation. Thunderbird uses the format xxx.yyy@maildomain where xxx is the timestamp in hex and yyy is a salt of some kind. IMHO using maildomain as the last part makes the IDs less unique on busy mail domains (I've seen other applications which use the same logic).

    Across all of our mails I found several examples of duplicated Message-IDs. Often these Message-IDs were copied when an application answers the mail (e.g. an autoresponder) or when an application sends or handles mails (e.g. the same Message-ID is used for notification emails which are sent to multiple recipients in separate mails). Additionally I found a few mails from mail clients (TB in this case) which have the same Message-ID and are totally unrelated. While all these problems are clearly in third-party software, they cause mails to be dropped when imported into mailpiler.

    Removing the above lines would actually cause many more problems: from what I see, the hash of the message id is used as a unique identifier (e.g. for the data filename). Without the current duplicate check the message content would be overwritten with a (possibly different) mail. So this should not be done without also changing the fields from which the "unique id" of a mail inside piler is generated and by which the metadata is fetched.

  3. Janos SUTO repo owner

    OK, thanks for the clarification. I think adding the domain to the message-id is not a problem as long as xxx and yyy are created properly. Usually MUAs add the message-id of the message you reply to to the References: header, and create a new message-id for the new message.

    Regarding the fix, it depends greatly on whether the problem affects only legacy emails or new emails as well. If new emails are fine, then I suggest writing a script that fixes the message-ids before importing. If even new emails are affected, then the mailer application should be fixed, if that's possible.

    If neither of these works, then we may change piler to disable the duplicate detection en bloc (disabling lines 247 to 283 in message.c shouldn't be a problem, but we have to test that everything works properly after the modification). However, it will surely come at a price: you'll end up with multiple copies of the same emails in the archive.

    So let me know what options we have with this issue, then we can decide what's next.

  4. eXtremeSHOK

    What if there were an option to also use an md5sum/hash of the email body and/or header?

    This could be an extra optional check to prevent duplication issues.

    With IMAP merging we usually compare both the message id and the body when checking for duplicates.

  5. Janos SUTO repo owner

    Not sure. The same body may belong to a different message, and a message to 3 recipients results in 3 variants of the sent header.

  6. ercpe reporter

    I would propose to use sender, recipient and message-id as the values for the unique id.

    This covers the following cases:

    • From: foo@somecompany, To: userA@example.com, Message-ID: 123 -> Unique
    • From: foo@somecompany, To: userA@example.com, userB@example.com, Message-ID: 456 -> Unique

    and

    • From: noreply@somewhere, To: userA@example.com, Message-ID: 789 -> Unique
    • From: noreply@somewhere, To: userB@example.com, Message-ID: 789 -> Unique

    Note: this does not cover all cases, so some mails may still get removed as duplicates. But this is the most we can do to work around buggy software.

    As far as I can see, this would require the following changes:

    • Building the message_id_hash based on from, to, mid
    • Storing this unique id as a separate column (the message_id column must still be used for looking up mails based on the References header)
    • Querying metadata from the database based on this id and not based on the Message-ID
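
    The first step could look roughly like the sketch below. It assumes unfolded header values (so a newline can serve as a separator) and uses FNV-1a purely as a stand-in for whatever digest piler actually applies. build_dedup_key and fnv1a64 are hypothetical names, not piler functions.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/*
 * Join sender, recipient list and Message-ID into one key.
 * The '\n' separator keeps ("ab", "c") and ("a", "bc") from
 * collapsing into the same string.
 * Returns 0 on success, -1 if the key did not fit into out.
 */
static int build_dedup_key(char *out, size_t outlen,
                           const char *from, const char *to, const char *mid)
{
    int n;

    assert(out != NULL && from != NULL && to != NULL && mid != NULL);
    n = snprintf(out, outlen, "%s\n%s\n%s", from, to, mid);
    return (n < 0 || (size_t)n >= outlen) ? -1 : 0;
}

/* FNV-1a, 64 bit. Stand-in only: it is NOT collision resistant. */
static uint64_t fnv1a64(const char *s)
{
    uint64_t h = 0xcbf29ce484222325ULL;   /* FNV offset basis */

    while (*s) {
        h ^= (unsigned char)*s++;
        h *= 0x100000001b3ULL;            /* FNV prime */
    }
    return h;
}
```

    With this key, the two noreply@somewhere mails above (same Message-ID 789, different recipient) produce different strings and therefore different hash values, while a true copy of the same mail still maps to the same key.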

    For my current situation: we will probably ignore this for the moment and lose mails with the import and for the foreseeable future (I have to check with my boss, though). Our mail server logfiles for the last months show no Message-ID duplication, so it seems to occur for < 0.01 % of our current mail flow or to affect mostly old mails (we're importing from another solution whose data was based on very old EML files which had data corruption anyway).

  7. Janos SUTO repo owner

    @extremeshok: OK, I get it now.

    @ercpe: I'm reluctant to add a new column to the metadata table. Rather, I suggest tweaking the parser to concatenate the values of 'from', 'to' and 'message-id', then create a sha256 hash value, and use (and store) the hash value as the message-id. The default behaviour would be the usual message-id only check, and the new behaviour could be switched on by a config parameter in piler.conf.

    I have the following diff in my head:

    diff --git a/src/parser.c b/src/parser.c
    index 80fa320..07f8359 100644
    --- a/src/parser.c
    +++ b/src/parser.c
    @@ -100,7 +100,7 @@ struct _state parse_message(struct session_data *sdata, int take_into_pieces, st
    
     void post_parse(struct session_data *sdata, struct _state *state, struct __config *cfg){
        int i, len, rec=0;
    -   char *p;
    +   char *p, buf[MAXBUFSIZE];
    
        clearhash(state->boundaries);
        clearhash(state->rcpt);
    @@ -145,6 +145,9 @@ void post_parse(struct session_data *sdata, struct _state *state, struct __confi
        }
    
    
    +   snprintf(buf, sizeof(buf)-1, "%s%s%s", state->message_id, state->b_from, state->b_to);
    +   digest_string(buf, state->message_id);
    +
        digest_string(state->message_id, &(state->message_id_hash[0]));
    
     }
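
    The "flipped by a config parameter" part can be sketched as a standalone function. This is only an illustration under assumptions: struct dedup_cfg, its dedup_with_sender_rcpt field and digest_input are hypothetical names, and no such piler.conf option is confirmed anywhere in this thread. The sketch writes the combined value to a scratch buffer and leaves the original message-id string unmodified. Whether piler's storage would need further changes is outside its scope.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical config switch, not an existing piler.conf option. */
struct dedup_cfg {
    int dedup_with_sender_rcpt;   /* 0 = Message-ID only (default) */
};

/*
 * Return the string that should be fed to the digest function.
 * message_id itself is never modified; the combined value, if
 * requested, is written to scratch.
 * Returns NULL if scratch is too small for the combined value.
 */
static const char *digest_input(const struct dedup_cfg *cfg,
                                const char *message_id,
                                const char *from, const char *to,
                                char *scratch, size_t scratchlen)
{
    int n;

    assert(cfg != NULL && message_id != NULL);
    if (!cfg->dedup_with_sender_rcpt)
        return message_id;

    assert(from != NULL && to != NULL && scratch != NULL);
    /* snprintf NUL-terminates when scratchlen > 0, so the full size is passed. */
    n = snprintf(scratch, scratchlen, "%s\n%s\n%s", message_id, from, to);
    if (n < 0 || (size_t)n >= scratchlen)
        return NULL;              /* refuse to hash a truncated key */
    return scratch;
}
```

    Returning NULL on truncation is a deliberate choice in this sketch: a silently truncated key could make two different mails look identical again.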
    
  8. ercpe reporter

    Correct me if I'm wrong, but changing the message_id in state would cause the combined triple to be inserted into the message_id column in metadata. Wouldn't this break the lookup of reference IDs?

  9. Janos SUTO repo owner

    You are right, it would affect the gui search by reference feature (i.e. when you click on the [+] at the end of the subject in the gui). It seems that a new column is unavoidable. The question is whether you insist on this feature, or you can live with the current historic archive, provided that new emails are OK.

    If you insist on it, then I'd try to hide the feature as much as possible to prevent it from being turned on accidentally. Enabling it may require manually setting a macro (checked with #ifdef) in the Makefile or similar.

  10. ercpe reporter

    We will now live with the false duplicates as it affects only 0.1 to < 1.0 % of all mails for the last 5 years and we don't expect much more for future mails.

    However, I would still like to see this feature in a future version. As such, I have marked this ticket as an enhancement.
