PilerImport STATISTICS

Issue #1293 resolved
Jáder Marasca created an issue

I’d love to see import statistics: total size in MB and number of messages.

I have a client with 2 big mail accounts (97 GB and 33 GB) and several (dozens!) of small ones, below 5 GB.

When I start importing the big one, I’d like to get a file with the IMAP account name and 3 numbers:

  1. time to import that account
  2. original size of all messages
  3. number of imported messages / total messages (the difference is the duplicates).

Each import would generate a .statistics file with that info.

So on my second import (33 GB) most of the e-mails would be duplicates, and pilerimport would report them. Those numbers would help to justify the money to buy a piler enterprise license!

In the end I’d have saved so much space that the price of the TBs alone would pay for the license!

Comments (10)

  1. Janos SUTO repo owner

    Well, importing via a mailbox is possible; however, processing it naively would be very inefficient, because by default piler downloads all emails, and you’ll end up with more and more duplicated messages. So I’d suggest using the imapfetch.py utility, where you may define SINCE 24-May-2023 to download only today’s emails, and then letting pilerimport process the downloaded messages. Either way, the first run will take a very long time to fetch ~100 GB of emails.
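
A minimal sketch of how a date criterion like the one above can be built, assuming only the IMAP date syntax (day, English month abbreviation, four-digit year). The helper name `imap_since` is hypothetical, and how imapfetch.py actually accepts the criterion is not shown in this thread, so check the utility itself before relying on this.

```python
from datetime import date

# Month abbreviations are fixed by the IMAP date syntax, so they are spelled
# out here instead of relying on the locale-dependent strftime("%b").
MONTHS = ("Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")


def imap_since(day: date) -> str:
    """Return an IMAP search criterion such as 'SINCE 24-May-2023'."""
    return f"SINCE {day.day}-{MONTHS[day.month - 1]}-{day.year}"


print(imap_since(date(2023, 5, 24)))  # SINCE 24-May-2023
```

Passing `date.today()` instead of a fixed date would give the "only today’s emails" criterion for a daily run.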

    Anyway, such stats are possible; perhaps pilerimport could syslog them.

  2. Jáder Marasca reporter

    Sorry I do not understand!

    I’m importing e-mail from an old IMAP server (too many files/e-mails for so little storage on the new server), so I cannot use imapfetch.py to download them and process them locally. That’s why I’m using IMAP.

    I’d like to understand: are “duplicated e-mails” imported? I can see 2 numbers in the statistics: processed and duplicated.

    I think pilerimport will download the e-mail again and show as duplicated but will not store them. Am I right?

    I don’t mind duplicates being counted again if I run the whole process twice.

    Over a gigabit network I imported 300 GB of e-mail via IMAP in a day (on a weekend!) with no extra overhead on the server.

    For now I’ll archive all e-mail from before 2023; after verifying the archive, I’ll delete those messages from the old server and then import the e-mail to the new server.

    Later I’ll arrange for all incoming and outgoing messages to be copied to the new server.

    In early 2024 I’ll store a new backup copy of the e-mails (including 2023) and erase them from the server. From that point forward, all e-mails will be on both sides (server and piler).

  3. Janos SUTO repo owner

    “I think pilerimport will download the e-mail again and show as duplicated but will not store them. Am I right?”

    Yes, correct.

  4. Janos SUTO repo owner

    Please try this commit. I think it’s better to syslog the results than to write it to a file.

  5. Janos SUTO repo owner

    Just ran my pipeline, and it syslogs something like below, so I’ll merge this commit to master.

    May 28 16:32:17 49a491284491 pilerimport[632]: imported=47, duplicated=3, discarded=0
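
For reporting, a line like the one above can be turned into numbers with a small script. This is a minimal sketch: the pattern is inferred from that single sample line, not from a documented log format, and the function name `parse_counters` is hypothetical.

```python
import re

# Pattern inferred from the one sample line in this thread (an assumption,
# not a documented format).
COUNTERS = re.compile(r"imported=(\d+), duplicated=(\d+), discarded=(\d+)")


def parse_counters(line: str):
    """Return the three counters as a dict, or None if the line does not match."""
    match = COUNTERS.search(line)
    if match is None:
        return None
    imported, duplicated, discarded = (int(value) for value in match.groups())
    return {"imported": imported, "duplicated": duplicated, "discarded": discarded}


sample = ("May 28 16:32:17 49a491284491 pilerimport[632]: "
          "imported=47, duplicated=3, discarded=0")
print(parse_counters(sample))
# {'imported': 47, 'duplicated': 3, 'discarded': 0}
```

Lines from other programs simply return `None`, so the function can be applied to a whole syslog file.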
    

  6. Jáder Marasca reporter

    Hmm… that’s nice, but could we have more info about WHAT ACCOUNT / DIRECTORY is being imported? I cannot tell that from the [632] information. I know… I’m a pain in the ass! :D

  7. Janos SUTO repo owner

    Edit pilerimport.c, change the syslog line at the end to this, and let’s see if it meets your needs:

    syslog(LOG_PRIORITY, "server=%s, user=%s, directory=%s, imported=%lld, duplicated=%lld, discarded=%lld", data.import->server, data.import->username, directory, counters.c_rcvd, counters.c_duplicate, counters.c_ignore);
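
With server, user and directory in the message, a sysadmin could sum the counters per account. The sketch below assumes the log lines follow the format string above exactly, with the usual `pilerimport[pid]: ` syslog prefix; the sample lines, host names and the function name `totals_per_account` are hypothetical, not real output.

```python
import re
from collections import defaultdict

# key=value pairs separated by ", ", as in the format string above.
PAIR = re.compile(r"(\w+)=([^,]*)")
COUNTERS = ("imported", "duplicated", "discarded")


def totals_per_account(lines):
    """Sum the counters per (server, user) over all matching log lines."""
    totals = defaultdict(lambda: dict.fromkeys(COUNTERS, 0))
    for line in lines:
        # Keep only the message part after the "program[pid]: " prefix.
        _, sep, message = line.partition("]: ")
        if not sep:
            continue
        fields = dict(PAIR.findall(message))
        if "server" not in fields or "user" not in fields:
            continue  # not a line in the extended format
        account = (fields["server"], fields["user"])
        for name in COUNTERS:
            totals[account][name] += int(fields.get(name, 0))
    return dict(totals)


# Hypothetical sample lines for illustration only.
log = [
    "May 28 16:32:17 host pilerimport[632]: server=imap.example.com, user=alice, "
    "directory=/var/tmp/import, imported=47, duplicated=3, discarded=0",
    "May 28 16:40:02 host pilerimport[640]: server=imap.example.com, user=alice, "
    "directory=/var/tmp/import, imported=10, duplicated=90, discarded=1",
]
for account, counters in totals_per_account(log).items():
    print(account, counters)
```

Grouping on (server, user) is a design choice for per-account reports; adding the directory to the key would give per-folder totals instead.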
    

  8. Jáder Marasca reporter

    I don’t have the skills to compile anything… sorry.

    I’m only a sysadmin, and I’d like to suggest that you start the line with something like:

    “PilerImport v.99.99 running with parameters '-d xxxx' || directory=%s, username=%s, messages imported=%lld, messages duplicated=%lld, messages discarded=%lld”

    so that all the information is available for debugging and also for sysadmins to parse into reports.
