PilerImport STATISTICS
I’d love to see import statistics covering size in MB and number of messages.
I have a client with 2 big mail accounts (97GB and 33GB) and several (dozens!) of small ones, below 5GB.
When I import one of the big ones, I’d like to get a file named after the IMAP account containing 3 numbers:
- time to import that account
- original size of all messages
- number of imported messages / total messages (the delta is the duplicates).
Each import would generate a .statistics file with that info.
On my second import (33GB) most of the e-mails would be duplicates, and pilerimport would report them. Those numbers would help to justify the money to buy a piler enterprise license!
In the end I’d have saved so much space that the price of the TBs alone would pay for the license!
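To make the three requested numbers concrete, here is a minimal sketch in C of how they relate. The struct, its field names and the output format are hypothetical illustrations, not piler code; the only thing taken from the request above is that the duplicates are the delta between total and imported messages.

```c
#include <stdio.h>

/* Hypothetical record for one imported account; not a piler structure. */
struct import_stats {
    const char *account;      /* IMAP account name */
    long long elapsed_sec;    /* time to import the account */
    long long original_bytes; /* original size of all messages */
    long long imported;       /* messages actually stored */
    long long total;          /* messages seen on the IMAP server */
};

/* Duplicates are the delta between total and imported messages. */
long long duplicate_count(const struct import_stats *s)
{
    return s->total - s->imported;
}

/* Share of duplicates in percent; 0 when nothing was processed. */
double duplicate_percent(const struct import_stats *s)
{
    if (s->total <= 0) {
        return 0.0;
    }
    return 100.0 * (double)duplicate_count(s) / (double)s->total;
}

/* Write a one-line summary into buf; returns snprintf's result. */
int format_stats(char *buf, size_t len, const struct import_stats *s)
{
    return snprintf(buf, len,
        "account=%s, seconds=%lld, bytes=%lld, imported=%lld/%lld, duplicated=%lld",
        s->account, s->elapsed_sec, s->original_bytes,
        s->imported, s->total, duplicate_count(s));
}
```

A line like this per account could be written to a .statistics file or sent to syslog; which destination is used does not change the arithmetic.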
Comments (10)
-
repo owner -
reporter Sorry, I do not understand!
I’m importing e-mail from an old IMAP server (too many files/e-mails for the little storage on the new server), so I cannot use imapfetch.py to download them and process them locally. That’s why I’m using IMAP.
I’d like to understand: are “duplicated e-mails” imported? I can see 2 numbers in the statistics: processed and duplicated.
I think pilerimport will download the e-mail again and show as duplicated but will not store them. Am I right?
I don’t care if duplicates are counted again when I run the whole process twice.
Using a gigabit network I’ve imported 300GB of e-mail via IMAP in a day (on a weekend!) with no extra overhead on the server.
I’ll store all e-mail from before 2023 for now, and after verifying it I’ll delete it from the old server and then import the e-mail to the new server.
Later I’ll set up all incoming/outgoing messages to be copied to the new server.
In early 2024 I’ll store a new backup copy of the e-mails (including 2023) and erase them from the server. From that point forward all e-mails will be on both sides (server and piler).
-
repo owner I think pilerimport will download the e-mail again and show as duplicated but will not store them. Am I right?
Yes, correct.
-
repo owner - changed status to resolved
-
repo owner Please try this commit. I think it’s better to syslog the results than to write them to a file.
-
repo owner Just ran my pipeline, and it syslogs something like below, so I’ll merge this commit to master.
May 28 16:32:17 49a491284491 pilerimport[632]: imported=47, duplicated=3, discarded=0
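For anyone who wants to turn such lines into a report, here is a minimal C sketch that pulls the three counters out of a syslog line of the shape shown above. The function and struct names are hypothetical; only the `imported=…, duplicated=…, discarded=…` layout is taken from the sample line.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical container for the three counters in the log line. */
struct import_counters {
    long long imported;
    long long duplicated;
    long long discarded;
};

/* Returns 1 if all three counters were found and parsed, 0 otherwise. */
int parse_counters(const char *line, struct import_counters *c)
{
    /* Skip the timestamp, host and "pilerimport[pid]:" prefix. */
    const char *p = strstr(line, "imported=");
    if (p == NULL) {
        return 0;
    }
    return sscanf(p, "imported=%lld, duplicated=%lld, discarded=%lld",
                  &c->imported, &c->duplicated, &c->discarded) == 3;
}
```

Summing the parsed values over all lines of an import run gives per-run totals without touching pilerimport itself.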
-
reporter Hmm… that’s nice, but could we have more info about WHAT ACCOUNT / DIRECTORY is being imported? I cannot tell that from the [632] information. I know… I’m a pain in the ass! :D
-
repo owner Edit pilerimport.c, and fix the syslog line at the end to this, and let’s see if it meets your needs:
syslog(LOG_PRIORITY, "server=%s, user=%s, directory=%s, imported=%lld, duplicated=%lld, discarded=%lld", data.import->server, data.import->username, directory, counters.c_rcvd, counters.c_duplicate, counters.c_ignore);
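To preview what that line would look like without rebuilding piler, the same format string can be rendered into a buffer. This is only a sketch: the helper name, its parameters and the empty-string fallback for missing values are assumptions for illustration, not part of pilerimport.c.

```c
#include <stdio.h>

/* Hypothetical helper: renders the format string from the syslog call
 * above into buf. NULL strings are replaced by "" (an assumption made
 * here so that "%s" never receives a NULL pointer). */
int format_import_line(char *buf, size_t len,
                       const char *server, const char *user,
                       const char *directory,
                       long long imported, long long duplicated,
                       long long discarded)
{
    return snprintf(buf, len,
        "server=%s, user=%s, directory=%s, imported=%lld, duplicated=%lld, discarded=%lld",
        server ? server : "", user ? user : "",
        directory ? directory : "",
        imported, duplicated, discarded);
}
```

With an IMAP import the directory field would presumably be empty, and with a directory import the server and user fields would be.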
-
reporter I have no skill to compile anything… sorry.
I’m only a sysadmin, and I’d like to suggest that you start the line with something like:
“PilerImport v.99.99 running with parameters '-d xxxx' || directory=%s, username=%s, messages imported=%lld, messages duplicated=%lld, messages discarded=%lld”
so all information is available for debugging and also for sysadmins to parse into reports.
-
repo owner Then how did you install piler? :-) Anyway, that commit would do it. Case is closed.
Well, importing via a mailbox is possible; however, processing it naively would be very inefficient, because by default piler downloads all emails, and you’ll end up with more and more duplicated messages. So I’d suggest using the imapfetch.py utility, where you may define SINCE 24-May-2023 to download only today’s emails, and then let pilerimport process the downloaded messages. Either way, the first run will take a very long time to get ~100 GB of emails.
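The SINCE criterion uses the IMAP date form day-month-year with an English three-letter month, as in 24-May-2023. A minimal C sketch for building such a search key is below; the function name is hypothetical, and how the value is actually handed to imapfetch.py is not shown here.

```c
#include <stdio.h>

/* Hypothetical helper: writes an IMAP search key such as
 * "SINCE 24-May-2023" into buf. Month names are hardcoded because the
 * IMAP date format is English regardless of the system locale.
 * Returns 0 on success, -1 on invalid input or a too-small buffer. */
int imap_since_key(char *buf, size_t len, int year, int month, int day)
{
    static const char *const months[12] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    int n;

    if (year < 1 || year > 9999 || month < 1 || month > 12 ||
        day < 1 || day > 31) {
        return -1;
    }
    n = snprintf(buf, len, "SINCE %02d-%s-%04d",
                 day, months[month - 1], year);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

Generating the key from the current date on each run keeps the daily download limited to recent mail, so pilerimport has far fewer duplicates to skip.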
Anyway, such a stat is possible; perhaps pilerimport could syslog it.