Wrong Space Projection calculation
The average message size is calculated incorrectly: the total message size is divided by the number of received messages rather than by the total number of messages. Example:
Total space consumption: 2400 MB
Total messages: 34215
Received messages: 850
2400 / 850 = 2.8 MB
But the total number of messages is 34215, so the correct calculation is 2400 / 34215 (about 0.07 MB per message)
or
the size of the 850 received messages divided by 850.
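The arithmetic behind the report can be checked with a short sketch. It uses only the three figures quoted above; the variable names are illustrative and are not taken from the Piler code.

```python
# Figures quoted in the report above.
total_space_mb = 2400      # total space consumption in MB
total_messages = 34215     # all messages in the archive
received_messages = 850    # messages counted as "received"

# What the health page shows: total size divided by the received count.
reported_avg_mb = total_space_mb / received_messages

# What the reporter expects: total size divided by the total count.
expected_avg_mb = total_space_mb / total_messages

print(f"reported: {reported_avg_mb:.2f} MB per message")
print(f"expected: {expected_avg_mb:.3f} MB per message")
```

The two results differ by the ratio of total to received messages, which is why the displayed average looks far too large.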
Cumulative Counts
Processed emails: 115 (24 hours), 34215 (1 week), 34215 (30 days)
Message Disposition
Received messages: 850
Infected messages: 0
Duplicated messages: 185
Ignored messages: 0
Space Projection
Average Messages per Day: 4888
*Average Message + Metadata + Index Size: 2.8 M + 74 k + 0.3 M*
Average Size per Day: 15342.5 M
"/var" Partition Projected to be Full in: 0 years, 0 months, 0 days
Usage Trend: Increasing
Comments (8)
-
-
reporter Hello Remi, in my experience the average mail volume increases every year. So for a projection it is very important how many mails were stored in the archive during the last 3 months. The average mail size is also important. In my case it is shown as a few MB, which is definitely wrong.
As far as I can see, the size column exists in the MySQL metadata and attachment tables.
Also, the pilerimport function does not count the mails for the statistics, only the volume. See my first post. If you want, I can send you a screenshot. On 22.04.2013 22:00, Remi S wrote:
-
repo owner merged Remi's changes to piler to the master branch.
-
repo owner I just wonder whether the latest approach makes sense or not.
-
Karsten,
My initial assumption is that the average size of an email message would be relatively constant over time (given a large enough sample size) but the volume would fluctuate greatly as new users were added to the archive. That's why the calculations you noted were not 100% accurate over a small amount of data.
The average size of an email in Piler includes three components:
1 - the metadata of the email/attachment(s) stored in MySQL
2 - the Sphinx index of the email
3 - the actual size of the email and attachment(s) on disk
While the metadata table does include timestamps for each message, the Sphinx index and the files stored on disk do not. Further, there is not a 1 to 1 comparison between the sizes in the database tables and the size on disk, due to deduplication and compression. Because of this, it is only feasible to calculate the size of 2 & 3 for all emails stored in the Piler archive but it is possible to compute the count of emails received over any time period.
So the two options seem to be:
A: As the code was (total daily size average * recent daily count average)
B: As the code is now (total daily size average * total daily count average)
Janos, Have I missed anything?
Which is the better compromise?
Let me know your thoughts, Remi
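To make the difference between options A and B concrete, here is a minimal sketch. The function names and all numbers are hypothetical and are not taken from the Piler health model. The sketch assumes that only the all-time size (message + metadata + index) is available, as described above, while message counts can be taken over any period.

```python
def avg_size_per_message(total_mb: float, total_messages: int) -> float:
    """All-time average size of one message, in MB."""
    if total_messages <= 0:
        return 0.0
    return total_mb / total_messages


def option_a(total_mb, total_messages, recent_messages, recent_days):
    """Option A: all-time average size * recent messages per day."""
    recent_per_day = recent_messages / recent_days
    return avg_size_per_message(total_mb, total_messages) * recent_per_day


def option_b(total_mb, total_messages, archive_age_days):
    """Option B: all-time average size * all-time messages per day.

    Algebraically this reduces to total_mb / archive_age_days.
    """
    all_time_per_day = total_messages / archive_age_days
    return avg_size_per_message(total_mb, total_messages) * all_time_per_day


# Hypothetical archive: 10000 MB and 100000 messages collected over 200 days,
# with 7000 of those messages arriving in the last 7 days.
growth_a = option_a(10_000.0, 100_000, recent_messages=7_000, recent_days=7)
growth_b = option_b(10_000.0, 100_000, archive_age_days=200)

print(f"Option A: {growth_a:.1f} MB/day")
print(f"Option B: {growth_b:.1f} MB/day")
```

With a growing user base the recent count per day is higher than the all-time count per day, so option A projects faster growth than option B for the same archive.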
-
repo owner For me version A looks more reasonable, because I believe it represents the daily or current changes more accurately. Let's assume that for some reason piler receives half of the usual daily volume. I think we get a more accurate prediction that represents the current or actual trend.
Anyway if for some reason version B is preferred for some installations, perhaps a config.php option (eg. SPACE_PROJECTION_BASED_ON_RECENT_DAY = 0|1) may satisfy both needs. I think it's a trivial condition in the code to decide how to calculate. And please make version A (=1) the default.
-
repo owner In the meantime I fixed a bug in pilerimport to count the message numbers, too.
-
repo owner - changed status to resolved
I consider this issue resolved.
This is the result of a compromise in how this was calculated. Since there is no way to accurately determine the size of all three facets (message, metadata, and index) of the Piler data over any time period other than for all-time, I used the all-time size numbers but the weekly or monthly message count numbers when computing the Average Size Per Day. My rationale was that, over time, the average message size would fluctuate less than the message count per day, and thus the projection would be better served by using a more recent value for the message count per day. This holds true for our business model, where we are constantly adding users (and thus receiving more email volume) to the archive but the distribution of message sizes doesn't change that much. I recognize, though, that it may not be valid for everyone. I'd be interested in hearing thoughts on the two approaches.
In my most recent commit, I altered the code to work slightly differently. There is now a method in the health model to determine the date of the oldest message, which allows for an all-time average messages per day number to be calculated. Then, this average count is used to determine the Average Size Per Day instead of the weekly average count.
This alteration changed the numbers on my dev system by about a third (Avg size per day decreased from 13.7M to 9.4M), so it was definitely noticeable.
I also added some more comments to (hopefully) explain the process involved in coming up with the space usage and projection information.
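The all-time average messages per day described above can be sketched as follows. This is an illustration only: the function name, the use of Unix timestamps, and the one-day minimum age are assumptions, not the actual method in the health model.

```python
SECONDS_PER_DAY = 86400


def all_time_messages_per_day(total_messages: int,
                              oldest_message_ts: int,
                              now_ts: int) -> float:
    """Average messages per day since the oldest archived message.

    Timestamps are Unix seconds. The archive age is clamped to at least
    one day (an assumption) so a brand-new archive does not divide by zero.
    """
    age_days = max((now_ts - oldest_message_ts) / SECONDS_PER_DAY, 1.0)
    return total_messages / age_days


# Hypothetical case: 34215 messages whose oldest entry is seven days old.
per_day = all_time_messages_per_day(34215, 0, 7 * SECONDS_PER_DAY)
print(f"{per_day:.0f} messages per day")
```

Multiplying this count by the all-time average message size gives the Average Size Per Day under the newer approach.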