Wrong Space Projection calculation

Issue #82 resolved
Karsten Bandlow created an issue

The calculation of the average message size is wrong, because the total message size is divided by the number of received messages (not by the total number of messages). Sample:

Total space consumption: 2400 MB
Total messages: 34215
Received messages: 850

2400 MB / 850 = 2.8 MB

But the total number of messages is 34215, so the right calculation is 2400 MB / 34215

or

the size of the 850 received messages divided by 850.
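
The two divisions above can be checked with a few lines of Python. This is only an arithmetic sketch using the sample figures from this report; the variable names are made up for illustration and are not taken from the Piler code.

```python
# Sample figures from the report above.
total_space_mb = 2400      # total space consumption in MB
total_messages = 34215     # all messages in the archive
received_messages = 850    # messages counted as "received"

# What the report says happens: total size / received messages.
reported_avg_mb = total_space_mb / received_messages

# What the reporter expects: total size / total messages.
expected_avg_mb = total_space_mb / total_messages

print(f"reported: {reported_avg_mb:.2f} MB per message")
print(f"expected: {expected_avg_mb:.3f} MB per message")
```

With these figures the two averages differ by a factor of about 40 (34215 / 850).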

Cumulative Counts
Processed emails: 115 (24 hours), 34215 (1 week), 34215 (30 days)

Message Disposition
Received messages: 850
Infected messages: 0
Duplicated messages: 185
Ignored messages: 0

Space Projection
Average Messages per Day: 4888
*Average Message + Metadata + Index Size: 2.8 M + 74 k + 0.3 M*
Average Size per Day: 15342.5 M
"/var" Partition Projected to be Full in: 0 years, 0 months, 0 days
Usage Trend: Increasing

Comments (8)

  1. Remi S

    This is the result of a compromise in how this was calculated. Since there is no way to accurately determine the size of all three facets (message, metadata, and index) of the Piler data over any time period other than all-time, I used the all-time size numbers but the weekly or monthly message count numbers when computing the Average Size Per Day. My rationale was that, over time, the average message size would fluctuate less than the message count per day, and thus the projection would be better served by using a more recent value for the message count per day. This holds true for our business model, where we are constantly adding users to the archive (and thus receiving more email volume) but the distribution of message sizes doesn't change that much. I recognize, though, that it may not be valid for everyone. I'd be interested in hearing thoughts on the two approaches.

    In my most recent commit, I altered the code to work slightly differently. There is now a method in the health model to determine the date of the oldest message, which allows for an all-time average messages per day number to be calculated. Then, this average count is used to determine the Average Size Per Day instead of the weekly average count.

    This alteration changed the numbers on my dev system by about a third (Avg size per day decreased from 13.7M to 9.4M), so it was definitely noticeable.

    I also added some more comments to (hopefully) explain the process involved in coming up with the space usage and projection information.
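
The all-time approach described in the comment above could be sketched roughly as follows. This is an illustrative Python sketch, not the actual health model code: the function names and the sample numbers are assumptions, and only the idea (oldest message date, then all-time average count, then Average Size Per Day) comes from the comment.

```python
from datetime import date


def all_time_messages_per_day(total_messages, oldest_message, today):
    """Average messages per day since the oldest archived message."""
    # Guard against a zero-day archive so we never divide by zero.
    days = max((today - oldest_message).days, 1)
    return total_messages / days


def average_size_per_day(avg_message_size_mb, messages_per_day):
    """Average size per day = average size per message * messages per day."""
    return avg_message_size_mb * messages_per_day


# Hypothetical archive: 36500 messages collected over exactly 365 days.
per_day = all_time_messages_per_day(36500, date(2012, 4, 23), date(2013, 4, 23))
size_per_day_mb = average_size_per_day(0.1, per_day)
print(per_day, size_per_day_mb)
```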

  2. Karsten Bandlow reporter

    Hello Remi, in my experience the average mail volume increases every year. So for a future projection it is very important how many mails were stored in the archive in the last 3 months. The average mail size is also important. In my case it is shown as a few MB, which is definitely wrong.

    As far as I can see, a size column exists in the MySQL metadata and attachment tables.

    Also, the pilerimport function does not count the mails for the statistics, only their volume. See my first post. I can send a screenshot if you want. On 22.04.2013 at 22:00, Remi S wrote:

  3. Remi S

    Karsten,

    My initial assumption is that the average size of an email message would be relatively constant over time (given a large enough sample size) but the volume would fluctuate greatly as new users were added to the archive. That's why the calculations you noted were not 100% accurate over a small amount of data.

    The average size of an email in Piler includes three components:
    1 - the metadata of the email/attachment(s) stored in MySQL
    2 - the Sphinx index of the email
    3 - the actual size of the email and attachment(s) on disk

    While the metadata table does include timestamps for each message, the Sphinx index and the files stored on disk do not. Further, there is not a 1 to 1 comparison between the sizes in the database tables and the size on disk, due to deduplication and compression. Because of this, it is only feasible to calculate the size of 2 & 3 for all emails stored in the Piler archive but it is possible to compute the count of emails received over any time period.

    So the two options seem to be:
    A: As the code was (total daily size average * recent daily count average)
    B: As the code is now (total daily size average * total daily count average)

    Janos, have I missed anything?

    Which is the better compromise?

    Let me know your thoughts, Remi
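
Options A and B from the comment above can be compared side by side in a small sketch. It reads "size average" as the all-time average size per message (total size divided by total message count), which is an interpretation of the comment rather than something confirmed by the code; the class, field names, and numbers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ArchiveStats:
    total_size_mb: float      # message + metadata + index, all time
    total_messages: int       # all messages in the archive
    archive_age_days: int     # days since the oldest message
    recent_messages: int      # messages counted in the recent window
    recent_window_days: int   # length of the recent window in days


def avg_size_mb(s: ArchiveStats) -> float:
    """All-time average size per message."""
    return s.total_size_mb / s.total_messages


def option_a(s: ArchiveStats) -> float:
    """All-time average size * recent average messages per day."""
    return avg_size_mb(s) * s.recent_messages / s.recent_window_days


def option_b(s: ArchiveStats) -> float:
    """All-time average size * all-time average messages per day."""
    return avg_size_mb(s) * s.total_messages / s.archive_age_days


# Hypothetical archive whose daily volume has recently doubled.
stats = ArchiveStats(10000.0, 100000, 500, 2800, 7)
print(option_a(stats), option_b(stats))
```

Under this reading, option B reduces to total size divided by archive age, so it only reacts slowly to a recent change in volume, while option A follows the recent message count.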

  4. Janos SUTO repo owner

    To me, version A looks more reasonable, because I believe it represents the daily or current changes more accurately. Let's assume that for some reason piler receives half of the usual daily volume. In that case I think version A gives a more accurate prediction, because it reflects the current trend.

    Anyway, if version B is preferred for some installations, perhaps a config.php option (e.g. SPACE_PROJECTION_BASED_ON_RECENT_DAY = 0|1) may satisfy both needs. I think it's a trivial condition in the code to decide how to calculate. And please make version A (=1) the default.
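
The switch suggested above might look something like this. The option name is the one proposed in the comment; everything else (the function, its parameters, the numbers) is a hypothetical Python illustration, since the real option would live in config.php and the PHP GUI code.

```python
# 1 = version A (recent daily count, the suggested default),
# 0 = version B (all-time daily count).
SPACE_PROJECTION_BASED_ON_RECENT_DAY = 1


def average_size_per_day(avg_size_mb, recent_per_day, all_time_per_day,
                         based_on_recent=SPACE_PROJECTION_BASED_ON_RECENT_DAY):
    """Pick the message count according to the option, then scale by size."""
    per_day = recent_per_day if based_on_recent else all_time_per_day
    return avg_size_mb * per_day


# Scenario from the comment: the archive receives half of the usual volume.
usual_per_day = 200
recent_per_day = usual_per_day / 2

version_a = average_size_per_day(0.1, recent_per_day, usual_per_day)
version_b = average_size_per_day(0.1, recent_per_day, usual_per_day,
                                 based_on_recent=0)
print(version_a, version_b)
```

In this scenario version A halves the projected daily growth right away, while version B keeps projecting the long-term rate.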
