Aggregate, store, and usefully display metrics.
- Defining Use Case:
- Cost should never be a concern when adding a new metric.
- Exploratory - The user can search, compare, and zoom in on interesting anomalies.
- Responsive - The whole system (UI, storage, retrieval) is low latency.
- Interactive - Users can share specific reports, plots, and settings with a simple permanent URL.
- Scaling: FIXME: Are these roughly in descending priority order?
- Many Metrics - one million distinct metrics per node in the collection cluster.
- Complete History - never delete/forget metrics; store all data for all time (in contrast to RRDtool).
- Horizontally Scalable - if your Saturnalia hosts are falling over, it should be simple to add new ones to properly alleviate load.
- High Frequency - handle high-frequency metrics, such as once per second.
- Robustness: FIXME: Are these roughly in descending priority order?
- Data Integrity - Never record incorrect data.
- Crash-Only - All components are built to expect their own spontaneous failure. See Crash-Only Design.
- Defensive Decoupling - A component should continue to operate even when other components have failed.
- Failover - Even when a component fails, the functioning components should accommodate the resulting changes in their own load for some amount of time.
- High Availability - Accept new metrics even when subcomponents are unresponsive. FIXME: Is this a goal? Does this conflict with other robustness goals?
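
To get a feel for what the scaling goals imply together, here is a rough back-of-envelope sketch. It assumes the worst case where every one of the one million metrics on a node is sampled once per second, and it assumes an uncompressed sample of 16 bytes (8-byte timestamp plus 8-byte value). Neither assumption comes from the goals above, and the function names are hypothetical; adjust them to whatever the real storage format turns out to be.

```python
# Back-of-envelope sizing for the Many Metrics, Complete History,
# and High Frequency goals. Not a description of any real storage format.

METRICS_PER_NODE = 1_000_000  # from the "Many Metrics" goal
SAMPLE_INTERVAL_S = 1         # from the "High Frequency" goal (once per second)
BYTES_PER_SAMPLE = 16         # ASSUMPTION: 8-byte timestamp + 8-byte value, uncompressed

SECONDS_PER_DAY = 24 * 60 * 60


def samples_per_day(metrics: int, interval_s: int) -> int:
    """Samples a single node ingests per day if every metric reports each interval."""
    return metrics * (SECONDS_PER_DAY // interval_s)


def bytes_per_day(metrics: int, interval_s: int, bytes_per_sample: int) -> int:
    """Raw bytes a single node must retain per day under the same assumption."""
    return samples_per_day(metrics, interval_s) * bytes_per_sample


if __name__ == "__main__":
    n = samples_per_day(METRICS_PER_NODE, SAMPLE_INTERVAL_S)
    b = bytes_per_day(METRICS_PER_NODE, SAMPLE_INTERVAL_S, BYTES_PER_SAMPLE)
    print(f"{n:,} samples/day per node")
    print(f"{b / 1e12:.2f} TB/day per node, uncompressed")
```

Under these assumptions, a node that never deletes data accumulates on the order of a terabyte per day, which is one reason the priority order of the scaling goals matters.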