Dozor monitoring system architecture overview

Components:

1. message bus (redis pubsub)
2. dozor agent
3. graphite collector
4. alert collector

## Dozor agent

- nodejs-based daemon
- every plugin is a child process, controlled by the master process over IPC

### master process role

- start a process for every plugin ( plugins/<plugin_name>/index.js ) using child_process.fork
- receive 'register' messages for plugin stats tracking
- receive 'fire' messages
- forward 'fire' messages to message bus
- process commands from 'dozor.control' messagebus channel
  - stats   => report versions & uptime of loaded plugins
  - restart => restart all plugins

#### IPC message format:

  'type' : < fire | error | register >,
  'from' : < plugin_id >,
  'desc' : < plugin message description >,
  'data' : < plugin data which can be used for graphing >

Example:

  'type' : 'fire',
  'from' : 'os/generic/load_average@0.1',
  'desc' : 'Load average is out of bounds',
  'data' : '1.32'
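
On the plugin side, building and sending such a message might look like the sketch below. `makeIpcMessage` and `sendToMaster` are hypothetical helper names. `process.send` is the standard Node.js call, and it exists only when the process was started with an IPC channel (for example via fork).

```javascript
// Build an IPC message with the four documented fields.
function makeIpcMessage(type, from, desc, data) {
  return { type: type, from: from, desc: desc, data: data };
}

// Send to the master when running as a forked child.
// Returns false when there is no IPC channel (e.g. run standalone).
function sendToMaster(message) {
  if (typeof process.send === 'function') {
    process.send(message);
    return true;
  }
  return false;
}

const example = makeIpcMessage(
  'fire',
  'os/generic/load_average@0.1',
  'Load average is out of bounds',
  '1.32'
);
sendToMaster(example);
```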

#### Messagebus message format:

Same as the IPC format, with two fields added by the master process:

  'time' : <unix timestamp in msec>,
  'host' : <hostname>
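
A minimal sketch of how the master could add these two fields before publishing. `toBusMessage` is a hypothetical name, and serializing the result as JSON for the bus is an assumption, not something stated here.

```javascript
const os = require('os');

// Copy the IPC message and add 'time' (unix timestamp in msec) and 'host'.
// `now` and `host` are parameters only to make the function easy to test.
function toBusMessage(ipcMsg, now = Date.now(), host = os.hostname()) {
  return Object.assign({}, ipcMsg, { time: now, host: host });
}

// Assumption: the payload is sent over the bus as a JSON string.
const payload = JSON.stringify(
  toBusMessage({
    type: 'fire',
    from: 'os/generic/load_average@0.1',
    desc: 'Load average is out of bounds',
    data: '1.32'
  })
);
```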

### dozor plugins

Every plugin runs as a separate process, connected to the master process via IPC.

Plugin directory structure:

  plugins/<plugin_name>/
      index.js     => plugin bootstrap script
      plugin.js    => plugin body
      lib/         => optional library code

The overall plugin logic is:

- monitor constantly (if possible) or at given intervals for an
  event and, whenever the event occurs, send a 'fire' IPC message to
  master; that is, the plugin sends 'fire' messages whenever, and for
  as long as, the check condition is true

Examples:

- load average plugin invokes the 'uptime' utility every 5 seconds; if
  the LA value is greater than 1.00, the plugin sends a 'fire' message,
  so during a high-load period the load average plugin sends 'fire'
  every 5 seconds

- file alteration plugin monitors /etc/shadow for changes; when the
  file changes it sends a single 'fire' message to the master
  process, because the check condition is true only at the very moment
  of change, not after it
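
The two styles, interval-based and event-based, can be sketched as below. This is an illustration under stated assumptions: it reads the load average with Node's `os.loadavg()` instead of invoking 'uptime', the helper names are hypothetical, and so is the file alteration plugin id and description.

```javascript
const os = require('os');
const fs = require('fs');

// One interval-style check: returns a 'fire' message while the
// condition holds, otherwise null. `readLoad` is injectable for testing.
function checkLoadAverage(threshold, readLoad = () => os.loadavg()[0]) {
  const la = readLoad();
  if (la > threshold) {
    return {
      type: 'fire',
      from: 'os/generic/load_average@0.1',
      desc: 'Load average is out of bounds',
      data: la.toFixed(2)
    };
  }
  return null;
}

// Run a check at a fixed interval; fires on every tick where it is true.
function startPolling(check, intervalMs, send) {
  return setInterval(() => {
    const message = check();
    if (message) send(message);
  }, intervalMs);
}

// Event-style check: fires once per change notification from the OS.
function watchForChange(filePath, send) {
  return fs.watch(filePath, (eventType) => {
    if (eventType === 'change') {
      send({
        type: 'fire',
        from: 'os/generic/file_alteration@0.1', // hypothetical plugin id
        desc: 'File has been changed',          // hypothetical description
        data: filePath
      });
    }
  });
}
```

A load average plugin would then call something like `startPolling(() => checkLoadAverage(1.00), 5000, send)`, while the file alteration plugin would call `watchForChange('/etc/shadow', send)`.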

It is preferred to use available OS kernel interfaces that deliver the
information of interest as soon as possible.

#### JavaScript plugin interface

Every plugin should be bootstrapped by the index.js script in the plugin directory.
The bootstrap file contains:

  #!/usr/bin/env node

  var plugin = require('./plugin.js');


plugin.js should export a Plugin object. The Plugin object should be
prototyped from DozorPlugin, with the following properties and methods
added.

Properties:

- name : plugin name
- version : plugin version
- category : plugin category
- fire_description : descriptive information for 'fire' events
- id : normally equal to this.category + '/' + this.name + '@' + this.version
- exec_input : array or object with additional data for the 'exec' method

Methods:

- exec : comes in two different flavors

  a) an array containing the filesystem path to a binary/script and its
     command-line arguments;
  b) a function which emits a 'fire' event

In scenario "a" (calling an external binary/script), two additional
functions have to be defined:

- filter : function which is used to filter the output of the
  binary/script, for example, to get the LA per minute from 'uptime'

- validate : function which returns true when the data returned by
  filter satisfies the fire condition
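
A plugin.js for scenario "a" might look roughly like the sketch below. The real DozorPlugin base is not shown in this document, so an empty stand-in is declared here. The '/usr/bin/uptime' path, the regular expression and the `runOnce` driver are likewise assumptions made for illustration.

```javascript
const { execFile } = require('child_process');

// Hypothetical stand-in for the real DozorPlugin base object.
function DozorPlugin() {}

const Plugin = Object.create(DozorPlugin.prototype);
Plugin.name = 'load_average';
Plugin.version = '0.1';
Plugin.category = 'os/generic';
Plugin.fire_description = 'Load average is out of bounds';
Plugin.id = Plugin.category + '/' + Plugin.name + '@' + Plugin.version;

// Flavor "a": path to a binary followed by its command-line arguments.
Plugin.exec = ['/usr/bin/uptime']; // assumed location of uptime

// Extract the 1-minute load average from the 'uptime' output.
Plugin.filter = function (output) {
  const match = /load averages?:\s*([0-9]+[.,][0-9]+)/.exec(output);
  return match ? parseFloat(match[1].replace(',', '.')) : NaN;
};

// True when the filtered value should trigger a 'fire' message.
Plugin.validate = function (value) {
  return value > 1.00;
};

// Assumed driver: run the binary once, then filter and validate its output.
function runOnce(plugin, send) {
  execFile(plugin.exec[0], plugin.exec.slice(1), (err, stdout) => {
    if (err) {
      send({ type: 'error', from: plugin.id, desc: err.message, data: '' });
      return;
    }
    const value = plugin.filter(stdout);
    if (plugin.validate(value)) {
      send({ type: 'fire', from: plugin.id, desc: plugin.fire_description, data: String(value) });
    }
  });
}

module.exports = Plugin;
```

Keeping filter and validate separate means the parsed value can still be passed on as 'data' for graphing, independently of whether it triggers a 'fire'.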

## graphite collector

A bridge between dozor & graphite.
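
As an illustration of what such a bridge has to do, the sketch below turns a messagebus message into one line of Graphite's plaintext protocol (`<metric path> <value> <timestamp in seconds>`). The metric naming scheme (a 'dozor.' prefix, host with dots replaced, plugin id without its version) is purely an assumption, as is the function name.

```javascript
// Convert a messagebus message to a Graphite plaintext protocol line.
function toGraphiteLine(busMsg) {
  // 'os/generic/load_average@0.1' -> 'os.generic.load_average'
  const metric = busMsg.from
    .replace(/@.*$/, '')
    .replace(/[^A-Za-z0-9_]+/g, '.');
  // Dots in hostnames would create extra path levels, so replace them.
  const host = busMsg.host.replace(/\./g, '_');
  // Graphite expects seconds; the bus carries milliseconds.
  const seconds = Math.floor(busMsg.time / 1000);
  return 'dozor.' + host + '.' + metric + ' ' + busMsg.data + ' ' + seconds + '\n';
}
```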

## alert collector

Reads 'fire' events from numerous redis pubsub channels, aggregates
and filters them, and sends alerts (http, xmpp). Only near-realtime
alerting services are used, i.e. no email.
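
One possible filtering step is sketched below: because a plugin keeps sending 'fire' for as long as its condition holds, the collector could suppress repeats for the same host and plugin within a time window. The policy and the names `createAlertFilter` and `shouldAlert` are hypothetical, not a description of the actual collector.

```javascript
// Returns a function that decides whether a bus message should produce
// an alert, allowing at most one alert per host+plugin per window.
function createAlertFilter(windowMs) {
  const lastSent = new Map(); // key -> time of the last alert (msec)
  return function shouldAlert(busMsg) {
    const key = busMsg.host + '|' + busMsg.from;
    const previous = lastSent.get(key);
    if (previous !== undefined && busMsg.time - previous < windowMs) {
      return false; // still inside the suppression window
    }
    lastSent.set(key, busMsg.time);
    return true;
  };
}
```

With a 60-second window, a load average plugin firing every 5 seconds would produce one alert per minute per host instead of twelve.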