nagios / check_mk_plugins

The following addons are available and stable:

Transfer agent for pushing agent results TO the server. See the README for documentation.


A Python CLI for remote access to the Check_MK multisite JSON interface - most work by alex, who codes like a god; ideas by me :)

For use in scripting, or generally to save you from logging into a web browser.

The following check_mk plugins are available and stable:



This plugin monitors / inventorizes Xen dom0 and domU systems.
It currently tracks VM status and memory usage.

It used to be able to switch between libvirt, xm and xl, but this added too much complexity and too many bugs. xm and xl will stay supported; libvirt is out (but feel free to look at the archives).

The plugin delivers the following checks:

  • xen.vms
  • xen.mem
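As a rough illustration of what the xen.vms side does, here is a minimal sketch of a check that parses `xl list` output from the agent. The function names, column handling and state logic are my assumptions, not the plugin's actual code:

```python
# Hypothetical sketch of a check_mk-style xen.vms check built on `xl list`
# agent output. Names and state handling are assumptions.

def parse_xl_list(lines):
    """Parse `xl list` output into {vm_name: (mem_mb, state)}."""
    vms = {}
    for line in lines[1:]:  # skip the header line
        parts = line.split()
        if len(parts) < 6:
            continue
        name, _domid, mem, _vcpus, state, _time = parts[:6]
        vms[name] = (int(mem), state)
    return vms

def check_xen_vms(item, vms):
    """Return (nagios_state, message) for one VM: 0=OK, 2=CRIT."""
    if item not in vms:
        return 2, "VM %s not found" % item
    mem, state = vms[item]
    if "r" in state or "b" in state:  # running or blocked (idle) is fine
        return 0, "VM %s is up (%d MB, state %s)" % (item, mem, state)
    return 2, "VM %s in state %s" % (item, state)
```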

Note about XCP / XenServer support:

It is unsupported and will not be added.
At $dayjob we can quickly implement this support as a consulting task. See the official check_mk website for this.

(The uncounted Citrix XenServer bugs have cost me over 100 hours of sleep since 3.x, so I'm strongly against spending any more time on this chaotic thing that hands out random MAC addresses, runs on billions of UUIDs and ships with all IPv6 support disabled... I'm just not putting another second of my free time into this "software")



Report ECC / chipkill memory and PCI checksum error status on Linux. The check depends on working EDAC memory controller drivers for your CPU. This is fairly easy with whitebox AMD systems, harder for systems that have a full management processor or a fairly new Intel CPU (two years-ish is new), since these "CPU" drivers always seem to lag, especially on enterprise distros.


Memory ECC error checking is implemented. The agent reports correctable vs. uncorrectable errors and the memory type.

Bonus points:

  • the agent could also use the configured ECC type to adjust severity
  • the agent could also check whether the DIMMs are labeled and include that info; in practice this is rarely implemented in Linux, so it could be added by people who "got it right"

PCI error checking is not implemented.

In practice, PCI error reporting on Linux is still at a child(ish) stage and I'm not sure it can even be relied on.
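To make the EDAC dependency concrete, here is a minimal agent-side sketch that reads the standard sysfs counters (ce_count = correctable, ue_count = uncorrectable) per memory controller. The section name and output format are my assumptions, not necessarily what the plugin emits:

```python
# Hedged sketch: collect EDAC error counters from sysfs.
# Output section name "<<<edac>>>" is an assumption.

import glob
import os

def read_edac(base="/sys/devices/system/edac/mc"):
    """Yield (controller, correctable, uncorrectable) per memory controller."""
    for mc in sorted(glob.glob(os.path.join(base, "mc[0-9]*"))):
        counts = []
        for name in ("ce_count", "ue_count"):
            try:
                with open(os.path.join(mc, name)) as f:
                    counts.append(int(f.read().strip()))
            except (IOError, OSError, ValueError):
                counts.append(-1)  # counter missing or unreadable
        yield os.path.basename(mc), counts[0], counts[1]

if __name__ == "__main__":
    print("<<<edac>>>")
    for mc, ce, ue in read_edac():
        print("%s %d %d" % (mc, ce, ue))
```

If the EDAC driver for your CPU isn't loaded, the mc directories simply don't exist and the agent section stays empty, which matches the caveat above.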

The following check_mk plugins are not or partially implemented:



This collection of checks monitors all core pieces of Areca RAID HBAs via their SNMP management interface.


Works for everything I can test:
the check currently recognizes hardware sensors such as voltage. Disks with their enclosure slots, raidsets and volumesets are also recognized, and alerts are raised if something is out of the ordinary. Disk and volume set failure is detected and % rebuilt is calculated. Fan and BBU monitoring cannot trigger alerts / checks since I don't have the hardware to test - I'm testing against an AR-1680ix-20, which has no fan and no BBU. Need testers.



Tracking of GlusterFS storage health status.


Collected a lot of commands and prepared a frame for the check.
Not sure I want to continue; there are many errors that are not caught by the gluster status tools.
Like... the check would be helpful by reporting WARN/CRIT, but you could never trust an "OK", because much more checking is needed to tell that GlusterFS is happy.
On the other hand, they already handed out a t-shirt to someone who added monitoring support for a different monitoring system. Good incentive, although it would have to be more than one for a *great* agent :)
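The "never trust an OK" caveat could be baked into the check itself. A minimal sketch, assuming typical `gluster peer status` output (the parsing and the expected-peer parameter are my assumptions):

```python
# Hedged sketch: count connected peers; an OK only means "nothing
# obviously broken", which the message states explicitly.

def check_gluster_peers(output, expected_peers):
    """Return (nagios_state, message) from `gluster peer status` text."""
    connected = output.count("(Connected)")
    if connected < expected_peers:
        return 2, "only %d of %d peers connected" % (connected, expected_peers)
    return 0, "%d peers connected (OK is not proof of overall health)" % connected
```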



Verify the virus database is up to date.


Identified usable commands and tested that they work.
Cannot continue, as I'm not smart enough to think up a good caching scheme (it must only check every X hours and out of band from the agent run).



Verify controller battery, LUN and physical disk states.


Identified the smallest possible install of the utilities.
Dug out and recorded all the commands needed.
Tested adding a LUN.
Tested for quick run time.
Verified it's possible to build the check and that there are no quirks.





I'm so building the Ceph agent once it's time to do it.
Right now it would not be much good yet; `ceph -s` output is not enough for monitoring.
(The status needs a local scope, and the failed component must be identifiable without going all debug-ish.)

SO many more already.