Implement plugin framework (and plugins) for collecting data about runtime job execution.

#352 Merged at b24137a
Repository: galaxy-central-fork-1
Branch: default
Repository: galaxy-central
Branch: default
Author: John Chilton
Reviewers
Description

Overview: Implement plugin framework (and plugins) for collecting data about runtime job execution.

Examples:

A simple example demonstrating the core, uname, meminfo, cpuinfo plugins. (Only admins will see these job metrics in the UI.)

[Screenshot: simple example of core, uname, meminfo, and cpuinfo plugins]

A contrived multi-core example demonstrating job runtime resource usage statistics.

[Screenshot: contrived multi-core collectl example]

Configuration: An example job_metrics_conf.xml.sample is included that describes which plugins are enabled and how they are configured. This will be updated for each new plugin added. By default no instrumentation or data collection occurs, but if a job_metrics_conf.xml file is present it will serve as the default for all job destinations. Additionally, individual job destinations may disable metrics, load a different job metrics file, or define metrics directly in job_conf.xml in an embedded fashion. See the comment at the top of job_metrics_conf.xml for more information.
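As a rough illustration of the configuration described above, the sketch below embeds a hypothetical job_metrics_conf.xml as a string and lists the plugins it enables. The job_metrics root element and the one-child-element-per-plugin layout are assumptions made for this example; only the plugin names come from this description, and job_metrics_conf.xml.sample remains the authoritative reference.

```python
import xml.etree.ElementTree as ET

# Hypothetical config: the <job_metrics> root and the one-element-per-plugin
# layout are assumptions; see job_metrics_conf.xml.sample for the real format.
EXAMPLE_METRICS_CONF = """
<job_metrics>
    <core />
    <cpuinfo />
    <meminfo />
    <uname />
</job_metrics>
"""


def enabled_plugins(conf_xml):
    """Return the plugin names listed in a metrics config, in document order."""
    root = ET.fromstring(conf_xml.strip())
    return [child.tag for child in root]


print(enabled_plugins(EXAMPLE_METRICS_CONF))
```

An empty root element in this sketch would correspond to the default behavior of collecting nothing.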

Current limitations: This only works with job runners utilizing the job script module and the LWR (it utilizes the job script module on the remote server), hence it won't yet work with...

If a job_metrics_conf.xml is present and some jobs route to the above destinations, the jobs won't fail but annoying errors will appear in the logs. Simply attach metrics="off" to these specific job destinations to disable any attempt to use metrics for these jobs and suppress these errors.
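A minimal sketch of that workaround follows. The job_conf.xml fragment is hypothetical: the destination ids, runner names, and surrounding element layout are placeholders chosen for illustration, and only the metrics="off" attribute comes from this description. The helper simply reports which destinations have metrics disabled.

```python
import xml.etree.ElementTree as ET

# Hypothetical job_conf.xml fragment; ids and runner names are placeholders.
# Only the metrics="off" attribute is taken from the description above.
EXAMPLE_JOB_CONF = """
<job_conf>
    <destinations>
        <destination id="supported_example" runner="example_runner" />
        <destination id="unsupported_example" runner="other_runner" metrics="off" />
    </destinations>
</job_conf>
"""


def destinations_without_metrics(job_conf_xml):
    """Return ids of destinations that explicitly set metrics="off"."""
    root = ET.fromstring(job_conf_xml.strip())
    return [
        dest.get("id")
        for dest in root.iter("destination")
        if dest.get("metrics") == "off"
    ]


print(destinations_without_metrics(EXAMPLE_JOB_CONF))
```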

Overview of Plugins:

More complete documentation on each plugin can be found in job_metrics_conf.xml.sample.

core - The core plugin captures the highest priority data, namely the number of cores allocated to the job and the runtime of the job on the cluster, and has no external dependencies. These two pieces of information alone should provide a much clearer picture of what Galaxy is actually allocating cluster compute cycles to.

env - Record runtime environment variables (specific or all) based on configuration.

cpuinfo - Record number of cores or detailed information on each core. (Linux only.)

meminfo - Record total system and swap memory. (Linux only.)

uname - Record operating system details. (Linux only.)

collectl - Provide deep integration with collectl. (Linux only.)

Collectl (http://collectl.sourceforge.net/) is a powerful monitoring utility capable of gathering numerous system and process level statistics of running applications. The Galaxy collectl job metrics plugin by default will grab a variety of process level metrics aggregated across all processes corresponding to a job. This behavior is highly customizable, either using the attributes documented below or by simply hacking up the code in lib/galaxy/jobs/metrics.

Warning: In order to use this plugin collectl must be available on the compute server the job runs on and on the local Galaxy server as well.

Attributes (the following describes attributes that can be used with the collectl job metrics element to modify its behavior):

  • summarize_process_data: Boolean indicating whether to run collectl in playback mode after jobs complete and gather process level statistics for the job run. These statistics can be customized with the process_statistics attribute. (defaults to True)

  • saved_logs_path: If set (it is off by default), all collectl logs will be saved to the specified path after jobs complete. These logs can later be replayed using collectl offline to generate full time-series data corresponding to a job run.

  • subsystems: Comma separated list of collectl subsystems to collect data for. The plugin doesn't currently expose all of them or offer summary data for any of them except process, but extensions would be welcome. It may seem pointless to include subsystems besides process since they won't be processed online by Galaxy, but if saved_logs_path is set these files can be played back at any time. Available subsystems - process, cpu, memory, network, disk. (Default process).

  • process_statistics: If summarize_process_data is enabled, this attribute can be specified as a comma separated list to override the statistics that are gathered. Each statistic is of the form X_Y where X is one of min, max, count, avg, or sum and Y is a value from S, VmSize, VmLck, VmRSS, VmData, VmStk, VmExe, VmLib, CPU, SysT, UsrT, PCT, AccumT, WKB, RKBC, WKBC, RSYS, WSYS, CNCL, MajF, MinF. Consult lib/galaxy/jobs/metrics/collectl/processes.py for more details on what each of these resource types means. Defaults to max_VmSize,avg_VmSize,max_VmRSS,avg_VmRSS,sum_SysT,sum_UsrT,max_PCT,avg_PCT,max_AccumT,sum_RSYS,sum_WSYS, a variety of statistics roughly describing CPU and memory usage of the program and VERY ROUGHLY describing I/O consumption.
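To make the X_Y format of process_statistics concrete, here is a small sketch that splits and validates such a value, assuming entries are comma separated. It is not the parsing code in lib/galaxy/jobs/metrics; the function and constant names are hypothetical, and the allowed statistic and resource names are copied from the attribute description above.

```python
# Hypothetical helper names; allowed values are copied from the description above.
STATISTIC_TYPES = ("min", "max", "count", "avg", "sum")
RESOURCE_TYPES = (
    "S", "VmSize", "VmLck", "VmRSS", "VmData", "VmStk", "VmExe", "VmLib",
    "CPU", "SysT", "UsrT", "PCT", "AccumT", "WKB", "RKBC", "WKBC",
    "RSYS", "WSYS", "CNCL", "MajF", "MinF",
)
DEFAULT_PROCESS_STATISTICS = (
    "max_VmSize,avg_VmSize,max_VmRSS,avg_VmRSS,sum_SysT,sum_UsrT,"
    "max_PCT,avg_PCT,max_AccumT,sum_RSYS,sum_WSYS"
)


def parse_process_statistics(value):
    """Split a comma separated X_Y list into (statistic, resource) pairs."""
    parsed = []
    for item in value.split(","):
        item = item.strip()
        if not item:
            continue  # tolerate trailing commas and blank entries
        statistic, separator, resource = item.partition("_")
        if (
            not separator
            or statistic not in STATISTIC_TYPES
            or resource not in RESOURCE_TYPES
        ):
            raise ValueError("Invalid process statistic: %r" % item)
        parsed.append((statistic, resource))
    return parsed


for statistic, resource in parse_process_statistics(DEFAULT_PROCESS_STATISTICS):
    print(statistic, resource)
```

Rejecting unknown names up front is a design choice of this sketch; it surfaces typos in the configuration rather than silently dropping a statistic.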
