Overview: Implement plugin framework (and plugins) for collecting data about runtime job execution.
A simple example demonstrating the core, uname, meminfo, cpuinfo plugins. (Only admins will see these job metrics in the UI.)
A contrived multi-core example demonstrating job runtime resource usage statistics.
Configuration: An example job_metrics_conf.xml.sample is included that describes which plugins are enabled and how they are configured. This will be updated for each new plugin added. By default no instrumentation or data collection occurs, but if a job_metrics_conf.xml file is present it will serve as the default for all job destinations. Additionally, individual job destinations may disable metrics, load a different job metrics file, or define metrics directly in job_conf.xml in an embedded fashion. See comment at top of job_metrics_conf.xml for more information.
Current limitations: This only works with job runners utilizing the job script module and the LWR (it utilizes the job script module on the remote server), hence it won't yet work with...
CLI runner - The CLI runner needs to be reworked to use the job script module anyway so that GALAXY_SLOTS works. The LWR version of the CLI runner uses the job script module; this work just needs to be backported to Galaxy.
If a job_metrics_conf.xml is present and some jobs route to the above destinations, the jobs won't fail but annoying errors will appear in the logs. Simply attach metrics="off" to these specific job destinations to disable any attempt to use metrics for these jobs and suppress these errors.
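To check which destinations already opt out, something like the following sketch could be used. It assumes destinations are declared as destination elements carrying id and runner attributes; the destination ids, the runner names, and the helper function are made up for illustration and are not part of this pull request.

```python
import xml.etree.ElementTree as ET

# Hypothetical job_conf.xml fragment. Only the metrics="off" attribute comes
# from the description above; ids and runner names are illustrative.
JOB_CONF = """
<job_conf>
  <destinations default="cluster">
    <destination id="cluster" runner="drmaa"/>
    <destination id="cli_host" runner="cli" metrics="off"/>
  </destinations>
</job_conf>
"""


def destinations_with_metrics_off(job_conf_xml):
    """Return the ids of destinations that carry metrics="off"."""
    root = ET.fromstring(job_conf_xml)
    return [
        dest.get("id")
        for dest in root.iter("destination")
        if dest.get("metrics") == "off"
    ]


if __name__ == "__main__":
    print(destinations_with_metrics_off(JOB_CONF))
```

Destinations missing from the returned list would still fall back to job_metrics_conf.xml when that file is present.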
Overview of Plugins:
More complete documentation on each plugin can be found in job_metrics_conf.xml.sample.
core - The core plugin captures the highest-priority data, namely the number of cores allocated to the job and the runtime of the job on the cluster, and has no external dependencies. These two pieces of information alone should provide a much clearer picture of what Galaxy is actually allocating cluster compute cycles to.
env - Record runtime environment variables (specific or all) based on configuration.
cpuinfo - Record number of cores or detailed information on each core. (Linux only.)
meminfo - Record total system and swap memory. (Linux only.)
uname - Record operating system details. (Linux only.)
collectl - Provide deep integration with collectl. (Linux only.)
Collectl (http://collectl.sourceforge.net/) is a powerful monitoring utility capable of gathering numerous system and process level statistics of running applications. By default, the Galaxy collectl job metrics plugin will grab a variety of process level metrics aggregated across all processes corresponding to a job. This behavior is highly customizable, either by using the attributes documented below or by simply hacking up the code in lib/galaxy/jobs/metrics.
Warning: To use this plugin, collectl must be available on the compute server the job runs on and on the local Galaxy server as well.
Attributes (the following describes attributes that can be used with the collectl job metrics element above to modify its behavior).
summarize_process_data: Boolean indicating whether to run collectl
in playback mode after jobs complete and gather process level
statistics for the job run. These statistics can be customized
with the process_statistics attribute. (defaults to True)
saved_logs_path: If set (it is off by default), all collectl logs
will be saved to the specified path after jobs complete. These
logs can later be replayed using collectl offline to generate
full time-series data corresponding to a job run.
subsystems: Comma separated list of collectl subsystems to collect
data for. The plugin doesn't currently expose all of them or offer
summary data for any of them except process, but extensions
would be welcome. It may seem pointless to include subsystems besides
process since they won't be processed online by Galaxy, but if
saved_logs_path is set these files can be played back at any time.
Available subsystems: process, cpu, memory, network,
disk. (Default: process.)
process_statistics: If summarize_process_data is enabled, this attribute can
be specified as a comma separated list to override the statistics
that are gathered. Each statistic is of the form X_Y where X is
one of min, max, count, avg, or sum and Y is a value
from S, VmSize, VmLck, VmRSS, VmData, VmStk, VmExe,
VmLib, CPU, SysT, UsrT, PCT, AccumT, WKB, RKBC,
WKBC, RSYS, WSYS, CNCL, MajF, MinF. Consult lib/galaxy/jobs/metrics/collectl/processes.py for more details on
what each of these resource types means. Defaults to max_VmSize,avg_VmSize,max_VmRSS,avg_VmRSS,sum_SysT,sum_UsrT,max_PCT,avg_PCT,max_AccumT,sum_RSYS,sum_WSYS, a variety of statistics roughly describing CPU and memory usage of the program and VERY ROUGHLY describing I/O consumption.
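To illustrate the X_Y naming scheme for process_statistics, here is a minimal sketch of how such a list could be validated. It is not the code in lib/galaxy/jobs/metrics/collectl; the function and constant names are hypothetical, and it assumes AccumT and WKB are separate resource names, since the default list uses max_AccumT.

```python
# Aggregate prefixes (X) and resource names (Y) as listed in the description.
# Assumption: AccumT and WKB are two distinct resource names.
AGGREGATES = {"min", "max", "count", "avg", "sum"}
RESOURCES = {
    "S", "VmSize", "VmLck", "VmRSS", "VmData", "VmStk", "VmExe", "VmLib",
    "CPU", "SysT", "UsrT", "PCT", "AccumT", "WKB", "RKBC", "WKBC",
    "RSYS", "WSYS", "CNCL", "MajF", "MinF",
}

DEFAULT_STATISTICS = (
    "max_VmSize,avg_VmSize,max_VmRSS,avg_VmRSS,sum_SysT,sum_UsrT,"
    "max_PCT,avg_PCT,max_AccumT,sum_RSYS,sum_WSYS"
)


def parse_process_statistics(spec):
    """Split a comma separated X_Y list into (aggregate, resource) pairs."""
    parsed = []
    for item in spec.split(","):
        item = item.strip()
        # Split on the first underscore only: X is the aggregate, Y the resource.
        aggregate, sep, resource = item.partition("_")
        if not sep or aggregate not in AGGREGATES or resource not in RESOURCES:
            raise ValueError("Unknown process statistic: %r" % item)
        parsed.append((aggregate, resource))
    return parsed


if __name__ == "__main__":
    for aggregate, resource in parse_process_statistics(DEFAULT_STATISTICS):
        print(aggregate, resource)
```

Rejecting unknown names up front would surface a typo in the attribute at configuration time rather than after a job has run.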
That would be incredibly useful for us, we need job runtime statistics here at CRS4. Please go ahead!
Thanks - I will merge after release Monday. Also thanks for agreeing to make CRS4 my beta tester :).