For our effort described in the manuscript Why do Users Kill HPC Jobs?, we gathered the reasons why users kill their jobs on the Beocat HPC cluster at Kansas State University. This repository contains the analysis scripts and data used in that effort.
- Decompress the bz2 files found in input and its descendant folders, leaving each decompressed file in the folder where its archive was found.
- Make sure your system has the sed, awk, and uniq commands.
- Run scripts/masterScript.sh from within a bash shell.
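The setup steps above can be sketched as a small bash script. This is a minimal sketch, not part of the repository: the helper names check_tools and unzip_inputs are hypothetical, and it assumes that bunzip2 is installed and that the script is started from the repository root.

```shell
#!/usr/bin/env bash
# Sketch of the setup steps; assumes it is started from the repository root.
set -euo pipefail

# Report every required command that is missing from PATH.
# Returns non-zero if at least one command is missing.
check_tools() {
  local tool missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing required command: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Decompress every .bz2 file under the given folder, in place.
# -k keeps the original archive next to the decompressed file.
unzip_inputs() {
  find "$1" -type f -name '*.bz2' -exec bunzip2 -k {} +
}

# Only act when the expected repository layout is present.
if [ -d input ] && [ -f scripts/masterScript.sh ]; then
  check_tools sed awk uniq bunzip2
  unzip_inputs input
  bash scripts/masterScript.sh
fi
```

Because bunzip2 refuses to overwrite an existing output file, rerunning the sketch after a successful decompression stops at that step; delete the decompressed files first if a fresh run is needed.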
We anonymized the data set by removing the names of all users who ran jobs and provided reasons for killing jobs, wherever those names occurred in identifiable positions.
- input/sge_accounting_file-08152016-12312017.csv contains the fragment of the SGE accounting file from the Beocat cluster for the period considered in our effort.
- input/aux-data/categories.txt maps categories of reasons to short ids.
- input/aux-data/raw-categorization.txt maps the reasons provided by users
to categories. Fields in order:
- short (reason) category id,
- username, and
- user provided reason.
- input/raw-feedback/output*csv contains the reasons for job deletion as
captured via the qdel command. Fields in order:
- apps used in the job,
- reason for killing the job, and
- dump of info about job. This dump is a "!!!" separated record.
- input/raw-feedback/Job-Feedback-08152016-12312017.csv contains the
reasons for job deletion as captured via a web form. Fields in order:
- reason for killing the job,
- apps used in the job, and
- output/interim/raw-web-feedback.csv consolidates username, jobid, apps used in the job, and reason for killing the job from input/raw-feedback/Job-Feedback-08152016-12312017.csv.
- output/interim/raw-qdel-feedback.csv consolidates username, jobid, apps used in the job, and reason for killing the job from input/raw-feedback/output.csv*.
- output/interim/raw-consolidated-feedback.csv consolidates data from interim/raw-qdel-feedback.csv and interim/raw-web-feedback.csv.
- output/categorization.txt contains the (re)anonymized categorization of reasons from input/aux-data/raw-categorization.txt. Reanonymization exists for historical reasons and to keep changes to the earlier workflow minimal.
- output/feedback.csv contains (re)anonymized feedback/reasons from interim/raw-consolidated-feedback.csv.
- output/feedback-trend.csv contains the time trend of feedback/reason frequency.
- output/feedback-categories-trend.csv contains the time trend of feedback/reason categories (across groups of 1000 users).
- output/feedback-categories-stats.csv contains the total amount of cpu time, memory, io, and max vmem consumed by jobs killed for a given reason (or category of reasons).
- output/sge-time-info.csv contains the total amount of wait and execution time (both wall clock and cpu) consumed by jobs exiting with a specific exit code according to the SGE accounting file.
- output/sge-based-qdel-time-info.csv contains the wait and execution time (both wall clock and cpu) consumed by killed jobs for which we have some information about the reason for killing them.
- output/sge-based-qdel-time-info.csv contains the total amount of wait and execution time (both wall clock and cpu) consumed by jobs killed for a specific reason.
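The job info dump in the qdel feedback files is described above as a "!!!" separated record. As a minimal sketch of how such a record could be inspected, the snippet below splits a dump into one field per line with awk. The function name split_dump and the sample record are hypothetical; the actual contents of the dump fields are not documented here, and the repository scripts may process the dump differently.

```shell
#!/usr/bin/env bash
# Split a "!!!" separated record, read from stdin, into one field per line.
split_dump() {
  awk -F'!!!' '{ for (i = 1; i <= NF; i++) print $i }'
}

# Hypothetical record, used only to illustrate the separator.
printf '%s\n' 'field-a!!!field-b!!!field-c' | split_dump
```

The sketch expects the dump to have been extracted from its CSV field already; any CSV quoting around the dump would need to be removed first.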
Copyright (c) 2018, Kansas State University
Licensed under BSD 3-clause "New" or "Revised" License
Authors: Venkatesh-Prasad Ranganath