README: classh

classh is the "Cluster Administrator's ssh" tool. It is yet another wrapper around ssh for running commands on a number of hosts concurrently, similar to xCAT, pssh, Cluster ssh, and a gaggle of other utilities.

The astute reader will already be asking: WHY release ANOTHER such tool?

A few years ago I needed something like this and surveyed the tools available at the time. My requirements were:

  • Must handle thousands, preferably tens of thousands of targets
  • Must run a reasonable number (100s) of the jobs concurrently
  • Must be able to capture the output, error messages, and exit values for each job in a manner amenable to automated post-processing
  • Must be reliable enough not to stall, hang, or crash "no matter what"
  • Must run on a standard Linux installation and be able to handle a variety of standard UNIX targets with no special client/agent/daemon software deployed (other than sshd)

At that time I also needed it to be capable of handling interactive authentication (prompting for a password once, and automatically responding to ssh and sudo password prompts as necessary).

I wrote something in Python using os.fork(), os.execve(), Pexpect, and the signal handling module (to handle SIGCHLD events). It was a relatively ugly hack which we only used internally for the purpose at hand. Its output handling was crude. (Every child process simply wrote $(hostname).{out,err,ret} files into a specified target/job results directory).

However, it did the job, and none of the other tools I reviewed at the time met all of my requirements. (It's possible that some of them had the features, but documented them poorly enough, or made them inaccessible enough to a new user, that I missed them.)

classh is a re-implementation of that concept.


classh 'hostname;date' host1, host2, host3, ...

... will simply fire off up to 100 (the default) ssh subprocesses, each running the 'hostname' and 'date' commands on its respective host.

In this example, if more than 100 hosts were listed, then once 100 jobs were active classh would pause for a few tenths of a second, poll its pool of jobs for any that had completed, print their results, kill any jobs that had stalled (after 5 minutes by default), and replenish the job pool, repeating until all the jobs were completed.
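The launch/poll/kill/replenish cycle described above can be sketched roughly as follows. This is only an illustration of the idea, not classh's actual code: the function name run_pool, its parameters, and the shape of its return value are all hypothetical, and output capture is left out to keep it short.

```python
import subprocess
import time


def run_pool(commands, max_jobs=100, timeout=300, poll_interval=0.2):
    """Run every argv in `commands` (a dict of name -> argv list), keeping
    at most `max_jobs` running at once.

    Returns a dict of name -> (exit code, killed flag).  Hypothetical
    helper for illustration only; it does not capture job output.
    """
    pending = list(commands.items())
    active = {}     # name -> (Popen object, start time)
    results = {}
    while pending or active:
        # Replenish the pool up to the concurrency limit.
        while pending and len(active) < max_jobs:
            name, argv = pending.pop(0)
            active[name] = (subprocess.Popen(argv), time.time())
        # Pause briefly, then poll the pool.
        time.sleep(poll_interval)
        for name, (proc, started) in list(active.items()):
            if proc.poll() is not None:
                # The job finished on its own.
                results[name] = (proc.returncode, False)
                del active[name]
            elif timeout > 0 and time.time() - started > timeout:
                # The job stalled: kill it and record that we did so.
                proc.kill()
                proc.wait()
                results[name] = (proc.returncode, True)
                del active[name]
    return results
```

With a list of host names in hosts, something like run_pool(dict((h, ['ssh', h, 'hostname;date']) for h in hosts)) would approximate the example command; a timeout of 0 or less disables the kill step in this sketch.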

By default the output from each job would look like:

host2 0 (8.2)


host2.foo.xxx Fri Nov 20 03:39:32 PST 2009

host1 7 (2.6)

Error: date: not found



... and so on. (A number of alternative result displays are supported).

While classh defaults to printing results incrementally, it also captures the output, error messages, exit value, and start and end times of each job, to facilitate sorting, writing into separate files, etc.

When the --progress switch is used, classh prints progress to stderr as a stream of the following characters: .?~! (successful, remote error, ssh error, and killed/timeout respectively). In that case the other incremental output is skipped by default.
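As a rough illustration of how a job's outcome could map onto those four characters, here is a minimal sketch. The function names are hypothetical, and treating exit status 255 as "ssh error" is an assumption based on the ssh client's documented behaviour (it exits with the remote command's status, or 255 if ssh itself failed); how classh actually tells the cases apart is not described here.

```python
import sys


def progress_char(exitcode, killed=False):
    """Map one job outcome to a progress character (hypothetical helper)."""
    if killed:
        return '!'      # killed by classh after timing out
    if exitcode == 0:
        return '.'      # remote command succeeded
    if exitcode == 255:
        return '~'      # assumption: 255 means ssh itself failed
    return '?'          # remote command reported an error


def show_progress(exitcode, killed=False, stream=sys.stderr):
    """Write the character immediately so progress appears as jobs finish."""
    stream.write(progress_char(exitcode, killed))
    stream.flush()
```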

The --timeout option overrides the default job timeout value (300 seconds). (Note: extremely short timeout values, less than 10 seconds, can cause strange errors; 0 or any negative number disables timeout handling completely.)

For a more powerful example consider this:

classh -q -E ~/bin/remediate -S '~/bin/nextstage someargs ...' \
    'test -f /...' ./targets.txt

... which will quietly (-q) run the command "test -f /..." on every host listed in ./targets.txt and feed the name of each host that reports an error into a process running ~/bin/remediate, while feeding the names of all the successful hosts into another process which is running "~/bin/nextstage" with "someargs ..." as arguments.

In other words we can easily pipeline successful and exceptional results from one classh job into other processes (including, obviously, other classh commands).
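The pipelining idea can be sketched with the standard subprocess module: start the downstream command in a subshell and write host names to its standard input as results arrive. The helper names below are hypothetical, and the one-host-name-per-line format is an assumption, not something documented here.

```python
import subprocess


def open_sink(command):
    """Start `command` in a subshell with a pipe connected to its stdin."""
    return subprocess.Popen(command, shell=True, stdin=subprocess.PIPE,
                            universal_newlines=True)


def feed(sink, host):
    """Send one host name down the pipe (assumed format: one per line)."""
    sink.stdin.write(host + '\n')
    sink.stdin.flush()


def close_sink(sink):
    """Signal end of input and return the downstream exit status."""
    sink.stdin.close()
    return sink.wait()
```

A caller would open one sink for successes and one for errors, call feed() with each host name as its job completes, and close both sinks at the end.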

The -S and -E options perform a bit of magic, depending on what their argument looks like:

  • - means classh's own stdout (for normal shell pipeline handling)
  • A directory will be taken as a target for .{out,err,ret} files (.ret only in -E directories)
  • An executable (or any string containing a space and starting with an executable filename) will be executed in a subshell (as described above)
  • A regular/writable file will be opened for appending
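One way to picture that dispatch is a small classifier that inspects the argument. This is a sketch under assumptions rather than classh's real logic: the function name and return values are hypothetical, the order of the checks is a guess, and anything that matches none of the earlier cases is simply treated as a file to append to.

```python
import os


def classify_sink(spec):
    """Guess which kind of -S/-E target `spec` is (hypothetical helper)."""
    if spec == '-':
        return 'stdout'
    if os.path.isdir(os.path.expanduser(spec)):
        # Checked before the executable test, since directories
        # normally carry execute permission too.
        return 'directory'
    # A command may carry arguments, so test only its first word.
    words = spec.split()
    if words:
        program = os.path.expanduser(words[0])
        if os.path.isfile(program) and os.access(program, os.X_OK):
            return 'command'
    # Assumption: everything else is a file to open for appending.
    return 'append'
```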

Similarly, any of the trailing arguments that looks like a filename (contains a '/' character) will be treated as a file containing a list of host names or host patterns.

There is further magic in the hostname handling. Any argument that doesn't look like a filename and that does contain [...] expressions, such as foo[0-10] or bar[3,2,12-23,40-44]baz, will be expanded into a list like: foo0 foo1 ... foo10 or bar3baz bar2baz bar12baz ... bar44baz. By default the same sort of numeric range expansion will be performed on each entry in a file as it's processed.

The --noexpand option disables this expansion (in both arguments and file contents).
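The target handling described above can be sketched in a few lines. This is an illustration only: the function names are hypothetical, host files are assumed to hold whitespace-separated entries, zero-padding (foo[01-10]) is not preserved, and malformed ranges are not handled.

```python
import re

# One bracketed group of comma-separated numbers and low-high ranges.
_RANGE = re.compile(r'\[([0-9,-]+)\]')


def expand_pattern(pattern):
    """Expand the first [...] group, then recurse on what follows it."""
    match = _RANGE.search(pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    numbers = []
    for part in match.group(1).split(','):
        if '-' in part:
            low, high = part.split('-', 1)
            numbers.extend(range(int(low), int(high) + 1))
        else:
            numbers.append(int(part))
    # Ranges are inclusive and are kept in the order they were written.
    return [head + str(n) + rest
            for n in numbers
            for rest in expand_pattern(tail)]


def expand_targets(args, expand=True):
    """Turn trailing arguments into a host list; expand=False mimics --noexpand."""
    hosts = []
    for arg in args:
        if '/' in arg:
            # Looks like a filename: read entries from it (assumed format).
            with open(arg) as handle:
                entries = handle.read().split()
        else:
            entries = [arg]
        for entry in entries:
            hosts.extend(expand_pattern(entry) if expand else [entry])
    return hosts
```

For example, expand_pattern('bar[3,2,12-13]baz') yields bar3baz, bar2baz, bar12baz, and bar13baz, in that order.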


  • Runs configurable number of jobs in parallel
  • Tested on tens of thousands of targets per job
  • Records exit status, running time, output, and error messages separately
  • Supports timeouts (and records them)
  • Supports (optional) incremental results gathering/processing
  • Feeds hostnames from successful and/or exceptional jobs into their own files or processes
  • Flexible host pattern expansion (foo[1-20,31,32,40-100]bar.xxx)
  • Flexible options for saving output, errors, and exit values (including pickling all results for import)
  • Supports an interactive shell
  • Importable as a Python module: use to build more powerful scripts
  • Basic functionality in one file using only Python 2.4 std libs.

As a Module:

You can use the SSHJobMan class from classh in your own code. For example here's a simple program to test that the time reported on a list of hosts is consistent with the time on the localhost:

    #!/usr/bin/env python
    import sys
    from classh import SSHJobMan
    from time import time

    if __name__ == "__main__":

        # Run 'date +%s' on every host named on the command line.
        job = SSHJobMan(sys.argv[1:], 'date +%s')
        job.wait()

        for host, res in job.results.items():
            if res.exitcode:
                print "Error getting date from %s" % host
                print res.errors
                print
                continue
            try:
                rtime = int(res.output.strip())
            except ValueError, e:
                print "Couldn't parse output for %s" % host
                print res.output
                print
                continue
            # Nominal if the remote time is within a second of when the job
            # stopped, or falls between the job's start and stop times.
            if abs(rtime - res.stopped) < 1 or res.started < rtime < res.stopped:
                gap = res.stopped - res.started
                print "Time on %s is nominal (+/- %g)" % (host, gap)
            else:
                print "Time error on %s" % host

Note that only three lines (one import and the first two lines in the __main__ block) are necessary to execute the job. The rest is just processing the results.

Rhetorical Questions:

  • Why not multiprocessing module?
  • Why not Twisted/conch?

Links to Related Packages:

Honorable Mention: