<!-- -*- markdown -*- -->
Gproc is a new system, written in Go, that combines features of the LANL version of bproc (http://sourceforge.net/projects/bproc/) and the xcpu software (xcpu.org). Unlike bproc, gproc requires no kernel patch. Like both of them, it provides a process startup mechanism for lightweight cluster nodes which have only a small ramdisk as the root file system, with no local disks or NFS root at all. Lightweight cluster nodes can be very easy to manage if configured correctly (see, e.g., http://portal.acm.org/citation.cfm?id=1132314).
Gproc currently is only used by us to run cluster processes as root. If this is a problem for you you might want to wait to use it; multi-user use is in the works (it is not hard but Go did not in the early days support some things we needed for it to work).
Gproc provides a system for running programs across clusters. A command and a set of nodes upon which to run it are specified at the command line; the binary, any required libraries, and any additional files specified at the command line are packaged up and sent out to the selected nodes for execution. The libraries need for the binary are determined by gproc programatically. The outputs are then forwarded back to the control node. Currently, we do not support stdin for the tree of processes as bproc did; it's coming but we have not needed it yet.
Gproc can be used from one architecture type to run commands on another. The library determination will work even on, e.g., OSX, for binaries to run on an ARM. One needs to have a reasonable copy of a root file system for the other architecture locally. For example, to run on an ARM CPU from my OSX machine, I just use the -r switch with the path to the root file system for the ARM tree on my OSX laptop.
Files that are transferred to nodes are stored in a replicated root under a certain directory of the node (default /tmp/xproc). This keeps gproc's operations from interfering with the rest of the node. For example, to run /bin/date, the binary would be copied to /tmp/xproc/bin/date and any necessary libraries would go to /tmp/xproc/lib/. The current working directory is also created on the nodes and the remote process starts in it. The mount is a private mount and disappears when the remote process and all of its children exit.
At its most basic level, gproc needs at least a single "master" node and a single "slave" node; for testing purposes, these can even be the same system. The master, started with the "gproc m" command (see below), accepts connections from slave nodes (started with "gproc s"). The user then runs a "gproc e <nodes> <program>" command on the master; this sends a command over a Unix Domain Socket to the listening "gproc m" process, which then instructs the appropriate slave nodes to execute the specified program.
Gproc does not require configuration files; for setting up one million virtual machines, configuration files are simply impractical. The structure of the cluster is defined in a locale, for example "kane" or "kf", defined in the files "kanecfg.go" and "kfcfg.go", respectively. We do, however, allow for an optional configure file in JSON format.
Some example locales:
"kf" describes our KANE cluster as a flat network; when working with many nodes, this will be slow.
"kane" describes KANE as a tree; every 20th node is set as a rack master. The front-end transfers files to the rack masters, which then serve the 19 nodes under them.
"etchosts" is a useful locale for getting started with a small, simple cluster. See the section "Getting started with gproc" for info on how to use it.
Getting started with gproc
A simple, generic locale called "etchosts" (see etchostscfg.go for the source) has been provided for use with a small cluster. The only requirement is that you have the names of all your nodes defined in /etc/hosts on every node in the cluster, and that slave nodes must end in a number. You should also have an entry for "master", the node on which you will be running the "gproc m" command. Here's an example /etc/hosts:
In this case, 192.168.0.254 would be the head node, from which commands should be launched. The systems n1 through n4 will be slave nodes, which will execute the actual commands. Assuming you have actually named your head node "master", here's some sample commands to get started:
# this will start the master
master$ sudo rm -f /tmp/g && ./gproc_linux_amd64 -locale=etchosts m
#(now open a new window)
# copy to every slave
master$ for i in n1 n2 n3 n4; do scp -o gproc_linux_amd64 root@$i:; done
# start gproc on slaves
master$ for i in n1 n2 n3 n4; do ssh root@$i ./gproc_linux_amd64 -locale=etchosts s &> /dev/null &; done
# now run a command
master$ ./gproc_linux_amd64 e . /bin/hostname
# If we want to copy a file to every node's /root/, for example, we could run this:
master$ ./gproc_linux_amd64 -f="/path/to/somefile" e . cp /tmp/xproc/path/to/somefile /root/
When you are finished with gproc, you should be able to simply kill the master process; this will cause the slave processes on the various nodes to exit automatically.
Please see the rest of this document for more details on using gproc.
Node specification syntax (BNF)
<nodes> ::= <nodeset> | <nodeset> "/" <nodeset> | <nodeset> "," <nodes>
<nodeset> ::= <node> | <node> "-" <node>
<node> ::= "." | <number>
1-80 # Specifies nodes 1 through 80, inclusive
. # Specifies all available nodes (flat network) or all first-level nodes (tree)
1/. # Specifies all second-level nodes under the first level-1 node
# In the "kane" config, this would select the first 20 nodes
./. # All nodes, all levels. Note that this example is 2 levels, but there is no limit on depth.
Command line options
There are several important command line options for use with gproc, the most important of which are a set of one-letter options which set the mode for gproc. The basic mode options for gproc are:
gproc [switches] m
gproc [switches] s
gproc [switches] e <nodes> <command>
gproc [switches] i
"gproc m" starts the master process and should be executed on the front-end node. "gproc s" starts the slave process and should be run on every node you wish to control. "gproc e" is used to actually run a command on the specified nodes. "gproc i" provides information about the first level of nodes (support for deeper levels will be added eventually).
There are a number of switches which can modify the behavior of gproc; some of the most important ones are described here. Some only make sense in certain modes; each switch's appropriate mode(s) can be found in parentheses after the description. The default value for the option is listed as well.
* -localbin=false # If set, programs will be run from each slave node's local directories, rather than copying binaries from the node where "gproc e" was executed. (e)
* -p=true # If set, binRoot is mounted privately during execution. This prevents unwanted binaries and other files from sticking around in binRoot. (s)
* -debug=0 # Specifies the debug level, from 0 to 10. (s, m, e)
* -f="" # Comma-separated list of files to copy to the slaves along with the program being executed. (e)
* -binRoot="/tmp/xproc" # The location under which the binaries, libraries, and other files will be placed. Use the same value for this when running the master, slaves, and exec modes or else gproc will get confused. (m, s, e)
* -defaultMasterUDS="/tmp/g" # The master process puts a Unix Domain Socket into the filesystem; the "exec" stage then connects to this socket to send commands. (m, e)
* -locale="local" # The locale to use. (m, s)
* -cmdport="6666" # Which port gproc will listen on for incoming commands. (m, s)
* -r="/" # where to find binaries. To use an arm root, for example, one can say -r=/path-to-arm-root
Example usage with KANE locale (static tree)
For the best control, you'll want 4 shells open to the front-end node, cesspool. In this example, we'll run gproc on the first 80 nodes of KANE.
The copytokane script and the startkane command expect you to be using gproc_linux_amd64. If you need to use a different architecture, you'll have to modify them to taste. The startkane program is a simple utility for launching gproc on the specified nodes (from -l [low] to -h [high], inclusive). The final argument, 1 or 2, specifies which "level" of the tree is being launched.
Copy the gproc binary out to the nodes you'll be using:
sh copytokane.sh 1 80
Run the master:
# This command creates a socket in /tmp/g. If you wish to re-run the master, remove /tmp/g before running
./gproc_linux_amd64 -locale=kane -debug=3 m
Start the level 1 nodes:
./startkane -l 1 -h 80 1
Start the second level nodes:
./startkane -l 1 -h 80 2
Run your command:
./gproc_linux_amd64 e ./. /bin/date
When you're done, Ctrl-C the master process; all the slaves should then automatically exit.
** Sample Use Case **
The Megawin project at Sandia boots up to 200 Windows virtual machines on a single Intel i7 node. These virtual machines load from a 700 MB disk image file, which must be copied to each node, along with a 114 MB QEMU "suspend to disk" image which helps boot the virtual machines more quickly. Gproc is used to transfer these files to all 520 KANE nodes and run the script that starts the virtual machines; the commands used are shown here.
First, gproc is started as above:
sh copytokane.sh 1 520
./gproc_linux_amd64 -locale=kane m
./startkane -l 1 -h 520 1
./startkane -l 1 -h 520 2
Then, the files are copied. The script we use to start the virtual machines expects the files to be in /tmp/ramdisk, so we use gproc to transfer them and copy to the appropriate directory:
# The -f flag specifies files that should be copied along with the command
./gproc_linux_amd64 -f="/scratch/megawin2.img,/scratch/migrate" e ./. /bin/cp /tmp/xproc/scratch/* /tmp/ramdisk
Although there are 520 nodes in the cluster, the 2-level tree means that the front-end node only directly transfers files to every 20th node (26 nodes total). Once the first-level nodes have received their copies of the files, they then copy the files to each of the 19 second-level nodes underneath them.
Once the files have been transferred, it is a simple matter to run the script which starts the virtual machines on every node:
./gproc_linux_amd64 e ./. /scratch/megawin_kvm_start.sh
The outputs from this script are forwarded back from every individual node to the front-end node for monitoring by the operator.
KF locale (flat network)
This locale is not great at large scales, but it is very simple. All nodes are at the same level.
sh copytokane 1 80
./gproc_linux_amd64 -locale=kf -debug=3 m
./startkf -l 1 -h 80
./gproc_linux_amd64 e . /bin/date