Occam Worker

The OCCAM Worker is a dispatcher that is able to create and run VMs to carry out workloads generated by the OCCAM tools and web portal. It is generally responsible for managing, dispatching, scheduling, and running OCCAM objects.

Each worker node independently runs one task at a time, although a task may spawn multiple processes. A worker node may fail, and it will be respawned as long as another worker witnesses its death. Workers have only a local-machine view of file access and VM management, but they have global database access within an OCCAM cluster for pulling jobs to run and committing results.

Starting a Worker

An OCCAM Worker node may run side by side with other identical nodes on the same local machine. The number of concurrent tasks can never exceed the number of OCCAM Workers. At present, Workers do not automatically spawn additional Workers as demand increases. They must be started by an administrator (likely through a script).

Each node keeps its records as files: one file for the worker node itself, and one file per running job.

:::text
/occam-worker/workers/* # Worker recordkeeping
/occam-worker/running/* # Running Job recordkeeping
/occam-worker/jobs/*    # Job recordkeeping and all raw output

When a worker is started, it creates these directories if they don't exist, and places a file in the workers directory. Any jobs it spawns will place a marker in the running directory and also persist all output in the jobs directory.
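The startup behavior described above can be sketched in a few lines. This is only an illustration: the function name `start_worker`, the worker id, and the contents of the worker file are assumptions, and a temporary directory should stand in for `/occam-worker` when experimenting.

```python
import os
import tempfile
from pathlib import Path


def start_worker(root: Path, worker_id: str) -> Path:
    """Create the recordkeeping directories and register this worker.

    `root` plays the role of /occam-worker. The file contents written
    here (the process id) are an assumption made for illustration.
    """
    for name in ("workers", "running", "jobs"):
        # exist_ok keeps startup safe when another worker already
        # created the directory.
        (root / name).mkdir(parents=True, exist_ok=True)

    record = root / "workers" / worker_id
    record.write_text(f"pid={os.getpid()}\n")
    return record
```

Calling `start_worker(Path(tempfile.mkdtemp()) / "occam-worker", "worker-1")` creates the three directories and returns the path of the new worker file. Starting a second worker against the same root reuses the existing directories.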

Spawning a Job

When an OCCAM Worker is idle, it pulls a job from the available tasks. It chooses only jobs that have no unfinished dependencies and that it knows it can run locally. The job task list is a topological queue stored in the global database.
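A minimal sketch of that selection rule follows. The `Job` fields, the idea of modeling "can run locally" as a set of capabilities, and the in-memory list standing in for the database queue are all assumptions made for illustration, not the worker's actual data model.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Job:
    job_id: str
    # Hypothetical: capabilities the local machine must offer.
    requires: set[str] = field(default_factory=set)
    # Ids of jobs that must finish before this one may start.
    depends_on: set[str] = field(default_factory=set)


def pick_job(queue: list[Job], finished: set[str],
             capabilities: set[str]) -> Optional[Job]:
    """Return the first queued job that is ready and runnable here."""
    for job in queue:
        ready = job.depends_on <= finished       # no unfinished dependencies
        runnable = job.requires <= capabilities  # this machine can run it
        if ready and runnable:
            return job
    return None
```

With a queue holding a run job that depends on a build job, `pick_job` skips the run job until the build job's id appears in `finished`. It returns `None` when nothing in the queue is both ready and runnable on this machine.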

The job it picks is executed on the local machine. The worker creates a VM environment if one is needed. (It may cache some environments in the future, but at the moment there is no mechanism to do so.) Essentially, it starts with a base VM and attaches every object the job requires to the filesystem. It may also need to install missing libraries, which are specified by the object.

[Figure: occam-worker.png]

For example, if we have a workload that runs a simulator and a benchmark, the job would be written to run the simulator with that benchmark as an input. The simulator and the benchmark are both individual "objects" already known to the OCCAM system and already built. When these objects were built, they were persisted on the machine as VM disk volumes containing their binaries. To spawn a job to run them, the worker can start with the VM image containing the simulator and "mount" the VM storage for the benchmark within that virtual system. Afterward, it installs any remaining libraries that the benchmark needs to run. This way, we can create VMs for specific workloads in a repeatable and maintainable manner.
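Assuming Docker is the virtual machine manager, the mount step in this example could be expressed as a `docker run` command line. The sketch below only builds the argument list and does not execute it. The helper name and any image, volume, mount path, or command passed to it are hypothetical.

```python
def docker_run_command(image: str, volumes: dict[str, str],
                       command: list[str]) -> list[str]:
    """Build a `docker run` argument list that mounts each volume.

    `volumes` maps a volume name (or host path) to the path where it
    should appear inside the container.
    """
    argv = ["docker", "run", "--rm"]
    for volume, mount_point in volumes.items():
        argv += ["--volume", f"{volume}:{mount_point}"]
    return argv + [image] + command
```

For instance, passing a simulator image, a mapping from a benchmark volume to a mount path, and the simulator's command line yields a list that could be handed to `subprocess.run`. Keeping the command as a list avoids shell quoting issues.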

Once that VM image is constructed, it is dispatched to the virtual machine manager, which in our case is Docker. If the job also needs co-running jobs, those are built and dispatched in a similar way and executed alongside one another. These spawned processes do not have access to the global database; each runs in isolation and with reduced privileges. Each job that is executed drops a file named with its unique job id in the running directory. When the job completes, the dispatched process deletes that file and pings the worker that spawned it.

When the worker node sees that ping, it inspects the running directory for the marker files of the jobs it spawned. Once all spawned jobs have completed (that is, all of their marker files have been deleted), it resolves the job and marks it as complete.
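The marker-file protocol can be sketched as follows. The function names are hypothetical, the ping itself is left out, and the marker is assumed to be an empty file named after the job id.

```python
import tempfile
from pathlib import Path


def mark_running(running_dir: Path, job_id: str) -> Path:
    """Drop an empty marker file named after the job id."""
    marker = running_dir / job_id
    marker.touch()
    return marker


def mark_finished(running_dir: Path, job_id: str) -> None:
    """Delete the marker; the dispatched process would then ping the worker."""
    (running_dir / job_id).unlink()


def all_finished(running_dir: Path, job_ids: list[str]) -> bool:
    """True once none of the spawned jobs still has a marker file."""
    return not any((running_dir / job_id).exists() for job_id in job_ids)
```

A worker that spawned two co-running jobs would call `all_finished` after each ping and resolve the parent job only once it returns `True`.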

Install Job

Besides a generic run task, there are two other special types of jobs: install and build. An install job consists of running all tasks related to installing an object. An object has an object.json descriptor containing metadata about the code repositories, tar files, and other resources that have to be downloaded. Upon installation, the worker spawns a process that downloads these files into the objects path set by an administrator.

When OCCAM pulls in a code repository during the install task, the OCCAM instance can make that repository publicly accessible, so that other OCCAM nodes can download the same external objects from this instance. When an OCCAM node imports the object from this instance, it receives an updated object.json containing extra mirrors for the external files and code repositories. The set of mirrors will expand over time, allowing objects to exist solely within the OCCAM federated system.

Build Job

A build job is the second special type of job besides a generic run task. OCCAM workers can spawn tasks related to building VMs for objects. These tasks use the build metadata contained within an object.json to determine how to build a VM environment to contain that object. A build task reads the base image, boots that base image as a lightweight-VM container, and runs the build scripts specified in object.json. It also installs any libraries specified in object.json. When the task completes, it persists the VM image and also stores the object's files as a volume that can be mounted into other VMs.

For example, if we were building the Sniper simulator, we would have an install task that pulls down the Sniper source code. We would then have a build task that creates a file volume, mounts it into the base VM as a directory, installs the libraries Sniper needs to build and run, and builds the Sniper software package. It would persist the entire VM and keep the volume it created.

Job Failure

To be handled.

Node Failure

To be handled.
