go-docker / FeaturesDetails

Components

go-docker works with several components:

  • go-docker: schedules and executes jobs, and manages their life cycle
  • go-docker-web: web interface and REST API to submit new jobs and get job information
  • go-docker-cli: command-line interface, using the REST API of go-docker-web

Other libraries/tools have been developed to interconnect go-docker with other tools; you can find them all at https://bitbucket.org/osallou/ (go-docker-XXX). You will find, for example, FireWorks and Airflow modules for workflows, a DRMAA library, etc.

Features

End user features

Job submission

This is the basic and main feature: the user provides a shell script to be executed on a remote node. Additional parameters specify the requirements for the job, such as the number of CPUs, the quantity of memory, the Docker image, etc. When a node matching the requirements is available, go-docker schedules and submits the job.

Among other things, it is possible to mount some directories in the Docker container (the user's home directory, for example, or a shared directory containing common software or data).

At the end of the job, the user can get the job result, which contains metadata such as the job start time, end time, exit status, etc.

A per-job directory is also available in the container, where the user can write data. Those files can be viewed or downloaded from the web interface or the CLI tool.
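
For illustration, here is a minimal Python sketch of a submission through the go-docker-web REST API. The endpoint path, payload fields and authentication header are assumptions for this example, not the documented contract; refer to your go-docker-web REST API documentation for the real one.

```python
import requests

# Hypothetical job submission through the go-docker-web REST API.
# Endpoint, payload fields and auth header are illustrative assumptions.
GODOCKER_WEB = "https://godocker.example.org"

job = {
    "name": "my-analysis",
    "command": "#!/bin/bash\nblastp -query input.fa -db nr > result.out\n",
    "cpu": 2,                   # number of CPUs requested
    "ram": 4,                   # memory in GB
    "image": "debian:stable",   # Docker image to run the script in
    "volumes": ["home"],        # mount the user home directory
}

resp = requests.post(
    f"{GODOCKER_WEB}/api/1.0/task",
    json=job,
    headers={"Authorization": "Bearer <api-key>"},
)
resp.raise_for_status()
print("submitted job id:", resp.json().get("id"))
```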

Interactive sessions

An interactive session is like a basic job but, instead of executing a shell script, it starts an SSH server. Once the job is running, the user can connect from their computer to the container using SSH, with the SSH key declared in their settings. Multiple sessions can be opened at the same time.

This is useful for debugging, manually executing a command, or accessing graphical software.

Users must not forget to kill their session when they no longer need it.

Job kill

The user can request a pending or running job to be killed. The software kills the task as soon as possible.

Suspend/Resume

Depending on the execution system (Swarm), a job can be suspended and resumed later on.

Rescheduling

The user can ask for a job to be rescheduled. The job is killed, then put back in the pending state. It will be scheduled later on like a new job.

Job replay

In the list of terminated jobs, one can click on the "play" icon to create a new job using the terminated job as a template. A "new job" window opens, pre-filled with the terminated job's information.

Extending job lifespan

Jobs are executed with a maximum lifespan. After this period, the job is killed. The user can dynamically extend this lifespan (up to a configured maximum) during the job execution if needed.

Job information

The user can view the current job list and details, or query past jobs. Jobs in the "archived" state no longer have files in the per-job directory, only job metadata. Users should take care to download or copy the generated files if they need them. Jobs are archived at a regular interval (set by configuration).

Job dependencies

It is possible to specify that a job depends on one or more other jobs (parent jobs). In this case, the job will not be scheduled before the end of the parent jobs.

Job arrays

Job arrays are a way to submit one job N times. In this case, a parent job (doing nothing) and N child jobs are created. All child jobs execute the same command, but additional environment variables are available, such as the child job identifier (unique per child). The script can, for example, read the file named myfile_${GODOCKER_TASK_ID}.txt, as the sketch below shows.
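
For example, a child job script can pick its own input file from the environment. This is a minimal Python sketch; the variable name follows the example above.

```python
import os

# Inside a job-array child: each child gets a unique identifier through an
# environment variable and uses it to select its own slice of the input data.
task_id = os.environ["GODOCKER_TASK_ID"]

with open(f"myfile_{task_id}.txt") as handle:
    data = handle.read()

print(f"child {task_id} processing {len(data)} bytes")
```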

Batch submission

The command-line tool can submit multiple jobs in batch. This can be done for two reasons:

  • avoid the rate limit configured in go-docker (no more than X jobs running at the same time for a user)
  • you have to submit a very large number of jobs, which would put too much pressure on the system

With the batch command, you can add tasks to a batch list, then play the batch list with at most X jobs at the same time. The tool smooths the execution of the jobs, and you can get a status of the play. It is also possible to suspend the execution and resume it later on.
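
Here is a minimal sketch of this "play with at most X concurrent jobs" behaviour; submit_job and count_active are hypothetical helpers standing in for calls to the go-docker REST API.

```python
import time

MAX_CONCURRENT = 5  # at most X jobs submitted at the same time

def play(batch, submit_job, count_active, poll_seconds=30):
    """Drain the batch list, topping up submissions as jobs finish."""
    queue = list(batch)
    while queue:
        while queue and count_active() < MAX_CONCURRENT:
            submit_job(queue.pop(0))
        time.sleep(poll_seconds)  # wait before checking again
```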

Administrator features

Integration

To execute the jobs, go-docker can use Docker Swarm or Apache Mesos. Other tools can be added via the plugin mechanism.

Authentication

The authentication plugin handles user authentication as well as the related ACLs. An LDAP plugin is available but must be customized to your needs. The local plugin tries to map system users, but users must first be created manually in the database to get a password (not the system password), using the script available in the seed directory.

Basically, the plugin pulls user information (email, etc.) and decides, on job submission, whether a user can mount a given directory in the container (with read-only or read-write access).
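
For illustration, here is a hypothetical sketch of this kind of mount decision; the function name and the returned structure are invented for the example, not the actual plugin API.

```python
# Hypothetical decision an authentication plugin makes: given a user and a
# requested volume, return the mount with an access mode, or refuse it.
def get_volume(user, volume_name):
    if volume_name == "home":
        # users may write in their own home directory
        return {"name": "home", "path": user["homeDirectory"], "acl": "rw"}
    if volume_name == "db":
        # a shared data directory is exposed read-only
        return {"name": "db", "path": "/shared/db", "acl": "ro"}
    return None  # unknown volume: refuse the mount
```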

Projects

Projects are user groups. Users are manually added to projects. Projects can have priorities and quotas (see below).

Scheduler

The scheduler takes all pending jobs and reorders them. FIFO (First In, First Out) and fair-share (equal dispatch) implementations are available. Other algorithms can be added with the plugin mechanism. The scheduler also asks watchers (see below) whether a job can be started.

The fair-share algorithm tries to reorder the jobs according to the previous usage of the user/project. Let's suppose we have 100 slots. If user A sends 100 jobs, then user B sends 1 job, with FIFO, B has to wait for the end of all of A's jobs. With the fair-share policy, B's job is inserted among A's jobs. The policy algorithm takes into account the time, CPU and RAM used over a previously defined period. It then applies a weight (defined in configuration) to each factor to calculate a score, and jobs are ordered according to this score. Users and projects can also have priorities: a higher priority makes the job gain some ranks in the ordering.
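
As a toy illustration of this weighted scoring, here is a Python sketch; the weight values and usage fields are invented for the example, while the real weights come from the go-docker configuration.

```python
# Toy fair-share ordering: each pending job gets a score from its owner's
# recent usage, weighted per resource. Weights are illustrative assumptions.
WEIGHT_TIME, WEIGHT_CPU, WEIGHT_RAM = 0.1, 1.0, 0.5

def fair_share_score(usage, priority=0):
    """Lower score = scheduled earlier; priority gains some ranks."""
    return (WEIGHT_TIME * usage["time"]
            + WEIGHT_CPU * usage["cpu"]
            + WEIGHT_RAM * usage["ram"]) - priority

# User A consumed a lot in the reference period, user B almost nothing,
# so B's single job is ordered before A's remaining jobs.
usage = {"A": {"time": 3600, "cpu": 100, "ram": 200},
         "B": {"time": 0, "cpu": 1, "ram": 2}}
pending = [("A", i) for i in range(100)] + [("B", 0)]
pending.sort(key=lambda job: fair_share_score(usage[job[0]]))
print(pending[0])  # ('B', 0)
```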

Once jobs are ordered, the scheduler checks quotas and rejects a job if its quota is reached. Finally, the scheduler asks the execution system to execute the job on a node. To do so, the executor implements an algorithm (different for Swarm and Mesos) to submit the job to an available node, according to the specified requirements (cpu, ...).

Executor

The executors are in charge of managing the life cycle of jobs once they are running. Multiple executors can run at the same time (on one or multiple servers) to speed up the analysis of all jobs. Each executor checks a running job (is it still running? is a kill requested? ...) and calls watchers (see below). When a job is over, the executor updates the database with the job exit status and other information.

Watchers

Watchers are optional plugins defined in the configuration. It is possible to develop new watchers to add new control features on jobs.

The lifespan feature, for example, is a watcher plugin.

Watchers are triggered to decide whether a job can be started and, once it is running, whether it should be killed.
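
As an illustration, here is a hypothetical watcher sketch mirroring the lifespan feature mentioned above; the class and method names are assumptions, and the actual go-docker plugin interface may differ.

```python
import time

# Hypothetical watcher: request a kill once a job exceeds its lifespan.
class MaxLifespanWatcher:
    def __init__(self, max_seconds):
        self.max_seconds = max_seconds

    def can_run(self, task):
        # Called by the scheduler before starting a job.
        return True

    def watch(self, task):
        # Called by the executor on each check of a running job.
        elapsed = time.time() - task["start_time"]
        if elapsed > self.max_seconds:
            task["kill_requested"] = True
        return task
```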

Max lifespan

The administrator can configure the default maximum duration of a job, as well as a hard limit (the user cannot extend the duration above this limit). This prevents jobs from running forever.

Registries

The administrator can decide to limit the container images to a default list, or allow users to specify the container image they need. In the latter case, the administrator should consider the security implications. It is also possible to allow users to specify the image they want, but through a private registry; in this case, all requested images will be pulled from this registry only.

Quotas

It is possible to define user and/or group quotas. Those quotas are the cumulative usage, over a defined period, of job time, CPUs and RAM.

An additional disk quota is defined: the maximum disk space used by a user's per-job directories. When the limit is reached, the next scheduled jobs are rejected with a "quota exceeded" reason.
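
As a sketch of the kind of check the scheduler performs before accepting a job (the resource and field names are illustrative assumptions):

```python
# Reject a job when any cumulative quota for the configured period is reached.
def check_quota(usage, quota):
    """Return (accepted, reason) for time, cpu, ram and disk usage."""
    for resource in ("time", "cpu", "ram", "disk"):
        if usage[resource] >= quota[resource]:
            return False, f"quota exceeded: {resource}"
    return True, None

ok, reason = check_quota(
    usage={"time": 7200, "cpu": 10, "ram": 20, "disk": 55},
    quota={"time": 100000, "cpu": 500, "ram": 1000, "disk": 50},
)
print(ok, reason)  # -> False quota exceeded: disk
```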

Rate limiting

This option limits the maximum number of simultaneous pending and running jobs per user. When the limit is reached, subsequent jobs are rejected.

User root access

It is possible to let users execute a job or access a container with root rights inside the container. This option should be carefully considered, as there are always Docker container security implications. Only set this option if you fully trust your users (your company's developers, ...).

Cleaner

The cleaner job should be executed at a regular interval (via cron, for example) to clean old jobs. Jobs move to the "archived" status, and the per-job directory is deleted to free disk space.
