Allow tools and deployers to specify optional Docker-based dependency resolution.

#401 Merged at 646b587
  1. John Chilton

Testing it out:

  • Install docker (tough, but getting easier).
  • Copy test/functional/tools/catDocker.xml to somewhere in tools/ and add to tool_conf.xml.
  • Add <param id="docker_enabled">true</param> to your favorite job destination.
  • Run the tool.
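For orientation, here is a rough sketch of what a destination carrying that parameter might look like, with a small stdlib check that the flag is set. The destination id "docker_local" and the runner name are placeholders I made up for illustration; only the docker_enabled param comes from the steps above, and the parsing code is not Galaxy's own loader.

```python
import xml.etree.ElementTree as ET

# Hypothetical destination: the id and runner are placeholders; only the
# docker_enabled param is taken from the testing steps.
DESTINATION_XML = """
<destination id="docker_local" runner="local">
    <param id="docker_enabled">true</param>
</destination>
"""


def docker_enabled(destination_xml):
    """Return True if the destination sets docker_enabled to true."""
    destination = ET.fromstring(destination_xml.strip())
    for param in destination.findall("param"):
        if param.get("id") == "docker_enabled":
            return (param.text or "").strip().lower() == "true"
    return False


print(docker_enabled(DESTINATION_XML))
```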

Description and Configuration:

Works with all stock job runners including remote jobs with the LWR.

Supports file system isolation: the deployer determines which paths are exposed to the container and can optionally make them read-only. Defaults are provided that attempt to guess what should be read-only and what should be writable based on Galaxy's configuration and the job destination; these defaults can be overridden or extended. Lots of details and discussion are in job_conf.xml.sample_advanced.
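To make the read-only versus writable distinction concrete, the sketch below turns a list of paths into the -v arguments that docker run accepts (host:container:ro or host:container:rw). The helper name and the example paths are assumptions for illustration; this is not the code in the pull request, and the real defaults are described in job_conf.xml.sample_advanced.

```python
def volume_flags(volumes):
    """Turn (path, mode) pairs into `docker run` -v arguments.

    mode is "ro" for read-only or "rw" for writable; each path is mounted
    at the same location inside the container.
    """
    flags = []
    for path, mode in volumes:
        if mode not in ("ro", "rw"):
            raise ValueError("unknown volume mode: %r" % (mode,))
        flags.extend(["-v", "%s:%s:%s" % (path, path, mode)])
    return flags


# Hypothetical paths: the Galaxy root is exposed read-only, while the job's
# working directory is writable.
example = volume_flags([
    ("/srv/galaxy", "ro"),
    ("/srv/galaxy/database/job_working_directory/42", "rw"),
])
print(" ".join(example))
```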

$GALAXY_SLOTS (however it is configured for the given runner) is passed into the container at runtime and will be available.
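From the tool's side, a script running inside the container could read the value like this. This is only a sketch: the helper name is made up, and falling back to a single slot when the variable is missing or malformed is my assumption, not behavior defined by this change.

```python
import os


def galaxy_slots(environ=None, default=1):
    """Return the slot count a tool script should use for threads or processes."""
    environ = os.environ if environ is None else environ
    raw = environ.get("GALAXY_SLOTS", "")
    try:
        slots = int(raw)
    except ValueError:
        # Unset or non-numeric value: fall back to the assumed default.
        return default
    return slots if slots > 0 else default


print(galaxy_slots())
```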

Tools are allowed to explicitly annotate which container should be used to run the tool. I also added hooks to allow a more expansive approach in which containers could be linked to requirements and resolved that way. To be clear, the mapping process itself isn't implemented at all, but there is a class, ContainerRegistry, that is instantiated, passed the list of requirements, and given the chance to return a list of potential containers. That is how one could implement this if it becomes a priority.
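As a purely hypothetical illustration of how such a mapping could work, the sketch below resolves requirements to image ids through a lookup table. The class name, the find_containers method, the (name, version) tuples, and the image ids are all invented here; they are not the interface of the ContainerRegistry class in this pull request.

```python
class MappingContainerRegistry(object):
    """Hypothetical registry mapping (name, version) requirements to image ids."""

    def __init__(self, mapping):
        self.mapping = dict(mapping)

    def find_containers(self, requirements):
        """Return candidate image ids in requirement order, without duplicates."""
        containers = []
        for requirement in requirements:
            image = self.mapping.get(requirement)
            if image is not None and image not in containers:
                containers.append(image)
        return containers


# Made-up requirement and image names, for illustration only.
registry = MappingContainerRegistry({
    ("samtools", "0.1.19"): "example/samtools:0.1.19",
})
print(registry.find_containers([("samtools", "0.1.19"), ("bwa", "0.7.5")]))
```

Returning a list rather than a single image leaves the final choice to the caller, which matches the "list of potential containers" wording above.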

From a reproducibility standpoint it makes sense for tool authors to have control over which container is selected, but there is a security and isolation aspect to these enhancements as well. So there are some more advanced options that allow deployers (instead of tool authors) to decide which containers are selected for jobs. docker_default_container_id can be added to a destination to cause that container to be used for all un-mapped tools, which will result in every job on that destination being run in a docker container. If the deployer does not even trust those tools annotated with image ids, they can go a step further and set docker_container_id_override instead. This will likewise cause all jobs to run in a container, but the tool details themselves will be ignored and EVERY tool will use the specified container.
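The precedence described above can be summarized in a few lines. This is a sketch of the selection order only, not the actual implementation; the function name is made up, and treating docker_enabled as a prerequisite for the other two options is my assumption.

```python
def select_container(destination_params, tool_container_id=None):
    """Pick the image id for a job, or return None to run without Docker."""
    enabled = str(destination_params.get("docker_enabled", "false")).lower()
    if enabled != "true":
        return None
    # The deployer override wins over anything the tool declares.
    override = destination_params.get("docker_container_id_override")
    if override:
        return override
    # Otherwise honor the container the tool author annotated, if any.
    if tool_container_id:
        return tool_container_id
    # Un-mapped tools fall back to the destination default (may be None).
    return destination_params.get("docker_default_container_id")


params = {"docker_enabled": "true", "docker_default_container_id": "busybox"}
print(select_container(params, tool_container_id=None))
```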

Additional advanced docker options are available to control memory, enable network access (disabled by default), specify where docker is found, control if and how sudo is used, and so on. These are all documented in job_conf.xml.sample_advanced.

Implementation Details:

Metadata is set outside the container, so the container itself only needs to supply the underlying application and doesn't need to be configured with Galaxy, for instance. Likewise, traditional tool_dependency_dir based dependency resolution is disabled when a job is run in a container; for now it is assumed the container will supply these dependencies.

What's Next:

If the implementation is merged, much is left to be discussed and worked through: how to fetch images and control which ones are fetched (right now the code just assumes that if you have docker enabled, all referenced images are available), where to fetch images from, tool shed integration (host a tool shed docker repository? services to build docker images preconfigured with tool shed dependencies?), and so on. This is meant as more of a foundation for the dependency resolution and job runner portions of this.
