This project aims to simplify the dynamic creation of an independent hadoop cluster inside a docker swarm cluster.

What is this repository for?

The project consists of two components so far:

  • A customizable Dockerfile for the hadoop image
  • A setup script for a hadoop cluster (hdc) based on a property file


The Dockerfile folder contains the Dockerfile that defines the hadoop image, along with configuration files for the different hadoop services. It also contains a build script that automates the distribution of your hadoop image in your docker swarm cluster and cleans up existing tags/images with that exact name and version. The parameters for this script are as follows:

  • -u: The user this image is attributed to. This is needed to create the image url for the registry. The full url will be registry:port/user/image-name:version
  • -n: The name of the image.
  • -v: The version of the image.
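
As a sketch, the resulting image url is assembled from the registry host and these three parameters (the variable names below are assumptions for illustration, not taken from the actual build_docker.sh):

```shell
# Illustration of the image url format; all values are examples only.
REPOSITORYHOST="my-docker-reg:5000"   # from the property file (see below)
HDC_USER="jdoe"                       # -u
HDC_NAME="hadoop"                     # -n
HDC_VERSION="2.7.1"                   # -v
IMAGE_URL="$REPOSITORYHOST/$HDC_USER/$HDC_NAME:$HDC_VERSION"
echo "$IMAGE_URL"    # my-docker-reg:5000/jdoe/hadoop:2.7.1
```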

This also comes with a property file that contains the following properties:

  • DOCKERHOST: The name/ip and port of your docker swarm daemon, e.g. my-swarm-daemon:2377. Must be specified.
  • REPOSITORYHOST: The name/ip and port of your (local) docker registry, e.g. my-docker-reg:5000. If not specified, the image will not be distributed in your cluster.
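
A minimal build property file based on these two settings might look like this (both values are examples):

```properties
DOCKERHOST=my-swarm-daemon:2377
REPOSITORYHOST=my-docker-reg:5000
```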

The hadoop image is based on the java:8-jre image, so it is running on debian. The image is used for both the supervisor and worker nodes of your hadoop cluster; the distinction is made when spawning a container from that image, by specifying "master" or "slave" as the command. The bootstrap.sh will then start the appropriate hadoop services. The container uses daemontools to manage the services and keep them running.
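
As an illustration, spawning the two container roles could look like the following (container and image names are placeholders, and network/resource options are omitted; the hdc script below handles this for you):

```shell
# Run one supervisor and one worker from the same image (names are examples):
# docker run -d --name my-master my-hadoop-image master
# docker run -d --name my-slave1 my-hadoop-image slave
```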

On the master container, the running services are

  • NameNode for HDFS
  • ResourceManager for YARN
  • JobHistoryServer for MapReduce
  • SOCKS proxy to allow access to the hadoop cluster

On any slave container, the running services are

  • DataNode for HDFS
  • NodeManager for YARN

The folder also comes with various hadoop configuration files which are used to build the image. Please make sure that the settings are appropriate to your needs. In particular, the available CPU and RAM resources specified in yarn-site.xml should match the resources reserved for the node's Docker container (see below). Otherwise they might go unused, or YARN might think that it has more resources than it really does.
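
For example, if a slave container is reserved 4 CPUs and 8 GB of RAM, yarn-site.xml should advertise matching resources. The property names below are the standard YARN settings; the values are illustrative:

```xml
<!-- yarn-site.xml: keep these in sync with _slave_cpus / _slave_memory -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>8192</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>4</value>
</property>
```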

Also note the variable "master.host". It is set to the name of the master container at runtime so the services can find each other. Note that for port binding reasons this value is adjusted automatically at the startup of any supervisor service (NameNode, ResourceManager, JobHistoryServer). The properties are adjusted in the same way for worker services (e.g. DataNode, NodeManager).

A SOCKS proxy is a single-port access point that relays connections to arbitrary hosts and ports, regardless of the application protocol. In this case, the SOCKS proxy can be used as a proxy for a web browser to access the hadoop cluster's web interfaces, or in an application to communicate with the IPC services of the hadoop cluster directly.

At this point only a single NameNode is started; no SecondaryNameNode is specified or started.

Hadoop Script

The hadoop script folder contains the command script hdc_cmd.sh and the property file hdc.properties. The script is used to set up and manipulate a hadoop cluster based on a property file. The script has the following commands:

  • create|add [<node_count>]: Creates a hadoop cluster from scratch or adds more nodes to it. If no node count is specified, only a master node will be set up, together with a private docker overlay network.
  • stop|restart [<container>...]: Stops/restarts all (specified) containers that match the hadoop cluster.
  • shutdown|recreate [<container>...]: Removes/re-adds all (specified) containers that match the hadoop cluster. The mounted-in data of the containers is preserved and a file is written to recreate the containers, e.g. to start them with different docker parameters.
  • clean [<container>...]: Removes all (specified) containers that match the hadoop cluster and cleans the permanent data. This is to clean up after one container or the whole hadoop cluster is no longer needed. This will free any resources occupied by the containers.
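
A typical lifecycle using these commands could look like the following (shown as comments because the commands require a running docker swarm; the node counts are examples):

```shell
# hdc_cmd.sh create 3        # master + 3 slaves + private overlay network
# hdc_cmd.sh add 2           # grow the cluster by 2 slaves
# hdc_cmd.sh stop            # stop all matching containers
# hdc_cmd.sh restart         # start them again, data preserved
# hdc_cmd.sh clean           # remove containers and permanent data
```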

It also supports the optional flags listed below (compare the corresponding properties described further down):

  • -o: The owner of the cluster
  • -i: The image to be used
  • -p: The property file to be used

The property file consists of the following properties to describe a hadoop cluster and can be either in hdc.properties or an arbitrary file specified with the -p prop-file flag:

  • DOCKERHOST: The name/ip and port of your docker swarm daemon. If not specified, the environment variable DOCKER_HOST might be used.
  • USER_OPTS: Your specific docker options that will be used for all containers, master and slaves alike.
  • _image: The image all containers will be started from. Can be overridden/specified by putting -i image as an option in front of the command.
  • _owner: The owner of the cluster. Will be used to generate the cluster name. Can be overridden/specified by putting -o owner as an option in front of the command.
  • _base_name: The base name for the cluster. All names will be derived from the owner and base name, e.g. "owner-base_name-master".
  • _docker_net: The network to be used for the cluster. Can be host or bridge or any overlay network (might be created on first setup). If not specified, will be set to "owner-base_name-net".
  • _mount_point: The mount point on the docker machine for the /data directory in the container. Should not be empty to make data from HDFS persistent.
  • _allow_on_master: Allow one slave to be started in the same container as the master. Defaults to false if not set.
  • _master_cpus: The number of CPUs the docker swarm should reserve for the master container.
  • _master_memory: The amount of memory the docker swarm should reserve for the master container.
  • _slave_cpus: The number of CPUs the docker swarm should reserve for any slave container.
  • _slave_memory: The amount of memory the docker swarm should reserve for any slave container.
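
To illustrate how the names fit together, a cluster with _owner "alice" and _base_name "hdc" would roughly produce the following names (only the "owner-base_name-master" pattern is documented above; the slave suffix is an assumption):

```shell
# Sketch of the naming scheme; values are examples only.
_owner="alice"
_base_name="hdc"
MASTER_NAME="${_owner}-${_base_name}-master"
NET_NAME="${_owner}-${_base_name}-net"       # default for _docker_net
SLAVE_NAME="${_owner}-${_base_name}-slave1"  # slave numbering assumed
echo "$MASTER_NAME"   # alice-hdc-master
```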

How do I get set up?

Cloning this repository

Clone this repository onto a machine where docker is installed and you have network access to the docker swarm cluster where you want to set up the hadoop cluster.

Configure property files

It is important that you configure the property files both for the docker build and the hadoop cluster.

  • Adjust the property file used by "Dockerfile_Hadoop/build_docker.sh" as described above, so the image can be built and, if a registry is configured, pushed/pulled.
  • Prepare the "Hadoop_Skripte/hdc.properties" or your own property file as described above, to specify the settings of your hadoop cluster.

Building the docker image

First you need to build the docker image. Before doing so, change the hadoop configuration files in the "Dockerfile_Hadoop" directory as you need, namely the "*-site.xml" files. Then call the build script with the desired parameters.

build_docker.sh -u user -n name -v version

Note that if the image version already exists, the old tag/image will be removed first. After building, the image will be pushed and then pulled to all docker nodes if a registry is specified.

Starting a Hadoop Cluster

Use the hdc.properties or your own property file to run the hdc_cmd script:

e.g. hdc_cmd.sh -o owner -p my.properties create 3

starts a hadoop cluster with 1 master container and 3 worker containers, using the properties defined in the file "my.properties". With "docker port <master-container>" you can now see which ports of your master container are tunneled to the outside world, each represented by a randomly assigned port of the docker host that the container is running on. The master container runs a SOCKS proxy on port 1234 that can be used as a proxy, e.g. for your browser, to access the web interfaces of HDFS (default 50070) and YARN (default 8088), both accessible through the master node. With the other commands explained above you can manipulate your cluster, extending or shrinking it to the size you need.

Using SSH tunnels to access cluster

If the docker swarm cluster is protected by a gateway, you can create an SSH tunnel to your local machine to access the hadoop cluster. Say your master container is running on "docker-node-2" in your docker swarm cluster, with the container's port 1234 (the SOCKS proxy) mapped to the node's port 34000, and the docker cluster is behind the gateway "gw-docker". You can then create an SSH tunnel on your local machine on port 60000 in the background with

ssh -M -S docker-socket -fNT -L 60000:docker-node-2:34000 gw-docker

This will create a socket file "docker-socket" which can be used to destroy the SSH tunnel with

ssh -S docker-socket -O exit gw-docker

While the SSH tunnel is up, you can use "localhost:60000" as the SOCKS proxy to your hadoop cluster.
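
With the tunnel in place, any SOCKS-capable tool can use that endpoint directly. For example, curl could fetch the web UIs through the proxy (the master host name is a placeholder and the commands are shown as comments since they need the tunnel and a running cluster):

```shell
# SOCKS proxy address, matching the tunnel example above:
PROXY="localhost:60000"
# curl --socks5-hostname "$PROXY" http://<master-container>:50070/  # HDFS web UI
# curl --socks5-hostname "$PROXY" http://<master-container>:8088/   # YARN web UI
```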

Who do I talk to?

If you have questions, bug reports, or the like, contact