
Before you begin

cl-dash is open source software. Details are in the license file LICENSE.txt.

If you use the software or want to refer to it, please cite the following paper:

Hodor P, Chawla A, Clark A, Neal L (2015) cl-dash: rapid configuration and deployment of Hadoop clusters for bioinformatics research in the cloud. Bioinformatics, doi:10.1093/bioinformatics/btv553.

Getting started with cl-dash

cl-dash is a kit for rapid configuration and deployment of Hadoop clusters on Amazon Web Services (AWS). It was designed for the specific purpose of supporting a dynamic research environment in which new tools can be developed and benchmarked. Included are two sample MapReduce applications from the domain of bioinformatics.

1. Set up the AWS account

1.1. Open an AWS account

If you do not already have an AWS account, sign up at https://portal.aws.amazon.com/gp/aws/developer/registration/index.html.

1.2. Create a user

From the main AWS page select "Identity & Access Management" (IAM) under the "Administration & Security" section. Create a user, e.g. "cl-dash-user", by clicking "Users" then "Create New Users", and generate an access key for it. Download and save the credentials, e.g. as "credentials.csv". The file is comma delimited and contains 3 columns that look like this:

User Name      Access Key Id   Secret Access Key
cl-dash-user   AKIA...         rW3u...

Continue on the IAM page and create a new group, e.g. "cl-dash-group", by clicking "Groups" then "Create New Group". Attach the "AmazonEC2FullAccess" and "IAMReadOnlyAccess" policies. Add the "cl-dash-user" to this group by clicking "Group Actions" then "Add Users to Group".
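
If you prefer the command line, the same user and group setup can be sketched with the AWS CLI. This is only an illustrative alternative to the console steps above; it assumes the AWS CLI is installed and configured with credentials that have IAM administrative permissions, and it reuses the example names from above.

# create the user and an access key (the key corresponds to the contents of credentials.csv)
aws iam create-user --user-name cl-dash-user
aws iam create-access-key --user-name cl-dash-user

# create the group and attach the two managed policies
aws iam create-group --group-name cl-dash-group
aws iam attach-group-policy --group-name cl-dash-group --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess
aws iam attach-group-policy --group-name cl-dash-group --policy-arn arn:aws:iam::aws:policy/IAMReadOnlyAccess

# add the user to the group
aws iam add-user-to-group --group-name cl-dash-group --user-name cl-dash-user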

1.3. EC2 key pair

From the main AWS page click on the EC2 icon. Make sure you work in the US East (N. Virginia) region. Select "Key Pairs", then click "Create Key Pair". Choose a name for the key pair, e.g. "cl-key". Save the resulting file as "cl-key.pem". The contents of the file look like this:

-----BEGIN RSA PRIVATE KEY-----
MIIE...
...
-----END RSA PRIVATE KEY-----
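
Alternatively, the key pair can be created with the AWS CLI (illustrative only; assumes the CLI is configured for the US East region):

aws ec2 create-key-pair --key-name cl-key --query 'KeyMaterial' --output text > cl-key.pem
chmod 400 cl-key.pem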

1.4. Security groups

From the main AWS page click on the EC2 icon. Check that you are in the US East (N. Virginia) region. Select "Security Groups".

Create a security group for ssh login. Call it "ssh and ping". Add the following inbound rules:

Type              Protocol       Port Range  Source
SSH               TCP            22          0.0.0.0/0
Custom ICMP Rule  Echo Request   N/A         0.0.0.0/0
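
As a command-line alternative, the same group can be sketched with the AWS CLI (illustrative; it assumes a default VPC so that rules can be added by group name, and the last rule opens all ICMP types rather than only Echo Request):

aws ec2 create-security-group --group-name "ssh and ping" --description "ssh and ping for cl-dash"
aws ec2 authorize-security-group-ingress --group-name "ssh and ping" --protocol tcp --port 22 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name "ssh and ping" --protocol icmp --port -1 --cidr 0.0.0.0/0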

Create a security group for the Hadoop nodes. Call it "hadoop cluster". Add the following inbound rules:

Type             Protocol  Port Range     Source
SSH              TCP       22             0.0.0.0/0
Custom TCP Rule  TCP       2181           0.0.0.0/0
Custom TCP Rule  TCP       2888           0.0.0.0/0
Custom TCP Rule  TCP       3838           0.0.0.0/0
Custom TCP Rule  TCP       3888           0.0.0.0/0
Custom TCP Rule  TCP       8000           0.0.0.0/0
Custom TCP Rule  TCP       8030 - 8033    0.0.0.0/0
Custom TCP Rule  TCP       8042           0.0.0.0/0
Custom TCP Rule  TCP       8080           0.0.0.0/0
Custom TCP Rule  TCP       8088           0.0.0.0/0
Custom TCP Rule  TCP       9000           0.0.0.0/0
Custom TCP Rule  TCP       10020          0.0.0.0/0
Custom TCP Rule  TCP       11000 - 11001  0.0.0.0/0
Custom TCP Rule  TCP       11443          0.0.0.0/0
Custom TCP Rule  TCP       14000          0.0.0.0/0
Custom TCP Rule  TCP       19888          0.0.0.0/0
Custom TCP Rule  TCP       50000 - 50100  0.0.0.0/0
Custom TCP Rule  TCP       60000          0.0.0.0/0
Custom TCP Rule  TCP       60010          0.0.0.0/0
Custom TCP Rule  TCP       60020          0.0.0.0/0
All ICMP         All       N/A            0.0.0.0/0

Write down the group ID, e.g. "sg-b22ec4d5". Edit the group, adding the following rule, where the value of "Source" should be the group ID that you just recorded. You can add this rule by clicking "Actions", followed by "Edit inbound rules", then scrolling to the bottom of the rules and clicking "Add Rule".

Type      Protocol  Port Range  Source
All TCP   TCP       0 - 65535   sg-b22ec4d5
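
For reference, the equivalent AWS CLI calls might look like the sketch below (illustrative; it shows the group creation, one representative port rule from the table, and the self-referencing rule, with "sg-b22ec4d5" standing in for the group ID returned when the group is created):

# create the group and note the GroupId in the output
aws ec2 create-security-group --group-name "hadoop cluster" --description "hadoop cluster for cl-dash"

# one representative inbound rule; repeat for each port or port range in the table above
aws ec2 authorize-security-group-ingress --group-name "hadoop cluster" --protocol tcp --port 8030-8033 --cidr 0.0.0.0/0

# self-referencing rule allowing all TCP traffic between cluster nodes
aws ec2 authorize-security-group-ingress --group-id sg-b22ec4d5 --protocol tcp --port 0-65535 --source-group sg-b22ec4d5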

2. Prepare the "admin" server

The cl-dash command-line tools can be installed and run from any computer. For convenience, a preconfigured AMI is available on AWS. After adding your security credentials, it is ready to go.

2.1. Launch the "admin" EC2 instance

From the EC2 page (US East, N. Virginia) click "Launch Instance". From the "Community AMIs" tab search for "cl-dash admin" and click "Select". Launch a small instance, e.g. t2.micro, by selecting the checkbox for t2.micro and clicking through the "Configure Instance Details" steps. In the step where the security group is configured, choose the existing security group "ssh and ping", which was created above. Click "Launch", then, when asked for a key pair, choose "cl-key" created above.
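
The console steps correspond roughly to an AWS CLI call of the following form (illustrative; <cl-dash-admin-ami-id> is a placeholder for the AMI ID of the "cl-dash admin" image found under Community AMIs, and the security group is assumed to be in your default VPC):

aws ec2 run-instances --image-id <cl-dash-admin-ami-id> --instance-type t2.micro --key-name cl-key --security-groups "ssh and ping" --count 1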

2.2. Enter AWS credentials

Connect to the "admin" instance via ssh, with user name "ubuntu", and the identity file cl-key.pem from above. The public DNS of the isntance can be found on the AWS console. The ssh command may have the following form, where the strings in angle brackets need to be replaced with the correct values:

ssh -o StrictHostKeyChecking=no -i </path/to/>cl-key.pem ubuntu@<instance-public-dns>

Create file ~/.boto. Enter the following text, replacing the two key strings with your own key pair from file credentials.csv.

[Credentials]
aws_access_key_id = AKIA...
aws_secret_access_key = rW3u...

2.3. Copy key file to the admin server

Use scp to copy file cl-key.pem to the admin server.

scp -i </path/to/>cl-key.pem </path/to/>cl-key.pem ubuntu@<instance-public-dns>:.

On the admin server, restrict access to it:

chmod 400 cl-key.pem

3. Start a Hadoop cluster

3.1. Configuration file

Make a copy of the sample configuration file /usr/local/lib/python2.7/dist-packages/cl_dash/example.yml in your local directory and edit it as needed. The ID of the starter Amazon Machine Image is ami-af5d9ac4.
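
For example:

cp /usr/local/lib/python2.7/dist-packages/cl_dash/example.yml .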

3.2. Launch the cluster

The following command does a dry run of cluster creation.

cl-create -y example.yml

If there are no errors, launch the cluster with:

cl-create -y example.yml -a

Watch the progress of the launch process. If successful, the message "Cluster started" will be displayed. (In case of errors, any instances that were created by the script may have to be terminated from the AWS console. You can terminate instances by selecting the checkbox of each instance to be terminated, then clicking "Actions" -> "Instance State" -> "Terminate".)

3.3. Viewing information about the cluster

The command

cl-list

shows an entry for the cluster created above:

   1 "Example Cluster" - running

To view more details about the cluster, enter

cl-display "Example Cluster"

In particular, make note of the first element of the public_dns array. This is the DNS of the master node. For example, in the output below, the DNS of the master node is ec2-54-88-190-102.compute-1.amazonaws.com.

  ...
  slave instances: ['i-1a8894b3']
  public_dns: ['ec2-54-88-190-102.compute-1.amazonaws.com', 'ec2-52-7-138-114.compute-1.amazonaws.com']
  instanceinfo: {'slave1': '172.31.50.125', 'master': '172.31.53.220'}

4. Run the sample applications

4.1. Users "ubuntu" and "hduser"

There are two users that the Hadoop cluster knows about: (1) "ubuntu", whose purpose is to log into the cluster, and (2) "hduser", who has the proper environment set up to interact with the cluster and run jobs.

Log into the master node of the Hadoop cluster with ssh as user "ubuntu". The DNS name is the one noted in the previous section, and the identity file is cl-key.pem.
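
From the admin server, where cl-key.pem was copied into the home directory above, the command may look like this:

ssh -o StrictHostKeyChecking=no -i ~/cl-key.pem ubuntu@<master-public-dns>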

After logging in, switch users with the command:

sudo su - hduser

4.2. Unpack the demo archive

In the home directory of hduser there is an archive named demo.tgz. Unpack it with:

tar xzvf demo.tgz

4.3. Run the applications

The demo directory contains two subdirectories, one for each sample application. Each application can be run either single-threaded or as a MapReduce job.

4.3.1. Amino acid frequencies

Run the single-threaded version by executing the script ./run-standalone.sh. There are two MapReduce versions of the application, which are run by executing ./run-streaming-1.sh and ./run-streaming-2.sh. The scripts take a few minutes to run. Look at the output in files output-standalone.txt, output-streaming-1.txt, and output-streaming-2.txt.

4.3.2. GWAS

Because this application has a larger volume of input, there are additional steps for setting up and for collecting the results of the GWAS analysis. Running the programs can take on the order of an hour.

For the single-threaded version execute:

./setup-standalone.sh
./run-standalone.sh
./get-results-standalone.sh

The output association file is output-standalone/data.assoc.

Execute the MapReduce version with:

./setup-streaming.sh
./run-streaming.sh
./get-results-streaming.sh

The output association file is output-streaming/data.assoc.
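
To examine an association file elsewhere, it can be copied off the master node with scp, in the same way the key file was copied earlier. For example, from the admin server (assuming the file is readable by the "ubuntu" user; replace the placeholders with the actual path and DNS):

scp -i ~/cl-key.pem ubuntu@<master-public-dns>:</path/to/>output-streaming/data.assoc .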