
CSE 6230, Fall 2014: Lab 1, Tu Sep 2: Dynamic multithreading in Cilk Plus

In this lab, you will implement a dynamic multithreaded version of the Quicksort algorithm in the Cilk Plus programming model. You will use the class cluster, which is called Jinx.

The lab has two parts. The first part, which we will do in class, covers how to get the starting code for the assignment, compile it, run it, and submit it. Then there is an "offline" part that you will do at home, submitting it using the same procedure.

You may, if you wish, work in teams of two. To simplify our grading of your assignments, each person should submit his/her own assignment; however, all team members may submit identical code. Be sure to indicate with whom you worked by creating a README file as part of your submission. (See below for details.)

Part 0: Getting Started (about 30 minutes)

The head node of the Jinx cluster is: jinx-login.cc.gatech.edu. Use an ssh client with your GT login/password to log into Jinx now.

Some documentation regarding the cluster is available online:

http://support.cc.gatech.edu/facilities/instructional-labs/jinx-cluster

Please check the above documentation before reporting problems or issues.

Setting up your environment

Once you are logged in, you need to set up your environment to use the Cilk Plus compiler.

First, add the following two lines to the .bashrc file in your home directory. (Go ahead and create this file if it does not already exist.)

# Add these lines to ~/.bashrc, if not there already:
if [ -f /etc/bashrc ]; then
  source /etc/bashrc
fi

source /nethome/rvuduc3/local/jinx/setup-icc.sh

You may also need to add the following lines to .bash_profile, which should also be in your home directory. (Again, create this file if it does not already exist.)

  # Add these lines to ~/.bash_profile, if not there already:
  if [ -f ~/.bashrc ] ; then
    source ~/.bashrc
  fi

Once you've made these changes, log out (type exit) and log back in. If the changes took effect, then typing

  icpc --version

should display:

icpc (ICC) 12.1.3 20120212
Copyright (C) 1985-2012 Intel Corporation.  All rights reserved.

Get a Bitbucket account

This semester, we are trying an experiment: you will use an online code repository service, Bitbucket, for getting and submitting code.

The first step is for you to sign up for a Bitbucket account, if you haven't done so already. It is free to do so, and you can keep any code you maintain there private.

Fork test code

We've created a test repository on Bitbucket. First, log onto Bitbucket if you haven't done so already. Then, go to the following URL:

https://bitbucket.org/gtcse6230fa14/test-repo.git

You should see a screen that looks like the following.

Test repo site

This repository uses the git version control system. It contains a single file, called README, that you will edit and submit.

Click on the "Fork" button to create a copy of this repository within your Bitbucket account. You'll see the form below. In the "Name" box, please append a string consisting of a single hyphen followed by your Bitbucket ID. For example, Professor Vuduc's Bitbucket ID is rvuducbbtest1, so he would fill in the "Name" box as shown below.

Forking, Part 1: Name textbox

Note: In class, Professor Vuduc said to use two dashes instead of one, i.e., test-repo--rvuducbbtest1 rather than test-repo-rvuducbbtest1 as above. You can do either, but it turns out to be easier to stick with one dash. (If you already used two, that's fine, but you may need to adapt some of the instructions below to make things work.)

You can leave most of the other form values at their defaults. However, we strongly recommend that you check the "This is a private repository" option, so that your changes and code do not become world-readable. (You should probably also check "Allow only private forks.") See the figure below.

Forking, Part 2: Mark private

Once you are ready, click the Fork repository button and hope for the best!

Check out your test repository on Jinx

Now that you have your own copy of the sample repository, let's make a copy on the Jinx cluster and modify it there. To do that, go back to Jinx (log in again if needed) and run the following command, where you replace "MyBbLogin" with your Bitbucket login ID:

git clone https://MyBbLogin@bitbucket.org/MyBbLogin/test-repo-MyBbLogin.git

If it works, you will see some output similar to the following:

Initialized empty Git repository in /nethome/MyGTLogin/test-repo-MyBbLogin/.git/
Password: << you enter your password here >>
remote: Counting objects: 3, done.
remote: Compressing objects: 100% (2/2), done.
remote: Total 3 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (3/3), done.

There will be a new subdirectory called test-repo-MyBbLogin. If you change into it and list its contents, you should see a single file, called README.

git basics: edit, commit, and push

Open the README file, which is a form asking for some of your details: name, Bitbucket login, and email. Complete the form.

When you've done so, run the following command to verify your changes:

git diff

Once you are satisfied, you should commit your change to your repository: add each changed file that you want to keep, commit to your local copy, and then upload that change to the Bitbucket server:

# Add the changed file
git add README

# Commit -- saves locally only, with a message ("-m" option)
git commit -m "Filled out the form"

# Upload back to Bitbucket
git push

If you now go back to the Bitbucket site and refresh it, you should see an update.

Submit your test "assignment"

You need to submit this test assignment back to us, your instructors. To do so, you will transfer the repository to us.

To transfer the repo, start by clicking on "Settings".

Modify repo settings

Next, click on "Transfer settings". When prompted, enter "gtcse6230fa14" as the username. To finalize the transfer request, click on "Transfer repository".

Transfer repo

If you've submitted the transfer request correctly, you should see the following screen:

![Transfer complete](https://bytebucket.org/gtcse6230fa14/www/raw/2b22cb15a664f9d4369849c0ea0403dcc7f92114/figs/bb-test-repo-transfer-complete.jpg "The screen that appears once your transfer request is complete")

Note: Once you transfer and we, on our end, accept the transfer request, you will no longer have administrative access to your repository. So as a general rule, only do this step when you are ready to submit your assignments.

Checking out Lab 1 onto Jinx

Your task in this lab is to implement a multithreaded Quicksort using Cilk Plus.

Start by going to the Bitbucket URL for the Lab 1 repository:

https://bitbucket.org/gtcse6230fa14/lab1.git

Repeat the procedure from above to (a) fork a copy of the Lab 1 repository in your own Bitbucket account, and (b) check out this copy on Jinx. Don't forget to append your Bitbucket login to the name of your forked repository.

For the rest of this lab text, we will assume you are in the subdirectory containing the Lab 1 code.

Compiling the code

We have provided a small program, broken up into modules (separate C/C++ files and headers), that performs sorting sequentially. For this lab, you will make all of your changes to just one file as directed below.

We have also provided a Makefile for compiling your program. To use it, just run make. If the build succeeds, it will produce an executable program called qsort along with output that looks something like the following:

$  make    # <-- You type this command, and should see the following:
icpc  -O3 -g -o driver.o -c driver.cc
icpc  -O3 -g -o sort.o -c sort.cc
icpc  -O3 -g -o sequential-sort.o -c sequential-sort.cc
icpc  -O3 -g -o parallel-qsort.o -c parallel-qsort.cc
icpc -O3 -g -o qsort driver.o sort.o sequential-sort.o parallel-qsort.o

Run qsort on an array of size 100 as follows:

$  ./qsort 100   # <-- Command to run quicksort on an input of size 100

Timer: gettimeofday
Timer resolution: ~ 1 us (?)

N == 100

Sequential: 1.1e-05 seconds ==> 9.09091 million keys per second
    (Array is sorted.)
Parallel sort: 1.1e-05 seconds ==> 9.09091 million keys per second
    (Array is sorted.)
    (Arrays are equal.)
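
(The "million keys per second" figure is simply the input size divided by the measured time; for instance, 100 keys sorted in 1.1e-05 seconds gives 100 / 1.1e-05 ≈ 9.09 million keys per second.)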

Running jobs on the cluster

The Jinx cluster is a shared computer. It has a head node, named jinx-login, and 30 compute nodes, named jinx1, jinx2, ..., jinx30. In the instructions above, you were using the head node. You should limit your use of the head node to light tasks, such as file editing, compiling, and small test runs of, say, just a few seconds.

When you are ready to do a timing or "production" run, then you submit a job request to a central scheduler. The scheduler is called "Torque," which is an open-source version of a commercial package called "PBS." Torque coordinates, schedules, and runs all job requests.

To submit a job request, there are two steps:

  1. Create a batch job script, which tells Torque what machine resources you want and how to run your program.

  2. Submit this job script to Torque, using a command called qsub.

A batch job script is a shell script file containing two parts: (i) the commands needed to run your program; and (ii) metadata describing your job's resource needs, which appears as comments at the top of the script.

We have provided a sample job script, qsort.pbs, which runs the Quicksort program you just compiled on a relatively large input of 10 million elements:

#PBS -q class
#PBS -l nodes=1
#PBS -l walltime=00:01:00
#PBS -N qsort

# Changes to the directory we were in when we
# submitted the job:

cd $PBS_O_WORKDIR

# Runs a bunch of standard command-line
# utilities, just as an example:

echo "Script began:" `date`
echo "Node:" `hostname`
echo "Current directory: ${PWD}"

echo ""
echo "=== Running 5 trials of Quicksort on 10 million elements ... ==="
for trial in 1 2 3 4 5 ; do
  echo "*** Trial ${trial} ***"
  ./qsort 10000000
done

echo ""
echo "=== Done! ==="

# eof

Note the metadata, embedded as #PBS comments; Torque parses them to determine where and when to run your job. In this case, these options are as follows.

  • The -q option specifies the queue, which you should always set to class. (There are other queues for research jobs, which have lower priority, and for instructors' jobs, which have higher priority but require membership.)

  • The two -l lines describe the desired resources. Here, nodes=1 says we only want a single compute node; walltime=00:01:00 says we want 1 minute of wallclock running time. If your job exceeds this running time, the central scheduler will automatically cancel it; on the other hand, if you request a very large amount of time, the scheduler may delay running the job until the cluster load is low.

  • The -N option gives our job a short name. The Jinx cluster documentation page describes these and other options in more detail.

The rest of the script contains a list of commands, more or less as we would have typed them at the command prompt.

Once you have a job script, you can submit it to the Torque scheduler using the qsub command. The job will enter the central queue and eventually run, depending on its priority and the current load on the cluster. Go ahead and try this by typing the following two commands in immediate succession:

  qsub qsort.pbs
  qstat -a

The first command submits the job. It should also print the ID of your job, which you need if you want to, say, cancel the job later on. The second command, qstat, lists the contents of the central queue, so you can monitor the status of your job request.

For other useful notes and queue commands, such as qdel for canceling a job, see the Jinx cluster documentation.

When your job eventually runs, anything it writes to standard output or standard error (e.g., output produced by print statements) will go into output files with .o### and .e### suffixes, where ### is the job ID. Go ahead and inspect these outputs, and compare them to the commands in qsort.pbs to make sure you understand how job submission works.

Part 1 (in class, about 30 minutes): A first parallelization

Although we've given you a lot of code, to create a parallel Quicksort you just need to focus on editing parallel-qsort.cc. Right now, it just contains a "textbook" sequential version. Open the code and browse it, to make sure you at least understand the interfaces.

There are two parts you need to ultimately parallelize:

  1. the partition step; and
  2. the two recursive calls.

While you are here in class, leave the partition step (item 1 above) sequential and parallelize the recursive calls (item 2). Doing so is simply a matter of putting a _Cilk_spawn and _Cilk_sync in the right place. Item 1 is harder; it is what you will do after class (see below).

Actually, you don't even really need a _Cilk_sync. Why?
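
To make the shape of this edit concrete, here is a minimal sketch of what the parallelized recursion might look like. The function and variable names below are hypothetical (they are not necessarily the names used in parallel-qsort.cc), so adapt the idea to the interfaces you actually have:

// Sketch only: hypothetical names; adapt to the real code in parallel-qsort.cc.
long partitionSequential(int* A, long lo, long hi);  // stand-in for the existing sequential partition

static void quickSortParallel(int* A, long lo, long hi)
{
  if (lo >= hi) return;
  long mid = partitionSequential(A, lo, hi);       // partition step stays sequential for now
  _Cilk_spawn quickSortParallel(A, lo, mid - 1);   // child strand sorts the left piece
  quickSortParallel(A, mid + 1, hi);               // parent continues with the right piece
  _Cilk_sync;                                      // whether you actually need this here
                                                   //   is the question posed above
}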

After you change your code, don't forget to recompile (by running make) and submit a batch job to collect timing data. If you did it correctly, the '.e' file will not show any abnormal termination errors and you will observe performance on the order of 45+ million keys per second.

Note: In the original version of this assignment, we claimed adding _Cilk_spawn would yield 15-20 million keys per second. That figure was based on last year's version of the assignment; we made some changes to the baseline code that give a bigger speed boost, hence this updated figure.

Once you are satisfied, add, commit, and share your implementation in the same way you did for the test repo.

Part 2 (after class): Parallelize the partition step [Due: Tu Sep 9 before class]

Parallelize the partition step, which the partition() function implements. Your performance target to guarantee a grade of "A" on this assignment is 55 million keys per second.

Recall that this is a sufficient condition to earn an "A", but it is not a necessary one. We will also take into account whether you submitted a correct implementation and the amount of effort you put into it.

Note that unlike parallelizing the recursive calls, parallelizing the partition step will not be a simple matter of inserting _Cilk_spawn, _Cilk_sync, and _Cilk_for keywords. You will need to come up with a different approach. And to hit the stated performance target above, you will most likely need to minimize your use of auxiliary arrays.

One solution we devised does not use any auxiliary arrays at all; that is, it is a parallel algorithm for partition() that is completely in-place.
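
If you want a feel for what an in-place parallel pass over the data can look like, here is a small, deliberately incomplete sketch: it only counts, in parallel and without auxiliary storage, how many keys fall strictly below a pivot, using a Cilk Plus opadd reducer to avoid a data race. The function name and element type are hypothetical; this is one possible building block, not the partition algorithm itself:

#include <cilk/reducer_opadd.h>

// Sketch only: counts keys strictly below 'pivot' (hypothetical helper).
static long countBelowPivot(const int* A, long n, int pivot)
{
  cilk::reducer_opadd<long> numLess;   // race-free parallel counter (starts at 0)
  _Cilk_for (long i = 0; i < n; ++i)
    if (A[i] < pivot)
      numLess += 1;                    // each worker updates its own private view
  return numLess.get_value();          // merged total after the loop completes
}

A count like this tells you where the partition boundary will land; how you then move keys into their final positions in parallel is the part you need to design.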

Once you've created an implementation you are satisfied with, use the same add / commit / submission procedure as above. Your implementation is due before class on September 9.
