
CSE 6230, Fall 2014: Lab 3, Thursday, September 18: CUDA

Note: This lab is due Friday, September 26 at 4:35pm EDT.

In this lab, you will implement list ranking in both Cilk Plus and in CUDA.

You may if you wish work in teams of two. To simplify our grading of your assignments, each person should submit his/her own assignment; however, all team members may submit identical code. Be sure to indicate with whom you worked by creating a README file as part of your submission. (See below for details.)

Lastly, note that we are using two compilers this time: Intel's icpc for Cilk Plus code and NVIDIA's nvcc for CUDA code. You may need to add the following line to your ~/.bashrc file:

export LD_LIBRARY_PATH=/opt/cuda-4.2/cuda/lib64:$LD_LIBRARY_PATH

Part 0: Get the assignment code

Use the same fork-checkout procedure from Lab 1. The repo you want is gtcse6230fa14/lab3. As a reminder, the basic steps to get started are:

  1. Log into your Bitbucket account.

  2. Fork the code for this week's lab into your account. The URL is: https://bitbucket.org/gtcse6230fa14/lab3.git. Be sure to rename your repo, appending your Bitbucket ID. Also mark your repo as "Private" if you do not want the world to see your commits.

  3. Check out your forked repo on Jinx. Assuming your Bitbucket login is MyBbLogin and assuming that you gave your forked repo the same name (lab3), you would on Jinx use the command:

git clone https://MyBbLogin@bitbucket.org/MyBbLogin/lab3--MyBbLogin.git

Alternatively, if you figured out how to do password-less checkouts using ssh keys, you might use the alternative checkout style, git clone git@bitbucket.org:MyBbLogin/lab3--MyBbLogin.git.

If it worked, you'll have a lab3 subdirectory that you can start editing.

The code is a skeleton for you to implement the parallel list ranking algorithm discussed in an earlier class (see slide 33 and beyond). As discussed in class, it assumes an array-pool representation of the linked list: the list is stored using one array to hold the node values and another array to hold the next-node indexes, rather than the traditional pointer-based node structure typical of C or Java.
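Concretely, such an array-pool list might be declared as follows (a hypothetical sketch; the field names and exact layout in the skeleton may differ):

```cpp
#include <vector>

// Hypothetical array-pool list: node i holds Value[i], and Next[i] is the
// index of the node that follows it in the list; NIL marks the tail.
const int NIL = -1;

struct ArrayPoolList {
  std::vector<int> Value; // per-node payload
  std::vector<int> Next;  // Next[i] == index of i's successor, or NIL at the tail
};

// Sequential traversal from a given head index, summing the values along
// the way. Note that logical list order need not match array order.
int traverseAndSum(const ArrayPoolList& L, int head) {
  int sum = 0;
  for (int i = head; i != NIL; i = L.Next[i])
    sum += L.Value[i];
  return sum;
}
```

For example, the logical list 2 -> 0 -> 3 -> 1 would be stored as Next = {3, NIL, 0, 1}, with node 2 as the head.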

The essential contents of the code distribution are as follows:

  • driver.cc, the main driver program: The driver is designed to test and time both a sequential implementation, which we've provided, as well as a parallel implementation, which you will provide.
  • listrank.hh and listrank.cc: These files define an interface and implementation of a sequential list ranker.
  • listrank-par.hh: This file defines an abstract interface for a parallel list ranker. We explain this in more detail below.
  • listrank-cilk.cc: This is a Cilk Plus skeleton implementing the parallel list ranker interface. It is incomplete; you will finish it in Part 1.
  • listrank-cuda.cu: This is a CUDA skeleton implementing the parallel list ranker interface. It is also incomplete; you will finish it in Part 2.

The remaining files provide additional supporting code.

The most important interface to understand is listrank-par.hh. It names an abstract data structure, ParRankedList_t, which holds the parallel implementation's initial and computed list ranking results. It also divides the list ranking operation into three separate steps:

  • setupRanks__par(): This creates a ParRankedList_t object, given a linked list in the array pool representation.
  • computeListRanks__par(): This performs the list ranking, presumably in parallel.
  • getRanks__par(): After computing the list ranks, a caller may use this routine to read the computed rank values.
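The core of computeListRanks__par is the pointer-jumping idea from lecture. As a plain sequential C++ sketch of that idea over the array-pool representation (the names, rank convention, and synchronous-update scheme here are illustrative, not the skeleton's actual code):

```cpp
#include <vector>

// Pointer-jumping sketch: each node starts with rank 1 (0 at the tail) and
// repeatedly adds its successor's rank while "jumping" its next pointer two
// hops ahead. After O(log n) rounds all next pointers are NIL and rank[i]
// holds the distance from node i to the end of the list.
std::vector<int> listRanksByPointerJumping(std::vector<int> next) {
  const int NIL = -1;
  const int n = (int)next.size();
  std::vector<int> rank(n);
  for (int i = 0; i < n; ++i)
    rank[i] = (next[i] == NIL) ? 0 : 1;

  bool changed = true;
  while (changed) {
    changed = false;
    // A parallel version updates all nodes simultaneously each round;
    // we emulate that here with double buffering.
    std::vector<int> newRank = rank, newNext = next;
    for (int i = 0; i < n; ++i) { // this is the loop a parallel backend would parallelize
      if (next[i] != NIL) {
        newRank[i] = rank[i] + rank[next[i]];
        newNext[i] = next[next[i]];
        changed = true;
      }
    }
    rank.swap(newRank);
    next.swap(newNext);
  }
  return rank;
}
```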

The overall interface is a little more complicated than in the sequential case, for two reasons. First, the abstract data type ParRankedList_t lets different backend implementations manage internal state without exposing it to the user. For Cilk Plus this is not really needed, but for CUDA this interface hides the separate device allocation and host-device copying needed to use the GPU. Second, when you implement and time the GPU implementation, the interface lets us measure the CPU-GPU data transfer time separately from the computation time. Thus, we can see the extent to which the GPU speeds up the computation and how that compares to the overhead of copying.

The driver program only assumes this interface. The Makefile we've provided builds separate programs for different parallel backend implementations, one for Cilk Plus and another for CUDA.

Part 1: Implement a parallel list ranker using Cilk Plus

The relevant contents of the lab3 directory are as follows:

  1. The Makefile is set up to build both a Cilk Plus program and a CUDA program. You can use make to compile both, or make listrank-cilk to compile just the Cilk Plus code and make listrank-cuda for just the CUDA code.
  2. The file in which you should create your Cilk Plus-based implementation is called listrank-cilk.cc.
  3. To build the driver that includes your implementation, run make listrank-cilk, which on success produces a binary executable called listrank-cilk.
  4. To run the executable, you must supply two parameters: the list length (i.e., the number of linked-list nodes) and the number of timing trials (e.g., you might use a small number for testing runs and a larger number, say 10, for timing runs).

The meaning of the output numbers is explained in driver.cc. The first numbers are N, num_trials, and the effective bandwidth (in gigabytes/second); the bandwidth is the most important figure. The numbers that follow give the running times of the main computation part, the preparation part, and the post-computation part; each part reports four times: minimum, median, maximum, and mean.
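As a rough mental model of that bandwidth figure (the exact formula and byte count are defined in driver.cc; the helper below is an illustrative assumption):

```cpp
// Effective bandwidth in GB/s: total bytes moved divided by elapsed time.
// For list ranking over N nodes, the bytes moved per sweep would be on the
// order of N times the per-node data touched (next index plus rank).
double effectiveBandwidthGBs(double bytesMoved, double seconds) {
  return bytesMoved / seconds / 1e9; // gigabytes per second
}
```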

The Jinx cluster has 30 nodes, 24 of which have GPUs. Use the gpu attribute to request a GPU node. To simplify testing and debugging, we will be making interactive node requests this week, meaning you will log directly into the node, as instructed below. However, there is a time limit on interactive node usage; moreover, please keep your interactive time as short as possible, since many of your peers will be vying for time on these nodes.

To request an interactive job rather than a batch job, you do not create a batch script. Instead, simply invoke qsub with the argument -I (for "interactive"):

  qsub -I -q class_long -l nodes=1:gpu -d $(pwd)

Note the following:

  • The nodes=1:gpu flag requests 1 node but also specifically asks for a node with a GPU.

  • The -d $(pwd) flag is a little trick that will make sure your current directory after you are logged into the node is the same as the current working directory where you issued the qsub command.

Once you've tested logging into a node interactively, please exit so as to not hog a node.

Target: The performance target for the Cilk code is 0.1 GB/s when N=10,000,000. You can comment out "-g" in the Makefile, which may (or may not) improve performance. Don't worry if the Cilk code's performance is worse than the sequential code's. :)

Once you've got something working, use the usual add-commit-push steps to save this version of your repo. You do not need to submit it to us just yet; wait until you have completed Part 2 below.

Part 2: Implement a parallel list ranker using CUDA

The relevant contents of the lab3 directory are as follows:

  1. The Makefile is set up to build both the Cilk Plus code and the CUDA code. You can use make to compile both, or make listrank-cilk for just the Cilk Plus code and make listrank-cuda for just the CUDA code.
  2. The file in which you should create your CUDA-based implementation is called listrank-cuda.cu.
  3. To build the driver that includes your implementation, run make listrank-cuda, which on success produces a binary executable called listrank-cuda.
  4. You can run this executable the same way as listrank-cilk.

Implement a CUDA-based version of the parallel list ranking algorithm in the file named listrank-cuda.cu. You can refer to the lecture slides and the "Info on CUDA" material to build a CUDA variant of your Cilk code, though this is harder than, say, translating from Cilk Plus to OpenMP. For the CUDA programming model, see Chapter 2 of "Info on CUDA". We hope it inspires you.

As a simple example, consider a CUDA program that adds two matrices A and B of size NxN and stores the result into a matrix C.
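A standard CUDA kernel and host wrapper for this task might look as follows (a sketch; the array names, block size, and launch configuration are illustrative):

```cuda
// Device kernel: one thread computes one element C[i][j] = A[i][j] + B[i][j].
__global__ void matAdd(const float* A, const float* B, float* C, int N) {
  int i = blockIdx.y * blockDim.y + threadIdx.y; // row
  int j = blockIdx.x * blockDim.x + threadIdx.x; // column
  if (i < N && j < N)
    C[i * N + j] = A[i * N + j] + B[i * N + j];
}

// Host side: allocate device buffers, copy inputs over, launch a 2D grid
// of 16x16 thread blocks covering the NxN matrix, and copy the result back.
void matAddOnGpu(const float* A, const float* B, float* C, int N) {
  size_t bytes = (size_t)N * N * sizeof(float);
  float *dA, *dB, *dC;
  cudaMalloc(&dA, bytes);
  cudaMalloc(&dB, bytes);
  cudaMalloc(&dC, bytes);
  cudaMemcpy(dA, A, bytes, cudaMemcpyHostToDevice); // "preparation" transfer
  cudaMemcpy(dB, B, bytes, cudaMemcpyHostToDevice);
  dim3 threads(16, 16);
  dim3 blocks((N + 15) / 16, (N + 15) / 16);
  matAdd<<<blocks, threads>>>(dA, dB, dC, N);
  cudaMemcpy(C, dC, bytes, cudaMemcpyDeviceToHost); // "post-computation" transfer
  cudaFree(dA);
  cudaFree(dB);
  cudaFree(dC);
}
```

Note how the host-to-device and device-to-host cudaMemcpy calls bracket the kernel launch; these are exactly the transfers that the preparation and post-computation timings measure.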

The output of the CUDA code has the same format as the Cilk version's. The preparation time measures the data transfer from CPU to GPU, and the post-computation time measures the transfer from GPU to CPU. You will see that both of these times are much larger than the corresponding parts of the Cilk code.

Once you've got something working, use the usual add-commit-push steps to save this version of your repo, and then submit it to us.

Target: The target for CUDA code is 0.3 GB/s when N=10,000,000.

If you find anything interesting in your tests with different list lengths, please note it in the README file.

Additional Resources
