CSE 6230, Fall 2014: Lab 5, Th Oct 2: GPU Reduction [DUE: Th Oct 9]

In this lab, you will put the performance optimization concepts covered in class into practice. Your specific task is to reproduce the GPU reduction kernel case study results.

In the first part, you will read and understand the code provided for the naive implementation (reduceNaiveKernel, or version 1 from the lecture) and finish the non-divergent threading version (reduceNonDivergeKernel, or version 2 from the lecture). For each version, your performance must reach at least 90% of the numbers listed on the lecture slides.

In the second part, you will implement the other optimizations discussed in class (versions 3-7).

You may work in teams of two if you wish. To simplify our grading of your assignments, each person should submit his/her own assignment; however, all team members may submit identical code. Be sure to indicate with whom you worked by creating a README file as part of your submission. (See below for details.)

Part 0: Get the assignment code

Use the same fork-checkout procedure from Lab 1. The repo you want is gtcse6230fa14/lab5. As a reminder, the basic steps to get started are:

  1. Log into your Bitbucket account.

  2. Fork the code for this week's lab into your account. The URL is: https://bitbucket.org/gtcse6230fa14/lab5.git. Be sure to rename your repo, appending your Bitbucket ID. Also mark your repo as "Private" if you do not want the world to see your commits.

  3. Check out your forked repo on Jinx. Assuming your Bitbucket login is MyBbLogin and that you renamed your forked repo to lab5--MyBbLogin as described in step 2, you would use the following command on Jinx:

#!bash
git clone https://MyBbLogin@bitbucket.org/MyBbLogin/lab5--MyBbLogin.git

Alternatively, if you figured out how to do password-less checkouts using ssh keys, you might use the alternative checkout style, git clone git@bitbucket.org:MyBbLogin/lab5--MyBbLogin.git.

If it worked, you'll have a lab5--MyBbLogin subdirectory that you can start editing. It should contain 8 files.

Part 1: Reduction with Non-divergent threads [do in-class]

Open and read the reduce.cu and driver.c files, which contain the skeleton code for the reduction kernel. It performs the operation

s = A[0] + A[1] + A[2] + ... + A[N-1]

using a tree-based approach, whereby each thread block reduces a portion of the array, and stores the partial sum back to a temporary array for further reduction.
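The tree-based scheme described above can be traced on the host with a minimal C sketch. This is not the lab's CUDA code: the names reduce_block and tree_reduce, the float element type, and the block size of 8 are all assumptions chosen for illustration, and the inner loops stand in for the threads of a block running in parallel.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define BLOCK_SIZE 8  /* hypothetical; kept small so the tree is easy to trace */

/* Reduce the slice of `in` owned by `block` to one value using a tree. */
static float reduce_block(const float *in, size_t n, size_t block)
{
    float shared[BLOCK_SIZE];            /* stands in for shared memory */
    size_t base = block * BLOCK_SIZE;

    /* Each simulated thread loads one element; out-of-range threads load 0. */
    for (size_t tid = 0; tid < BLOCK_SIZE; ++tid)
        shared[tid] = (base + tid < n) ? in[base + tid] : 0.0f;

    /* At each level the stride doubles and half as many threads add. */
    for (size_t stride = 1; stride < BLOCK_SIZE; stride *= 2)
        for (size_t tid = 0; tid < BLOCK_SIZE; tid += 2 * stride)
            shared[tid] += shared[tid + stride];

    return shared[0];
}

/* Reduce per-block partial sums repeatedly until one value remains.
 * Returns 0 for an empty array or if temporary storage cannot be allocated. */
float tree_reduce(const float *a, size_t n)
{
    assert((BLOCK_SIZE & (BLOCK_SIZE - 1)) == 0);  /* tree needs a power of 2 */
    if (n == 0) return 0.0f;

    size_t blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    float *buf_a = malloc(blocks * sizeof *buf_a);
    float *buf_b = malloc(blocks * sizeof *buf_b);
    if (!buf_a || !buf_b) { free(buf_a); free(buf_b); return 0.0f; }

    const float *in = a;
    float *out = buf_a;
    for (;;) {
        blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
        for (size_t b = 0; b < blocks; ++b)      /* one "thread block" each */
            out[b] = reduce_block(in, n, b);     /* partial sum to temp array */
        if (blocks == 1) break;
        in = out;                                /* partial sums become input */
        n = blocks;
        out = (out == buf_a) ? buf_b : buf_a;    /* ping-pong temp buffers */
    }

    float s = out[0];
    free(buf_a);
    free(buf_b);
    return s;
}
```

Calling tree_reduce on an array holding 1, 2, ..., 100 takes three passes (13 partial sums, then 2, then 1), which mirrors the "further reduction" of the temporary array on the device.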

All of the host code needed for this function has been provided, including the code for allocating memory and copying data to and from the device. All you have to do is fill in the blanks for the different kernels.

Source code for the naive version reduceNaiveKernel has already been provided as part of the skeleton code. Read it carefully since understanding that code will help you in writing the other implementations.

After reading the code for reduceNaiveKernel, fill in the necessary code for reduceNonDivergeKernel, which optimizes the naive version to remove thread divergence, as discussed in class.
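To see why the choice of thread-to-element mapping matters, the following host-side C sketch counts how often a warp ends up with a mix of active and idle threads. It assumes the naive version selects threads with a modulo test and the non-divergent version uses the common strided-index mapping (index = 2 * stride * tid); the lab's actual kernels may differ. The names target_index and reduce_and_count_divergence, the block size of 256, and the warp size of 32 are illustrative choices, not taken from the skeleton code.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 256  /* hypothetical threads per block */
#define WARP_SIZE   32

enum scheme { NAIVE, NON_DIVERGENT };

/* Element that thread `tid` accumulates into at this stride,
 * or BLOCK_SIZE if the thread is idle at this level. */
static size_t target_index(enum scheme scheme, size_t tid, size_t stride)
{
    if (scheme == NAIVE)                       /* active if tid % (2s) == 0 */
        return (tid % (2 * stride) == 0) ? tid : BLOCK_SIZE;
    size_t index = 2 * stride * tid;           /* pack work into low tids */
    return (index < BLOCK_SIZE) ? index : BLOCK_SIZE;
}

/* Reduce data[0..BLOCK_SIZE) in place, leaving the sum in data[0].
 * Returns the number of (warp, level) pairs in which a warp contained
 * both active and idle threads, i.e. would have diverged. */
size_t reduce_and_count_divergence(enum scheme scheme, float *data)
{
    assert(data != NULL);
    size_t divergent = 0;
    for (size_t stride = 1; stride < BLOCK_SIZE; stride *= 2) {
        for (size_t warp = 0; warp < BLOCK_SIZE / WARP_SIZE; ++warp) {
            size_t active = 0;
            for (size_t lane = 0; lane < WARP_SIZE; ++lane) {
                size_t tid = warp * WARP_SIZE + lane;
                size_t i = target_index(scheme, tid, stride);
                if (i < BLOCK_SIZE) {
                    data[i] += data[i + stride];
                    ++active;
                }
            }
            if (active != 0 && active != WARP_SIZE)
                ++divergent;
        }
    }
    return divergent;
}
```

With these assumed sizes, both mappings produce the same sum, but the modulo mapping diverges in 47 warp-levels while the strided mapping diverges in only 5, all of them in the last levels where fewer than 32 threads have work left.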

When you are ready to compile, type make clean; make, which produces the binary reduce. You can run this binary using the provided cuda.pbs file, which you can edit to have the kernel reduce arrays of different sizes. Make sure that your implementation also works when N is not a power of 2.
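When N is not a power of 2, the last thread block is only partly filled, and a tree-ordered sum of floats will generally not match a sequential sum bit for bit. The C helpers below sketch one way to reason about both issues. They are hypothetical (none of these names come from driver.c), and they assume float elements and a relative tolerance of your choosing.

```c
#include <assert.h>
#include <stddef.h>

/* Number of thread blocks needed to cover n elements (ceiling division). */
size_t num_blocks(size_t n, size_t block_size)
{
    assert(block_size > 0);
    return (n + block_size - 1) / block_size;
}

/* Number of in-range elements in block b; the remaining threads of that
 * block must contribute 0 rather than read past the end of the array. */
size_t valid_in_block(size_t n, size_t block_size, size_t b)
{
    size_t start = b * block_size;
    if (start >= n) return 0;
    size_t remaining = n - start;
    return remaining < block_size ? remaining : block_size;
}

/* Sequential reference sum, accumulated in double to limit rounding error. */
double reference_sum(const float *a, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += a[i];
    return s;
}

/* Relative comparison: returns 1 if got is within rel_tol of want. */
int close_enough(double got, double want, double rel_tol)
{
    double diff = got - want;
    if (diff < 0) diff = -diff;
    double scale = want < 0 ? -want : want;
    if (scale < 1.0) scale = 1.0;     /* avoid a zero tolerance near 0 */
    return diff <= rel_tol * scale;
}
```

For example, with N = 1000 and 256 threads per block, num_blocks gives 4 and the last block holds only 232 valid elements. Sizes such as 1, 255, 257, and 1000 are useful edge cases to try alongside powers of 2.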

Once you've got something working, use the usual add-commit-push steps to save this version of your repo on Bitbucket. You do not need to transfer it to us.

Part 2: More optimizations

For the second part of the lab, implement the other optimizations discussed in class (versions 3-7). For each kernel, the performance should reach at least 90% of the numbers listed on the lecture slides.

Your performance target for a guaranteed "A" this week is 100 GB/s for version 7, when reducing arrays of more than 16 million elements. Recall that this is a sufficient, but not necessary, condition for receiving an "A".

Bonus: To get up to a 20% bonus ("A+"), you'll need to hit 120 GB/s (on M2090) or give a convincing argument why such a level of performance is impossible. (The pin bandwidth is 177 GB/s on M2090, so your argument will have to be very compelling.)
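To put the targets above in perspective, this C sketch converts between kernel time and effective bandwidth. It assumes that bandwidth counts each input element as read once, that 1 GB means 10^9 bytes, and that elements are 4 bytes wide; the provided driver may measure things differently, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Effective bandwidth in GB/s when n elements are each read once. */
double effective_gbps(size_t n, size_t elem_bytes, double seconds)
{
    assert(seconds > 0.0);
    return (double)n * (double)elem_bytes / seconds / 1e9;
}

/* Largest kernel time, in milliseconds, that still meets target_gbps. */
double time_budget_ms(size_t n, size_t elem_bytes, double target_gbps)
{
    assert(target_gbps > 0.0);
    return (double)n * (double)elem_bytes / (target_gbps * 1e9) * 1e3;
}

/* Achieved bandwidth as a fraction of the hardware peak. */
double fraction_of_peak(double achieved_gbps, double peak_gbps)
{
    assert(peak_gbps > 0.0);
    return achieved_gbps / peak_gbps;
}
```

Under these assumptions, reducing 2^24 four-byte elements (about 67 MB) at 100 GB/s leaves a budget of roughly 0.67 ms, and at 120 GB/s roughly 0.56 ms. The 120 GB/s bonus target corresponds to about 68% of the 177 GB/s pin bandwidth.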
