
CSE 6230, Fall 2014: Lab 6, Th Oct 9: UVA/UVM and MPI+CUDA

In this lab, you will apply unified virtual addressing in CUDA to add GPU acceleration to an MPI-based implementation of 1-D distributed matrix multiply.

Please work in teams of up to two.

What to turn in and when

Due to Fall Break, this lab is due on Fri Oct 17 4:30pm (not Th Oct 16 before class, as would normally be the case).

Please follow all instructions and answer all questions; submit your answers in a PDF document and commit any code you write as part of the repo you transfer back to us by the due date.

Per the usual class policy, late assignments will not be accepted.

Part 0: Get the assignment code

Use the same fork-checkout procedure from Lab 1. The repo you want is gtcse6230fa14/lab6. As a reminder, the basic steps to get started are:

  1. Log into your Bitbucket account.

  2. Fork the code for this week's lab into your account (the repo is gtcse6230fa14/lab6). Be sure to rename your repo, appending your Bitbucket ID. Also mark your repo as "Private" if you do not want the world to see your commits.

  3. Check out your forked repo on Jinx. Assuming your Bitbucket login is MyBbLogin and that you named your forked repo lab6--MyBbLogin, you would use the following command on Jinx:

  git clone https://MyBbLogin@bitbucket.org/MyBbLogin/lab6--MyBbLogin.git

Alternatively, if you figured out how to do password-less checkouts using ssh keys, you might use the ssh-style checkout:

  git clone git@bitbucket.org:MyBbLogin/lab6--MyBbLogin.git

If it worked, you'll have a lab6--MyBbLogin subdirectory that you can start editing.

Part 1: Pinned memory and unified virtual addressing in CUDA

Recall that you have thus far been using explicit copies to transfer data between the host (CPU) and device (GPU). In CUDA, this looks something like:

  const int n = ...; // array size
  const int num_bytes = n * sizeof (float); // no. of bytes
  float* A_cpu = (float *)malloc (num_bytes);

  float* A_gpu;
  cudaMalloc ((void **)&A_gpu, num_bytes);

  /* CPU -> GPU */
  cudaMemcpy (A_gpu, A_cpu, num_bytes, cudaMemcpyHostToDevice);

  /* Call a hypothetical kernel */
  sort <<< ... >>> (A_gpu, n);

  /* GPU -> CPU */
  cudaMemcpy (A_cpu, A_gpu, num_bytes, cudaMemcpyDeviceToHost);

However, it is possible to transfer data more easily and more quickly using pinned memory allocation. A pinned memory block is "locked" into physical memory and cannot be swapped to disk; because it is always memory-resident, many systems can perform faster direct memory transfers between CPU and GPU memory.

The basic technique is simple. First replace malloc with a special call to the CUDA runtime that allocates pinned memory blocks; then, call CUDA kernels directly on these blocks. The example above would become:

  const int n = ...; // array size
  const int num_bytes = n * sizeof (float); // no. of bytes

  unsigned int alloc_flag = cudaHostAllocMapped | cudaHostAllocPortable;
  float* A_cpu_pinned;
  cudaHostAlloc ((void **)&A_cpu_pinned, num_bytes, alloc_flag);

  /* Call a hypothetical kernel: no explicit transfers required! */
  sort <<< ... >>> (A_cpu_pinned, n);

The trade-off with using pinned memory is that allocation and deallocation become significantly more expensive. For more information, see the link under "Additional resources" at the bottom of this page.
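A sketch of the full allocation lifecycle appears below. (This is illustrative only, not taken from the sample code: the variable names are hypothetical, most error checks are abbreviated, and passing the host pointer straight to a kernel assumes a UVA-capable device, as on Jinx.) Note in particular that pinned blocks must be released with cudaFreeHost(), not free():

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main (void)
{
  const int n = 1 << 20;
  const size_t num_bytes = n * sizeof (float);

  /* Mapped + portable pinned allocation, as in the example above */
  unsigned int alloc_flag = cudaHostAllocMapped | cudaHostAllocPortable;
  float* A_cpu_pinned = NULL;
  cudaError_t err = cudaHostAlloc ((void **)&A_cpu_pinned, num_bytes, alloc_flag);
  if (err != cudaSuccess) {
    fprintf (stderr, "cudaHostAlloc failed: %s\n", cudaGetErrorString (err));
    return 1;
  }

  /* ... use A_cpu_pinned directly in kernel launches (UVA) ... */

  /* Pinned blocks are freed with cudaFreeHost, not free */
  cudaFreeHost (A_cpu_pinned);
  return 0;
}
```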

We've provided a sample program, built via "make rev", that reverses the elements of an array on the GPU, first using explicit copies and then again using implicit copies via pinned memory. Look at the routines benchmarkReverseWithCopies() and benchmarkReverseWithoutCopies() to confirm how we apply the pinned memory technique. (It is sufficient to look for the comments, "Do one test run".) Next, compile this program, grab an interactive node, run the program, record the results, and release the interactive node:

  # Compile:
  $  make rev

  # Grab interactive node:
  $  qsub -I -q class -l nodes=1:gpu -d $(pwd)

  # [On node] Run the program on an array of size 2^24
  $  ./rev 16777216

  # [On node] Release interactive node
  $  exit

Question 1: Report the speed of data transfer using explicit copies vs. implicit copies on pinned memory blocks.

Part 2: Direct GPU-to-GPU communication

Recall that every Jinx node has two GPUs. Using the same technique as above, we can do implicit direct memory transfers between GPUs. The mgpu-dma program implements a "ping-pong": it starts by allocating memory on each of the two GPUs---call them "GPU 0" and "GPU 1." Then, it copies from GPU 0 to GPU 1. Finally, it copies back from GPU 1 to GPU 0. The program has two such ping-pong codes: one that uses an explicit intermediate buffer on the CPU, and one that does not.

Open the mgpu-dma source and look at two functions: pingpongBuffered() and pingpongDirect(). Notice the directional flag, cudaMemcpyDefault. Indeed, when using cudaMemcpy(), you can always use this flag instead of specifying a direction, as the runtime will figure out where the data live and do the right thing.
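As a rough sketch of the direct (unbuffered) style---hypothetical variable names, error checks omitted, and assuming UVA so the runtime can infer the direction---a GPU-to-GPU ping-pong might look like:

```cuda
#include <cuda_runtime.h>

void pingpong_sketch (size_t num_bytes)
{
  float *buf0, *buf1;

  cudaSetDevice (0);                 /* subsequent allocations target GPU 0 */
  cudaMalloc ((void **)&buf0, num_bytes);

  cudaSetDevice (1);                 /* ... and now GPU 1 */
  cudaMalloc ((void **)&buf1, num_bytes);

  /* "Ping": GPU 0 -> GPU 1. With cudaMemcpyDefault, the runtime
     determines where each pointer lives; no CPU staging buffer needed. */
  cudaMemcpy (buf1, buf0, num_bytes, cudaMemcpyDefault);

  /* "Pong": GPU 1 -> GPU 0 */
  cudaMemcpy (buf0, buf1, num_bytes, cudaMemcpyDefault);

  cudaFree (buf0);
  cudaFree (buf1);
}
```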

Next, let's compile, grab an interactive node, run the program, record its output, and release the node, just as above:

  # Compile:
  $  make mgpu-dma

  # Grab interactive node:
  $  qsub -I -q class -l nodes=1:gpu -d $(pwd)

  # [On node] Run the program on an array of size 2^24 and exit
  $  ./mgpu-dma 16777216
  $  exit

Question 2: Report the results.

You may also be interested in looking at the startup code in the function main(), which shows how to detect that there are multiple GPUs. Note that if you are writing a multi-GPU code, you may need to select a specific GPU before calling particular functions, such as cudaMalloc(). You can use cudaSetDevice() for this purpose. See the example or references below for details.
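The detection-and-selection pattern can be sketched as follows. (This is a generic illustration, not the sample's actual main(); error checks are omitted.)

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main (void)
{
  int num_gpus = 0;
  cudaGetDeviceCount (&num_gpus);    /* detect available GPUs */
  printf ("Found %d GPU(s)\n", num_gpus);

  for (int d = 0; d < num_gpus; ++d) {
    float* buf;
    cudaSetDevice (d);               /* subsequent runtime calls target GPU d */
    cudaMalloc ((void **)&buf, 1024 * sizeof (float));
    /* ... launch kernels on GPU d ... */
    cudaFree (buf);
  }
  return 0;
}
```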

Part 3: Distributed matrix multiply (in 1-D)

The last exercise will be for you to apply the pinned memory technique to a distributed 1-D block-row matrix multiply code, which we have provided.

Start by opening mm1d.c and looking at the function mm1d(). This function is more-or-less a direct translation of the 1-D block-row algorithm from the Lab 6 notes slides.

Verify that you understand how the pseudocode maps to the concrete code shown in mm1d(). Note that the matrix-related operations consist of three routines: mm_create, which creates a matrix; mm_free, which deallocates a matrix; and mm_local, which performs a node-local matrix multiply. As you look at the code, for now consider these as black-boxes.

Next, note that we have provided two implementations: mm-blas.c for the CPU, and a CUDA counterpart for the GPU. In both cases, for the node-local matrix multiply we use highly optimized and node-parallelized vendor routines from Intel (the Math Kernel Library, or MKL) and NVIDIA (cuBLAS). Start by compiling and running these implementations, to get a feel for their baseline performance. To compile, just run

  make

which builds all programs, and should in particular produce the mm1d-blas and mm1d-cuda binaries, respectively.

Note: Please run "make" on the jinx-login node, and then use the .pbs files to submit jobs. You'll get an error if you try to compile on an interactive Jinx node.

We've provided two job scripts as well, mm1d-blas.pbs and mm1d-cuda.pbs. These request two nodes and run the programs. To run them, simply submit these job scripts. Note that they both effectively request 1 MPI process per node (two MPI processes in total), and exploit intranode parallelism via the implementations of mm_local().

Question 3: After you run the codes, inspect the .o* job script output. Report the nodes used, problem size, effective GFLOP/s. Do this for both the CPU (MKL BLAS) and GPU (CUDA) programs.

Now modify mm_local, mm_create, and mm_free in the CUDA implementation to use pinned CPU blocks, thereby avoiding the explicit CPU-GPU copies.

Note 1: If you need information about the cublasSgemm() call, refer to the "Additional resources" links at the bottom of this page.

Note 2: For now, do not worry about trying to use both GPUs. However, if you want extra credit, we will consider any effort to do so!

Question 4: Submit your (hopefully working) code by transferring your repository to us, as you've done in previous labs. Report the nodes used and effective GFLOP/s for your implementation.

UPDATE: The condition for an automatic "A" is that your matrix multiply with pinned memory must be faster than the default implementation we provided. We will post hints on Piazza about how to diagnose performance problems in your implementation if you find it is not faster.

Additional resources