CSE 6230, Fall 2013: Lab 9, Tu Nov 5: Floating-Point SIMD

This page: http://j.mp/gtcse6230fa13lab9
Due date: ~~Tuesday, November 12, 4:30 PM~~ Friday, November 15, 4:30 PM
Lecture notes: SIMD Part 1, SIMD Part 2, Memory Optimizations
Info on the Jinx cluster: http://support.cc.gatech.edu/facilities/instructional-labs/jinx-cluster

In this lab you will practice using SIMD intrinsics to optimize floating-point computations. You will optimize two parts of the algorithm that computes eigencats (see eigenfaces). Eigencats are the eigenvectors of the covariance matrix of cat images. The assignment consists of two independent parts: converting images from fixed-point to floating-point format (from uint8_t pixels with values in [0, 255] to double pixels with values in [0.0, 1.0]), and matrix-vector multiplication (used to compute an eigenvalue by power iteration).
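
To show why matrix-vector multiplication dominates the eigenvalue computation, here is a minimal scalar sketch of power iteration. It is not the starting code: the function name, the row-major std::vector layout, and the fixed iteration count are all assumptions made for illustration, and it assumes a symmetric matrix such as a covariance matrix.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of the lab's starting code).
// Estimates the dominant eigenvalue of a symmetric n x n matrix stored
// row-major in `a`, using a fixed number of power iterations.
double power_iteration_sketch(const std::vector<double>& a, std::size_t n,
                              int iterations) {
    assert(n > 0 && a.size() == n * n);
    // Start from a normalized vector of equal components (an arbitrary choice).
    std::vector<double> v(n, 1.0 / std::sqrt(static_cast<double>(n)));
    std::vector<double> w(n);
    double eigenvalue = 0.0;
    for (int it = 0; it < iterations; ++it) {
        // w = A * v: this matrix-vector product is the hot loop.
        for (std::size_t r = 0; r < n; ++r) {
            double sum = 0.0;
            for (std::size_t c = 0; c < n; ++c) {
                sum += a[r * n + c] * v[c];
            }
            w[r] = sum;
        }
        // Rayleigh quotient v . (A v); v has unit length here.
        double dot = 0.0;
        double norm_sq = 0.0;
        for (std::size_t i = 0; i < n; ++i) {
            dot += v[i] * w[i];
            norm_sq += w[i] * w[i];
        }
        eigenvalue = dot;
        const double norm = std::sqrt(norm_sq);
        if (norm == 0.0) {
            return 0.0;  // A v vanished; no further progress is possible.
        }
        // Normalize w to obtain the next iterate.
        for (std::size_t i = 0; i < n; ++i) {
            v[i] = w[i] / norm;
        }
    }
    return eigenvalue;
}
```

Every iteration performs one full matrix-vector product, so speeding up that product speeds up the whole eigenvalue computation.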

You may, if you wish, work in teams of two. To simplify grading, each person should submit his/her own assignment; however, all team members may submit identical code. Be sure to indicate with whom you worked by creating a README file as part of your submission.

Part 0: Getting started

Execute the following command to set up your environment and get recent versions of gcc (4.8.1), clang (3.3), and valgrind (if you plan to use the Intel compiler, do not run this command in the same terminal session):

source /nethome/mdukhan3/install/envvars.sh

Fork the starting code for this lab and clone it to get a local copy of the repository.

The starting code implements naive versions of the algorithms to optimize, as well as unit tests and performance tests.

Part 1: Optimization of Fixed-Point to Floating-Point Conversion

In this part you will need to convert an array of grayscale images (all of the same size) from 8-bit unsigned integer format with range [0, 255] to double-precision floating-point format with range [0.0, 1.0].

Your task is to optimize the function convert_to_floating_point_optimized in image_simd.cpp. You are free to use any SIMD intrinsics and compiler auto-vectorization options, but NOT multi-threading or CUDA.
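
One possible approach, sketched with SSE2 intrinsics only, widens 8 pixels at a time from 8-bit to 32-bit integers and then converts pairs of them to doubles. The function name and signature below are hypothetical (the real signature in image_simd.cpp may differ), and the sketch uses unaligned stores so that it works on any pointer; it is a starting point under those assumptions, not a tuned solution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: converts n pixels from [0, 255] to [0.0, 1.0].
void convert_u8_to_double_sketch(const std::uint8_t* in, double* out,
                                 std::size_t n) {
    assert(in != nullptr && out != nullptr);
    std::size_t i = 0;
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    const __m128d scale = _mm_set1_pd(255.0);
    for (; i + 8 <= n; i += 8) {
        // Load 8 pixels into the low 64 bits of a register.
        const __m128i px8 =
            _mm_loadl_epi64(reinterpret_cast<const __m128i*>(in + i));
        // Zero-extend 8-bit -> 16-bit -> 32-bit by interleaving with zeros.
        const __m128i px16 = _mm_unpacklo_epi8(px8, zero);
        const __m128i lo32 = _mm_unpacklo_epi16(px16, zero);  // pixels 0..3
        const __m128i hi32 = _mm_unpackhi_epi16(px16, zero);  // pixels 4..7
        // _mm_cvtepi32_pd converts only the two lower 32-bit lanes, so the
        // upper pair is first moved down with _mm_unpackhi_epi64.
        const __m128d d01 = _mm_cvtepi32_pd(lo32);
        const __m128d d23 = _mm_cvtepi32_pd(_mm_unpackhi_epi64(lo32, lo32));
        const __m128d d45 = _mm_cvtepi32_pd(hi32);
        const __m128d d67 = _mm_cvtepi32_pd(_mm_unpackhi_epi64(hi32, hi32));
        // Divide by 255 so results match the scalar expression exactly.
        _mm_storeu_pd(out + i + 0, _mm_div_pd(d01, scale));
        _mm_storeu_pd(out + i + 2, _mm_div_pd(d23, scale));
        _mm_storeu_pd(out + i + 4, _mm_div_pd(d45, scale));
        _mm_storeu_pd(out + i + 6, _mm_div_pd(d67, scale));
    }
#endif
    // Scalar loop: handles any remainder, or everything without SSE2.
    for (; i < n; ++i) {
        out[i] = in[i] / 255.0;
    }
}
```

Multiplying by a precomputed 1.0 / 255.0 is usually cheaper than dividing, but it can round differently from the division, so check it against the unit test before relying on it.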

Part 2: Optimization of Matrix-Vector Multiplication

In this part you will optimize multiplication of a matrix by a vector. You are free to rearrange the order of floating-point computations to benefit from SIMD (the unit test uses interval arithmetic to check that results fall within a valid range).

Your task is to optimize the function matrix_vector_multiplication_optimized in image_simd.cpp. You are free to use any SIMD intrinsics and compiler auto-vectorization options, but NOT multi-threading or CUDA.
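
As an illustration of reordering the summation for SIMD, the sketch below keeps two independent SSE2 accumulators per row and adds them together at the end. The function name, the signature, and the row-major matrix layout are assumptions made for this example and may not match image_simd.cpp.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Hypothetical helper: y = A * x for a rows x cols row-major matrix.
void matvec_sketch(const double* matrix, const double* x, double* y,
                   std::size_t rows, std::size_t cols) {
    assert(matrix != nullptr && x != nullptr && y != nullptr);
    for (std::size_t r = 0; r < rows; ++r) {
        const double* row = matrix + r * cols;
        std::size_t c = 0;
        double sum = 0.0;
#if defined(__SSE2__)
        // Two accumulators hide some of the floating-point add latency.
        __m128d acc0 = _mm_setzero_pd();
        __m128d acc1 = _mm_setzero_pd();
        for (; c + 4 <= cols; c += 4) {
            acc0 = _mm_add_pd(acc0, _mm_mul_pd(_mm_loadu_pd(row + c),
                                               _mm_loadu_pd(x + c)));
            acc1 = _mm_add_pd(acc1, _mm_mul_pd(_mm_loadu_pd(row + c + 2),
                                               _mm_loadu_pd(x + c + 2)));
        }
        // Horizontal sum: combine the accumulators, then the two lanes.
        const __m128d acc = _mm_add_pd(acc0, acc1);
        const __m128d high = _mm_unpackhi_pd(acc, acc);
        sum = _mm_cvtsd_f64(_mm_add_sd(acc, high));
#endif
        // Scalar loop: handles any remainder, or everything without SSE2.
        for (; c < cols; ++c) {
            sum += row[c] * x[c];
        }
        y[r] = sum;
    }
}
```

Because the partial sums are accumulated in a different order than in a plain left-to-right loop, the result can differ from the naive version in the last bits, which is the kind of reordering the assignment permits.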

Optimization remarks

You may use the following facts in your optimization:

All arrays are aligned on 64 bytes.
The width and height of the images are multiples of 8.
The intrinsic _mm_cvtepi32_pd converts the two lower 32-bit signed integers of its argument to two doubles (there may be other ways to perform this conversion).

Grading

This assignment will be graded on performance (geometric mean of FPS in Part 1 and Part 2).
If some part of your code fails the unit test, that part will be graded based on the performance of the reference code.
If valgrind catches an error in some part of your code, that part will be graded based on the performance of the reference code.
Performance of 3000 FPS or higher (as measured on Jinx-login) guarantees an A.
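
For reference, the geometric mean of two measurements is the square root of their product. The helper below is a hypothetical illustration of that formula, not the grading script itself.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper: geometric mean of the FPS measured in Part 1 and Part 2.
double geometric_mean_fps(double fps_part1, double fps_part2) {
    assert(fps_part1 >= 0.0 && fps_part2 >= 0.0);
    return std::sqrt(fps_part1 * fps_part2);
}
```

With made-up measurements of 2000 FPS and 4500 FPS, the result is sqrt(2000 * 4500) = 3000 FPS. Since the mean is geometric, a part that falls back to reference performance scales the combined score down multiplicatively, so both parts matter.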

What to submit

Submit all changes to the code that you have made.
If you used a non-standard compiler (anything other than g++) or specified additional compiler flags, describe them in a README file. Otherwise, your submission will be compiled with the default parameters for grading.
Make sure your code passes the unit tests and that valgrind reports no errors.
