Homework 1: Serial Matrix Multiply

Due: Friday, Feb 14 by 11:59 pm.

Problem

For this assignment, you will optimize a routine to multiply two double-precision square matrices. As discussed in class, the naive implementation is short, sweet, and horrifyingly slow. A naive blocked code is only marginally better. You will need to use what you have learned about tuning to get your code to run as fast as possible on a single core on one of the instructional nodes of the C4 cluster (an Intel Xeon E5504).

We provide:

  • A trivial unoptimized implementation and simple blocked implementation
  • A timing harness and tester
  • A version of the interface that calls the ATLAS BLAS

Implementation

Your function must have the following signature:

void square_dgemm(unsigned M, const double* A, const double* B,
                  double* C);

The three arrays should be interpreted as matrices in column-major order with leading dimension M. The operation implemented will actually be a multiply-add:

C := C + A*B

Look at the code in dgemm_basic.c if you find this confusing.
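
For reference, here is a minimal sketch of the naive version (in the spirit of dgemm_basic.c; the loop order in the actual file may differ):

void square_dgemm(unsigned M, const double* A, const double* B,
                  double* C)
{
    /* Column-major with leading dimension M: entry (i,j) of A
     * is A[i + j*M]. */
    for (unsigned j = 0; j < M; ++j)
        for (unsigned k = 0; k < M; ++k)
            for (unsigned i = 0; i < M; ++i)
                C[i + j*M] += A[i + k*M] * B[k + j*M];
}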

The necessary files are in the Bitbucket repository under the matmul subdirectory. You should have already checked out the repository; to make sure you have the most recent updates to the code, change to the directory you cloned into (probably ~/cs5220-s14) and run

git pull origin

Included in the directory are:

  • README.md: Overview of the code
  • Makefile: The build rules
  • Makefile.in.*: Platform-specific flag and library specs used in the Makefile
  • dgemm_*: Different modules implementing the square_dgemm routine
  • fdgemm.f: Reference Fortran dgemm from Netlib (cf. dgemm_f2c)
  • matmul.c: Driver program for testing and timing square_dgemm versions
  • plotter.py: Python script for drawing performance plots
  • runner.sh: Helper script for running the timer on the instructional nodes

We will be testing on the 2.0 GHz Xeon machines on the C4 cluster. Each node has two quad-core chips, but you will only be using a single core for this assignment. See here for more information on the cluster.

Submission

Your group should submit your dgemm.c, your Makefile, and your Makefile.in, packaged in a tarball matmul.tgz of your entire project source folder, along with a write-up (as a separate PDF).
Your write-up should contain:

  • the names of the group members
  • a description of optimizations used or attempted
  • the results of those optimizations
  • your explanations for any odd behavior (e.g. performance dips)

To document the effect of your optimizations, include a graph comparing your code with dgemm_basic.c. Your explanations should rely heavily on your knowledge of the memory hierarchy (benchmark graphs help).

Type tar -czvf matmul.tgz mm_kernel matmul to compress your submission on C4. The above command assumes that you execute it under the cs5220-s14/ folder. You may also want to use make realclean to remove unneeded files.

General strategy

Roughly speaking, a fast matrix multiplication routine will likely have two ingredients: a fast "kernel" capable of multiplying small matrices quickly, and then routines that use the kernel in a cache-friendly way to multiply larger matrices. One way to approach the assignment is top-down: sketch out the kernel first, and make sure you understand the global structure of the calls you want to set up, then start tuning parameters. Another way is bottom-up: build a fast kernel, play with parameters until you're happy with it; then build a matrix multiply for small-ish matrices (things that fit in L1 cache) out of the kernel, and play with parameters until happy; and then repeat for the L2 (and maybe L3) cache. I strongly recommend the second path, if only for the sake of sanity while debugging! It's easier to focus on getting a small thing right first than to try to get it all right on the first try.
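
To make that structure concrete, here is a rough sketch of the blocked outer loop (L1_BLOCK is a placeholder you would tune; in a real code, do_block would copy panels into aligned scratch and call your small fast kernel instead of running a plain triple loop):

#define L1_BLOCK 64   /* placeholder block size; tune against your L1 cache */

/* One block-sized sub-problem:
 * C(i0:i0+ni, j0:j0+nj) += A(i0:i0+ni, k0:k0+nk) * B(k0:k0+nk, j0:j0+nj). */
static void do_block(unsigned M, const double* A, const double* B, double* C,
                     unsigned i0, unsigned j0, unsigned k0,
                     unsigned ni, unsigned nj, unsigned nk)
{
    for (unsigned j = j0; j < j0 + nj; ++j)
        for (unsigned k = k0; k < k0 + nk; ++k)
            for (unsigned i = i0; i < i0 + ni; ++i)
                C[i + j*M] += A[i + k*M] * B[k + j*M];
}

void square_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    for (unsigned j = 0; j < M; j += L1_BLOCK)
        for (unsigned k = 0; k < M; k += L1_BLOCK)
            for (unsigned i = 0; i < M; i += L1_BLOCK)
                do_block(M, A, B, C, i, j, k,
                         (i + L1_BLOCK <= M) ? L1_BLOCK : M - i,
                         (j + L1_BLOCK <= M) ? L1_BLOCK : M - j,
                         (k + L1_BLOCK <= M) ? L1_BLOCK : M - k);
}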

In order to simplify the task of building a small kernel, I have included a kernel timing harness in the Bitbucket repository (under the mm_kernel subdirectory).

Tactics

There are a few lower-level logistical issues that you should probably be aware of as you proceed. You may want to read my notes on serial tuning before going through these points, since some of them address things you have to know in order to implement my suggestions there using icc or gcc on the cluster.

Language standards and compile errors

The provided codes are written using the C99 dialect of the C programming language, and we use some features that do not occur in C89. Most modern compilers support the C99 standard (it has been 15 years!), but only if you ask them to do so. If you are using the provided Makefiles, you'll get the right behavior by default, since I've included the -std=gnu99 flag in the CFLAGS variable.

We also assume that the compiler supports OpenMP, though we only use it for the timing routines. This is the point of the -fopenmp flag in the LDFLAGS variable in the Makefiles. Unfortunately, Clang (the default compiler on OS X) does not support OpenMP by default.
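
Concretely, the relevant variables in a Makefile.in look something like the following (the exact contents of the provided Makefile.in.* files may differ, including the optimization flags):

CC = gcc
CFLAGS = -std=gnu99 -O3   # C99 (GNU dialect) plus optimization
LDFLAGS = -fopenmp        # needed for the omp_get_wtime timer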

SSE Programming

GCC is still not all that smart about how it makes use of the SSE units. The Intel compiler does a better job, but it still has trouble producing code that reaches the peak performance. However, if you tell either GCC or the Intel compilers where it makes sense to use SSE instructions, the compiler can then help you do the heavy lifting of efficiently scheduling those instructions. See further documentation here.

Feel free to use this code directly! More generally, you should feel free to use other codes that you find online, though you should provide an appropriate citation (e.g. via a comment). The details of how to use the SSE units aren't really the thing I hoped you'd get out of this -- and in any case, this code still needs some fiddling in order to get great performance.
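
For a taste of what hand-placed SSE looks like, here is a small illustrative sketch (not the code from the notes linked above) of a 2-by-2 multiply-add kept entirely in registers, assuming 16-byte-aligned, column-major 2-by-2 blocks:

#include <emmintrin.h>  /* SSE2 intrinsics */

/* C(2x2) += A(2x2) * B(2x2), all column-major and 16-byte aligned. */
static void kernel_2x2(const double* A, const double* B, double* C)
{
    __m128d a0 = _mm_load_pd(A);      /* column 0 of A */
    __m128d a1 = _mm_load_pd(A + 2);  /* column 1 of A */
    __m128d c0 = _mm_load_pd(C);      /* column 0 of C */
    __m128d c1 = _mm_load_pd(C + 2);  /* column 1 of C */

    /* Broadcast each entry of B and accumulate into the C columns. */
    c0 = _mm_add_pd(c0, _mm_mul_pd(a0, _mm_load1_pd(B)));      /* b00 */
    c0 = _mm_add_pd(c0, _mm_mul_pd(a1, _mm_load1_pd(B + 1)));  /* b10 */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a0, _mm_load1_pd(B + 2)));  /* b01 */
    c1 = _mm_add_pd(c1, _mm_mul_pd(a1, _mm_load1_pd(B + 3)));  /* b11 */

    _mm_store_pd(C, c0);
    _mm_store_pd(C + 2, c1);
}

A real kernel would handle several such blocks per call and keep more accumulators live; this is just the basic pattern.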

Where arrays live in C

There are different types of places where variables can live in memory:

  • Automatic variables (the type you declare at the start of a function) live on the stack.
  • Global variables and static variables (defined with the static keyword in a function) live in the global space.
  • Constants often live in their own constant space.
  • Dynamically allocated data lives on the heap.

In a lot of day-to-day stuff, we use the stack for almost everything. But the stack is a finite resource! If you start allocating large arrays on the stack, you will very quickly have a stack overflow (and a program crash). This is probably not what you want, so I suggest using the following strategy if you want a little side buffer for copy optimization:

void my_function(...) 
{ 
    /* static means that this goes in the global segment 
     * (only one copy -- not thread safe!), and the alignment 
     * attribute is helpful if you're going to use my_space 
     * with SSE stuff.
     */ 
    static double my_space[BLOCK_SIZE*BLOCK_SIZE] 
        __attribute__((aligned(16))); 
    ... 
}

You can also dynamically allocate space on the heap -- but you won't want to do that anywhere near an inner loop! If you want to dynamically allocate aligned storage, the Intel compiler provides the _mm_malloc and _mm_free functions; for example

double* scratch = _mm_malloc(BSIZE*BSIZE*sizeof(double), 16);
// Do some stuff with the storage here...
_mm_free(scratch);

Timing woes

Timing on any modern CPU is a bit of a pain, for several reasons:

  1. Not all systems provide easy access to a high-resolution clock. When they are accessible, the APIs used to access the system clock(s) are system-dependent.
  2. The real time clock measures total time for a run, including time given to other processes. We hope the C4 scheduler is good about not oversubscribing nodes, so this should not be an issue, but it is still something to watch out for.
  3. There are per-CPU clocks, but these may differ across the machine, so that if your process gets scheduled on multiple processors, your timing gets goofy.

In the past, I have maintained my own tiny timer library, but by now, there really are better ways. Our code this year uses the omp_get_wtime function, one of the standard functions in OpenMP, to manage the timing. You should not use OpenMP to speed up your codes -- the point of the exercise is to get good serial performance -- but it's awfully convenient to use that standard wallclock time routine.
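
For reference, the basic pattern looks like this (matmul.c contains the real harness; a serious measurement would repeat the call and take a best-of or an average):

#include <omp.h>

void square_dgemm(unsigned M, const double* A, const double* B, double* C);

/* Returns wall-clock seconds for one square_dgemm call. */
double time_dgemm(unsigned M, const double* A, const double* B, double* C)
{
    double t0 = omp_get_wtime();
    square_dgemm(M, A, B, C);
    double t1 = omp_get_wtime();
    return t1 - t0;
}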

Further Resources

  • I have mentioned Piazza, right?
  • I have written some notes on serial tuning that may be helpful.
  • This assignment was shamelessly cribbed from the F11 assignment, which in turn borrowed from previous instances of this course and its ancestors into the mists of antiquity. Many of the resources listed on these related pages will be useful.
  • You can find out more about the processor by running Todd Allen's cpuid utility and by running cat /proc/cpuinfo. Note that you should do this using the batch queueing system, since the front-end node has a different processor than the workers.
  • What Every Programmer Should Know About Memory by Ulrich Drepper is a 114-page description of what Drepper considers to be memory basics. There is good information on everything from the physics of memory hardware up through the logical organization of memory.
  • The Intel 64 & IA-32 Architectures Optimization Reference Manual is just what it says. You probably won't go through all 700-odd pages of this, but there are some chapters that you might find useful.
  • The optimizations used in PHiPAC and ATLAS may be interesting. Note: you cannot use PHiPAC or ATLAS to generate your matrix multiplication kernels. You can, however, write your own code generator. You might want to skim the tech report on PHiPAC; the section on C coding guidelines will be particularly relevant.
  • There is a classic paper on the idea of cache blocking in the context of matrix multiply by Monica Lam, et al.
  • Several folks have tried to automatically get cache locality by basing their matrix storage organization around space-filling curves. See the paper by Chatterjee and the classic paper by Gustavson, "Recursion leads to automatic variable blocking for dense linear algebra". I don't necessarily recommend these optimizations on a time budget, but they make interesting reading.