SSE Programming

GCC is still not all that smart about how it makes use of the SSE units. The Intel compiler does a better job, but it still has trouble producing code that reaches peak performance. However, if you tell either GCC or the Intel compiler where it makes sense to use SSE instructions, the compiler can then do the heavy lifting of scheduling those instructions efficiently.

The very inner loop of my matrix multiply kernel from S10 and F11 looked something like

/* Include SSE2 intrinsics (the double-precision _mm_*_pd operations
 * and the __m128d type are declared in emmintrin.h). */
#include <emmintrin.h>

/* Compute x += A*b 
 *
 * The "restrict" keyword in C99 tells the compiler
 * there is no aliasing.
 *
 * Note that A and x must be aligned on 16-byte boundaries.  This can
 * be done where they are declared with something like:
 *
 *    double A[4] __attribute__((aligned(16)));
 *    double x[2] __attribute__((aligned(16)));
 */ 
void matvec2by2(const double* restrict A,  
                const double* restrict b, 
                double* restrict x) 
{ 
    /* Load each component of b into an SSE register.
     * The registers hold two doubles each; each double gets
     * the same value.  The __m128d type is the type used by
     * the compiler to mean "short vector with which we can
     * do SSE stuff."
     */ 
    __m128d b0 = _mm_load1_pd(b+0); 
    __m128d b1 = _mm_load1_pd(b+1);

    /* Load x into registers */ 
    __m128d xr = _mm_load_pd(x);

    /* Now, update x:
     *   load the columns of A into a0 and a1,
     *   multiply them by b0 and b1 (vectorized ops) to get sum0 and sum1,
     *   and add the results into x.
     */ 
    __m128d a0   = _mm_load_pd(A+0); 
    __m128d a1   = _mm_load_pd(A+2); 
    __m128d sum0 = _mm_mul_pd(a0,b0); 
    __m128d sum1 = _mm_mul_pd(a1,b1); 
    xr = _mm_add_pd(xr,sum0); 
    xr = _mm_add_pd(xr,sum1);

    /* It's good to store the result! */ 
    _mm_store_pd(x,xr); 
}
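
To see how the pieces fit together, here is a minimal driver sketch (added for illustration, not part of the original kernel) that sets up the aligned storage described in the comment above and calls matvec2by2. The test values are arbitrary, and you need a C99 compiler for the restrict keyword.

#include <stdio.h>

void matvec2by2(const double* restrict A,
                const double* restrict b,
                double* restrict x);

int main(void)
{
    /* Column-major 2-by-2 matrix: A = [1 3; 2 4] */
    double A[4] __attribute__((aligned(16))) = {1, 2, 3, 4};
    double b[2]                              = {1, 1};
    double x[2] __attribute__((aligned(16))) = {0, 0};

    matvec2by2(A, b, x);   /* x += A*b, so x should end up as {4, 6} */
    printf("x = [%g, %g]\n", x[0], x[1]);
    return 0;
}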

In the current version, I use a different kernel, one that computes the product A*B where A is 2-by-P and B is P-by-2. I use different storage layouts for A, B, and C, and much of the work in getting this version up and running went into rearranging storage (a sketch of one possible packing scheme appears after the kernel below). There is only one place where I do something quasi-magical, which is the instruction that shuffles the entries of one of the vector registers. There are two ways to do this, and depending on how you set up the rest of your code, one or the other may be superior. If you use this code, it's up to you to decide which you want to use:

#include <nmmintrin.h>

/*
 * Both variants below swap the two doubles in an SSE register: shufpd with
 * immediate 1 reverses the two 64-bit lanes, and pshufd with immediate 0x4e
 * reorders the 32-bit words as [2,3,0,1], which has the same effect.
 * On the Nehalem architecture, shufpd and multiplication use the same port,
 * while the 32-bit integer shuffle is a different matter.  If we want to
 * make it as easy as possible for the compiler to schedule multiplies along
 * with adds, it therefore makes sense to abuse the integer shuffle
 * instruction.  See also
 *   http://locklessinc.com/articles/interval_arithmetic/
 */
#ifdef USE_SHUFPD
#  define swap_sse_doubles(a) _mm_shuffle_pd(a, a, 1)
#else
#  define swap_sse_doubles(a) _mm_castsi128_pd(_mm_shuffle_epi32(_mm_castpd_si128(a), 0x4e))
#endif

/*
 * Block matrix multiply kernel: C += A*B.
 * P is the blocking dimension, assumed to be an even compile-time constant
 * defined elsewhere (there is no cleanup code for odd P).
 * Inputs:
 *    A: 2-by-P matrix in column-major format.
 *    B: P-by-2 matrix in row-major format.
 * Outputs:
 *    C: 2-by-2 matrix with element order [c11, c22, c12, c21]
 *       (diagonals stored first, then off-diagonals).
 */
void kdgemm2P2(double * restrict C,
               const double * restrict A,
               const double * restrict B)
{
    // This is really implicit in using the aligned ops...
    // (__assume_aligned is an Intel compiler hint; GCC has
    // __builtin_assume_aligned instead.)
    __assume_aligned(A, 16);
    __assume_aligned(B, 16);
    __assume_aligned(C, 16);

    // Load diagonal and off-diagonals
    __m128d cd = _mm_load_pd(C+0);
    __m128d co = _mm_load_pd(C+2);

    /*
     * Do block dot product.  Each iteration adds the result of a two-by-two
     * matrix multiply into the accumulated 2-by-2 product matrix, which is
     * stored in the registers cd (diagonal part) and co (off-diagonal part).
     */
    for (int k = 0; k < P; k += 2) {

        __m128d a0 = _mm_load_pd(A+2*k+0);
        __m128d b0 = _mm_load_pd(B+2*k+0);
        __m128d td0 = _mm_mul_pd(a0, b0);
        __m128d bs0 = swap_sse_doubles(b0);
        __m128d to0 = _mm_mul_pd(a0, bs0);

        __m128d a1 = _mm_load_pd(A+2*k+2);
        __m128d b1 = _mm_load_pd(B+2*k+2);
        __m128d td1 = _mm_mul_pd(a1, b1);
        __m128d bs1 = swap_sse_doubles(b1);
        __m128d to1 = _mm_mul_pd(a1, bs1);

        __m128d td_sum = _mm_add_pd(td0, td1);
        __m128d to_sum = _mm_add_pd(to0, to1);

        cd = _mm_add_pd(cd, td_sum);
        co = _mm_add_pd(co, to_sum);
    }

    // Write back sum
    _mm_store_pd(C+0, cd);
    _mm_store_pd(C+2, co);
}
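
The kernel above assumes A and B have already been copied into these special layouts. The actual packing code is not shown on this page; as a rough sketch of what the rearrangement might look like, suppose the full matrices are stored column-major with leading dimensions lda and ldb (names invented for this example), and that Apack and Bpack are 16-byte-aligned buffers of length 2*P. Then copying a 2-by-P block of A starting at row i, and a P-by-2 block of B starting at column j, might look like:

/* Copy a 2-by-P block of column-major A (leading dimension lda), starting
 * at row i, into the layout kdgemm2P2 expects: column k of the block goes
 * in Apack[2*k] and Apack[2*k+1].
 */
void pack_A_panel(double * restrict Apack, const double * restrict A,
                  int lda, int i)
{
    for (int k = 0; k < P; ++k) {
        Apack[2*k+0] = A[(i+0) + k*lda];
        Apack[2*k+1] = A[(i+1) + k*lda];
    }
}

/* Copy a P-by-2 block of column-major B (leading dimension ldb), starting
 * at column j, into row-major order: row k of the block goes in
 * Bpack[2*k] and Bpack[2*k+1].
 */
void pack_B_panel(double * restrict Bpack, const double * restrict B,
                  int ldb, int j)
{
    for (int k = 0; k < P; ++k) {
        Bpack[2*k+0] = B[k + (j+0)*ldb];
        Bpack[2*k+1] = B[k + (j+1)*ldb];
    }
}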

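To convince yourself that the [c11, c22, c12, c21] output layout and the accumulation behave as advertised, it helps to check the kernel against a scalar reference. The test below is added for illustration only; it assumes P is an even compile-time constant (defaulted to 8 here just for the check) and that the kernel is compiled with the same value of P.

#include <stdio.h>
#include <math.h>

#ifndef P
#define P 8   /* arbitrary even value for this check */
#endif

void kdgemm2P2(double * restrict C,
               const double * restrict A,
               const double * restrict B);

int main(void)
{
    double A[2*P] __attribute__((aligned(16)));  /* 2-by-P, column major */
    double B[2*P] __attribute__((aligned(16)));  /* P-by-2, row major    */
    double C[4]   __attribute__((aligned(16))) = {0, 0, 0, 0};

    for (int k = 0; k < P; ++k) {
        A[2*k+0] = k+1;          A[2*k+1] = -(k+1);
        B[2*k+0] = 1.0/(k+1);    B[2*k+1] = 2.0;
    }

    kdgemm2P2(C, A, B);

    /* Scalar reference, in the same [c11, c22, c12, c21] order */
    double ref[4] = {0, 0, 0, 0};
    for (int k = 0; k < P; ++k) {
        ref[0] += A[2*k+0] * B[2*k+0];  /* c11 */
        ref[1] += A[2*k+1] * B[2*k+1];  /* c22 */
        ref[2] += A[2*k+0] * B[2*k+1];  /* c12 */
        ref[3] += A[2*k+1] * B[2*k+0];  /* c21 */
    }

    double maxerr = 0;
    for (int i = 0; i < 4; ++i) {
        double e = fabs(C[i] - ref[i]);
        if (e > maxerr) maxerr = e;
    }
    printf("max abs error = %g\n", maxerr);
    return 0;
}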