UMESIMD / Introduction and rationale


This piece of code was developed as part of the ICE-DIP project at CERN.

 "ICE-DIP is a European Industrial Doctorate project funded by the 
 European Community's 7th Framework programme Marie Curie Actions under grant
 PITN-GA-2012-316596".

 All questions should be submitted using the bug tracking system or by sending an e-mail to:

   przemyslaw.karpinski@cern.ch


// ***************************************************************************
// TABLE OF CONTENTS
// ***************************************************************************

 I.   Introduction
 II.  Why to use UME::SIMD?
 III. When not to use UME::SIMD?
 IV.  Performance
 V.   Compatibility
 VI.  Workflow - requesting help

// ***************************************************************************
// I. INTRODUCTION
// ***************************************************************************


    UME::SIMD is an explicit SIMD vectorization library for modern CPUs. 

    The library is implemented using C++ 11 and so requires a compliant compiler.

    Modern CPU architectures introduce the concept of 'SIMD vector registers'. These
    registers can pack multiple data elements and execute a single instruction on all
    vector elements at the same time. Executing SIMD code can bring a significant speedup
    over 'scalar code', that is, code operating on one data element (a 'scalar' element)
    at a time.
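
    The difference between scalar and SIMD execution can be sketched in plain C++. This
    is only a model: Packed4 and add_packed are hypothetical names standing in for a
    4-element vector register and a single vector instruction, and no real vector
    instructions are used.

```cpp
#include <cstddef>

// Scalar code: one addition per loop iteration.
void add_scalar(const float* a, const float* b, float* out, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Hypothetical stand-in for a 4-element SIMD register (not a real hardware type).
struct Packed4 {
    float lane[4];
};

// Models one "vector instruction": the same addition applied to all 4 lanes.
inline Packed4 add_packed(const Packed4& x, const Packed4& y) {
    Packed4 r;
    for (int k = 0; k < 4; ++k) {
        r.lane[k] = x.lane[k] + y.lane[k];
    }
    return r;
}

// SIMD-style code: 4 elements per step, then a scalar loop for the remainder.
void add_packed_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        Packed4 x, y;
        for (int k = 0; k < 4; ++k) {
            x.lane[k] = a[i + k];
            y.lane[k] = b[i + k];
        }
        const Packed4 r = add_packed(x, y);
        for (int k = 0; k < 4; ++k) {
            out[i + k] = r.lane[k];
        }
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

    On real SIMD hardware the body of add_packed would be a single instruction, which is
    where the speedup over the scalar loop comes from.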

    'Explicit' vectorization refers to the software development process in which the
    programmer is aware of vectorization capabilities and writes the code so that it
    utilises the underlying hardware. This approach contrasts with so-called
    'auto-vectorization', in which the compiler is responsible for recognizing pieces of
    code suitable for vectorization and then applying optimizations that generate vector
    instructions. In the auto-vectorization model, the user doesn't have to be aware of
    vectorization at the CPU instruction set level. Unfortunately, auto-vectorization is
    not (yet!) very good, and so there is a need for other, more direct means of
    interacting with SIMD hardware.

    There are multiple problems with the current support for vectorization on different
    CPUs. Some of them are:

    - Explicit vector programming requires the user to use assembly or 'vector intrinsic
      functions'. Neither method is portable (across different CPUs or even compilers).
      In practice, this limits vectorization to short kernels instead of allowing it to
      be used on the same scale as regular scalar code.

    - Not all SIMD vector types are supported in the form of CPU SIMD registers. Since
      certain algorithms can only be executed using certain SIMD lengths, the user has to
      create complicated workarounds.

    - Not all operations a user would like to perform are supported for given vector types.
      This is clearly a design flaw or an engineering trade-off made during the CPU design
      process. Regardless of the reason, users have to develop workarounds repeatedly.

    - It is not easy to write code that works for both scalar and vector data types.
      Because the set of operations available on scalar types differs from the set of
      operations available on SIMD types, the same code cannot be written for both.

    - It is not easy to write code in which the vector type can be changed easily.
      Compiler intrinsic functions force the user to write code in a non-portable way.
      Whenever a user wants to change the vector length or base element type, they are
      forced to rewrite the whole piece of code. While it is not always possible to write
      code that executes the same way with different SIMD lengths, there are many
      occasions on which this is necessary.

    - It is not easy to run vectorized code on a non-vectorizing CPU. Both inline assembly
      and compiler intrinsics require compilation with specific architecture-dependent
      compiler flags. This forces users to use compile-time directives to include or
      exclude specific fragments of code, which decreases code maintainability.

    - It is not easy to prepare vectorized code to be run on future vectorizing CPUs.
      That means writing and debugging code that does exactly the same thing each time a
      new vectorizing CPU appears.
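
    Several of the problems listed above show up even in a very small kernel. The sketch
    below adds two arrays using AVX intrinsics from <immintrin.h> when the compiler
    defines __AVX__, and falls back to a scalar loop otherwise. The function name
    add_arrays is made up for this illustration; it is not part of any library.

```cpp
#include <cstddef>

#if defined(__AVX__)
#include <immintrin.h>  // x86 intrinsics; the AVX ones are used only when AVX is enabled
#endif

// Adds two float arrays element by element.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // AVX path: 8 floats per iteration, tied to one instruction set
    // and to one vector length.
    for (; i + 8 <= n; i += 8) {
        const __m256 va = _mm256_loadu_ps(a + i);
        const __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar path: the whole array without AVX, or the remainder with it.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

    Changing the element type or the vector length means changing every intrinsic call
    and the loop stride, and every additional instruction set needs another #if branch.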



// ***************************************************************************
// II. WHY TO USE UME::SIMD?
// ***************************************************************************

    UME::SIMD defines a set of hermetic data types that hide the underlying vectorizing
    hardware from the user. While the library uses compiler intrinsics extensively, the
    programmer no longer needs to understand how these intrinsics map to the actual
    instruction sets. The user sees only UME::SIMD types and operates only on these types.
    All types have a well-defined and wide interface, so there is no need (except for some
    really extreme, low-level optimizations) to understand intrinsics code or the detailed
    capabilities of the underlying hardware. While SIMD arithmetic itself is a little
    different from what most programmers are used to writing, it is no longer complicated
    by hardware complexity and by the varying level of support in different architectures.

    The library introduces SIMD1 data types to be used in regular code. A SIMD1 data type
    essentially operates on one scalar element at a time, with one exception: the SIMD1
    data container exposes the same interface as the other SIMD types! This makes it
    possible to write code only once and run it either as scalar or as SIMD code. As some
    of the included microbenchmarks indicate, the performance of SIMD1 is very similar
    to the performance of equivalent scalar code. While this relation doesn't hold for all
    algorithms, it still gives users an additional tool for analysing slowdowns
    resulting from the use of SIMD code.
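
    The "write once, run as scalar or SIMD" idea can be sketched as follows. EmuVec and
    its members (length, load, store, add, mul) are hypothetical names invented for this
    illustration and are not the UME::SIMD API; the class simply emulates a vector with
    a plain array so that the example stays self-contained.

```cpp
#include <cstddef>

// Hypothetical emulated vector: N elements stored in a plain array.
// Illustrative only; this is not a UME::SIMD type.
template <typename T, std::size_t N>
class EmuVec {
public:
    static constexpr std::size_t length() { return N; }

    EmuVec() : data_() {}

    void load(const T* p) {
        for (std::size_t i = 0; i < N; ++i) data_[i] = p[i];
    }
    void store(T* p) const {
        for (std::size_t i = 0; i < N; ++i) p[i] = data_[i];
    }
    EmuVec add(const EmuVec& other) const {
        EmuVec r;
        for (std::size_t i = 0; i < N; ++i) r.data_[i] = data_[i] + other.data_[i];
        return r;
    }
    EmuVec mul(const EmuVec& other) const {
        EmuVec r;
        for (std::size_t i = 0; i < N; ++i) r.data_[i] = data_[i] * other.data_[i];
        return r;
    }

private:
    T data_[N];
};

// Kernel written once: out[i] = a[i] * x[i] + y[i], for any vector width.
// Assumes n is a multiple of Vec::length().
template <typename Vec, typename T>
void multiply_add(const T* a, const T* x, const T* y, T* out, std::size_t n) {
    for (std::size_t i = 0; i < n; i += Vec::length()) {
        Vec va, vx, vy;
        va.load(a + i);
        vx.load(x + i);
        vy.load(y + i);
        va.mul(vx).add(vy).store(out + i);
    }
}
```

    Calling multiply_add<EmuVec<float, 1> >(a, x, y, out, n) processes one element per
    step, like scalar code, while multiply_add<EmuVec<float, 4> >(a, x, y, out, n)
    processes four. The kernel source is identical in both cases.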

    By creating an abstraction layer, it is possible to create workarounds for multiple
    problems such as:
    - missing vector types,
    - missing ISA instructions or some hardware resources,
    - missing intrinsic functions.

    Missing operations or resources can impact performance. The library can give
    compile-time guidance to the user about potential performance problems.

    Because the interface is exactly the same for all data types, nothing prevents the
    user from writing reusable (e.g. templated) code. As presented in the code examples,
    this library is pretty handy in providing means for code reusability.

    The library is very simple to use. All users need to do is include the "UMESimd.h"
    file in their project and enable C++ 11 functionality (-std=c++11) in their compiler.
    The vectorization extension used depends on some additional compiler flags, but
    the library will recognize them and select the proper implementation without any
    additional modifications to the project. If the code is compiled without any
    vectorization enabled, the library will execute all operations in emulated mode,
    using arrays of scalar types to represent vectors. While this can be really bad for
    performance, compilers are also very good at optimizing scalar code, so the
    performance should be similar to that of regular code.

    Different CPUs use different instruction sets. Because explicit programming requires
    the user to write the code for all types of CPU 'explicitly', UME::SIMD was designed
    so that it is possible (and relatively easy) to add new CPUs to the supported list.
    This can be done by writing a plugin and implementing the whole interface for that
    specific instruction set. While the number of instructions to be overridden is
    overwhelming at first sight, the existing interface classes limit the amount of code
    that must be written before the plugin can be used to only a few hundred lines. Thanks
    to that, further development can be done incrementally, and some minimum necessary
    capabilities can be enabled in a matter of hours.
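
    One way such interface classes can keep a new plugin small is sketched below. This
    is an assumption about the general technique (a base class template that supplies
    default, scalar-emulated operations), not a copy of the library's code: VecInterface,
    MinimalVec4f, TunedVec4f and all member names are hypothetical.

```cpp
#include <cstddef>

// Hypothetical interface class: gives every operation a default implementation
// built on two primitives (get/set) that each plugin type must provide.
template <typename Derived, typename T, std::size_t N>
class VecInterface {
public:
    Derived add(const Derived& other) const {
        Derived result;
        for (std::size_t i = 0; i < N; ++i) {
            result.set(i, self().get(i) + other.get(i));
        }
        return result;
    }
    Derived mul(const Derived& other) const {
        Derived result;
        for (std::size_t i = 0; i < N; ++i) {
            result.set(i, self().get(i) * other.get(i));
        }
        return result;
    }
    // Horizontal sum of all elements.
    T hadd() const {
        T sum = T();
        for (std::size_t i = 0; i < N; ++i) {
            sum += self().get(i);
        }
        return sum;
    }

private:
    const Derived& self() const { return static_cast<const Derived&>(*this); }
};

// Minimal "plugin" type: only storage and element access are written by hand;
// add, mul and hadd come from the interface class.
class MinimalVec4f : public VecInterface<MinimalVec4f, float, 4> {
public:
    MinimalVec4f() : data_() {}
    float get(std::size_t i) const { return data_[i]; }
    void set(std::size_t i, float v) { data_[i] = v; }

private:
    float data_[4];
};

// Later refinement: the same type with one operation replaced.
class TunedVec4f : public VecInterface<TunedVec4f, float, 4> {
public:
    TunedVec4f() : data_() {}
    float get(std::size_t i) const { return data_[i]; }
    void set(std::size_t i, float v) { data_[i] = v; }

    // Hides the default add(). A real plugin would call intrinsics here;
    // direct array access stands in for them in this sketch.
    TunedVec4f add(const TunedVec4f& other) const {
        TunedVec4f r;
        for (std::size_t i = 0; i < 4; ++i) {
            r.data_[i] = data_[i] + other.data_[i];
        }
        return r;
    }

private:
    float data_[4];
};
```

    MinimalVec4f is usable as soon as get and set exist, and operations can then be
    replaced one at a time, as TunedVec4f does for add, without touching user code.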

    One of the platforms this library targets in the first place is Intel Xeon Phi.
    Because of that, it will be supported at a similar level as Xeon processors.

    The library is released under the MIT license. It is free for any type of application,
    with the limitation of preserving the original authorship information. You can copy,
    redistribute, modify, delete and do whatever you want with this code, for free. The
    license was chosen for a few reasons:
     1) I believe that introducing vector types is necessary for the future evolution of
        computing and it shouldn't be blocked by licensing problems. Because the library
        is "include-like", it must not spoil the license of any project that is
        potentially using it.

     2) This library can grow really large. Initial estimate is about 500000 lines of C++
        code with heavy use of intrinsics and template metaprogramming. Because of that it
        cannot be well developed without some community support. Without the proper feedback,
        no software component can ever achieve its full potential.

     3) This code is low-level enough to be useful in many domains, in both commercial and
        academic applications. Opening the source code for such library is a great opportunity
        to share some effort that would be beneficial for everyone.

     4) The project was established with support and main funding from the European
        Commission, using money coming from the public budget. For that reason, the code
        created for the project shouldn't be limited to CERN and should be accessible to
        everyone.


// ***************************************************************************
// III. WHEN NOT TO USE UME::SIMD?
// ***************************************************************************

   There are a few situations in which you don't need to use this library:

     1) "I don't need more performance from my application." 

       It is usually easier to stick to regular scalar code and only optimize whatever is
       performance critical. If your project only needs a speedup in one critical algorithm
       on one specific platform, it might be faster to just write intrinsic code. Although
       integrating UME::SIMD into a project is trivial, the compilation time will suffer due
       to the extensive use of templates. And of course there might be some bugs...

     2) "I want to program CUDA and other GPGPU devices."

       These devices have completely different approaches towards SIMD programming and
       overall hardware architecture. You can use UME::SIMD under some other abstraction
       layer to hide different hardware, but there are no plans to implement a separate
       plugin for CUDA in this library. If you are interested in both CPU and CUDA, there
       is another library developed at CERN that has support for CUDA devices:

         https://github.com/VcDevel/Vc

       Vc is pretty good in terms of performance on CPUs, but has somewhat limited
       capabilities in terms of supported vector types.

     3) "I want the performance RIGHT NOW!"

       UME::SIMD is not yet mature in terms of performance, although it should reach the
       top performance of other vectorization approaches in the not-so-distant future. If
       you still need some means to test your ideas, you can use existing explicit
       vectorization libraries such as Vector Code library (Vc), Vector Class Library (VCL)
       or boost::simd:

         https://github.com/VcDevel/Vc
         http://www.agner.org/optimize/#vectorclass
         https://github.com/jfalcou/boost.simd

     4) "I don't need portability, I just want to program KNC (Intel Knights Corner) with
         fancy vector classes".

       Well, there is a plugin for VCL, which I developed a while ago, that allows vector
       code to be used on KNC. The code is merged with the original VCL and available at:

         https://bitbucket.org/edanor/vclknc_integrated

       The code acts the same way and uses the same approach as VCL. The VCL
       documentation from:

         http://www.agner.org/optimize/#vectorclass

       applies in general also for VCLKNC.

     5) "I want to vectorize some standard containers without changing my code"

       There is YET ANOTHER explicit vectorization library, boost::simd, available at:

         https://github.com/jfalcou/boost.simd

       This library targets improvements in boost and cooperation with existing boost
       components. While it doesn't provide a solution for STL components, perhaps you
       could just switch to boost.



// ***************************************************************************
// IV. PERFORMANCE
// ***************************************************************************

    Vectorization concepts were introduced into CPUs for one reason, and one reason only:
    performance. This library is designed so that it is possible to extract as much
    performance as possible from CPUs; however, it aims at solving vectorization problems
    in general. The number of supported data types is large, and it is larger than the
    number of data types supported as SIMD register types in most existing vectorization
    ISAs. The reason is to give the user as much flexibility as possible in terms of
    software development and to prepare code for execution on new architectures that
    might be available in 3-5 years.

    Developing intrinsic code for all (over 60!) vector types and for all supported
    instruction sets is a time-consuming task. Rather than ENABLING FUNCTIONALITY piece
    by piece, this library provides the full programming interface from the start using
    scalar types, and will ENABLE PERFORMANCE over time.

    Using extensive scalar emulation makes it possible to compile UME::SIMD based code for
    all types of SIMD, to test the correctness of results, and to carry out development
    even on platforms that don't support vectorization! Since the interface is close to
    being complete (starting with the first version of the library!), the functionality
    will not change over time (hopefully...).

    This will increase the portability of the code as well as code reusability and eliminate
    some of the problems that HPC programmers have to spend time on over and over again.



// ***************************************************************************
// V. COMPATIBILITY
// ***************************************************************************

    The code is compiled on a regular basis using the following compilers:
     - MS Visual C++ compiler (CL v 17.00 and newer, available since MS Visual Studio 2012)
     - GNU g++ compiler (4.8.2 and higher)
     - INTEL C++ compiler (15.0 and higher)
     - CLANG++ compiler (3.8 and higher)

    In the current version the following instruction sets are supported and targeted for
    full implementation:
     - AVX + FMA
     - AVX2 + FMA
     - AVX512 (with planned support for Intel Xeon Phi: Knights Landing (KNL) and Skylake
       (SKX) processors)
     - IMCI (Initial Many-Core Instructions, the instruction set of the Intel Xeon Phi:
       Knights Corner coprocessor)

    // NOTE: Code compiled for SSExxx will compile to scalar emulation. This is because we
    // don't plan to support SSE instruction set as a standalone plugin. The baseline was set
    // on AVX. If you are desperate for SSEx support, please let us know.

    // NOTE: While current support targets instruction sets developed for Intel
    // processors, the interface can also be implemented for other ISAs (such as ARM NEON).
    // If you think there is a visible need to develop support for other instruction sets
    // and you are willing to spend your resources (time and/or money) on that, you are
    // welcome to contribute.

// ***************************************************************************
// VI. WORKFLOW - REQUESTING HELP
// ***************************************************************************

    The process of ENABLING PERFORMANCE will be carried out over time. Unfortunately this
    can only be done by specializing every possible combination of:

     1) SIMD vector type (over 60)
     2) instruction set (4 currently)
     3) operation (~250 operations per SIMD type)

    This gives in total 60000 overloadable member functions! While not all of them have to
    be overloaded (e.g. SIMD1 and SIMD2 can be handled pretty well using scalar emulation),
    this still amounts to a really large code base.

    The overloading of a single member function can take as little as 1 line, and as much as
    100 lines of intrinsic code. Adding some testing code to that (the average unit test
    length is ~5 lines of code), this gives in total at least 3,000,000 lines of code (most
    of it using intrinsics).
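
    The estimates above can be reproduced with a few lines of arithmetic. The inputs are
    the rounded figures from the text; the average of 50 lines per member function is
    only what the quoted total implies when divided by the function count, and is not a
    number given by the project.

```cpp
// Rounded figures quoted in the text.
constexpr long vector_types     = 60;   // "over 60" SIMD vector types
constexpr long instruction_sets = 4;    // currently supported
constexpr long operations       = 250;  // "~250" operations per SIMD type

// Every combination of vector type, instruction set and operation.
constexpr long member_functions = vector_types * instruction_sets * operations;
static_assert(member_functions == 60000, "60 * 4 * 250");

// One unit test per operation of each vector type.
constexpr long unit_tests = vector_types * operations;
static_assert(unit_tests == 15000, "60 * 250");

// Implied average (an inference, not a quoted figure): lines of code and tests
// per member function if the total is 3,000,000 lines.
constexpr long quoted_total_lines = 3000000;
constexpr long implied_lines_per_function = quoted_total_lines / member_functions;
static_assert(implied_lines_per_function == 50, "3,000,000 / 60,000");
```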

    Because of the limited resources of this project, it is necessary to rely on community
    support. If you are a user and you would like to improve the code base, but you don't
    have time to do development, you can still help! All you have to do is submit an issue
    in our tracking system with the [PERFORMANCE REQUEST] tag in the title. The issue
    should state:

     1. instruction set to be enabled
     2. data type to be optimized
     3. list of member functions that are required to be optimized
     4. number of cores (order of magnitude) you are targeting with your application

    // NOTE: as Amdahl's law states, we should only focus on the things that give the best
    // return. And so the tickets will be handled based on the above information,
    // prioritizing very small and easy modifications, and requesters with large
    // setups. Of course some exceptions still apply.
    //
    // If a similar issue already exists, please add a comment saying that you are also
    // interested in such improvement. This would help in effective prioritizing.

    The code might still be buggy. This comes from the number of combinations inside the
    library. Most of the bugs should be pretty easy to resolve and have not been caught yet
    because of the small number of unit tests developed (using the above estimates, the
    first target would be to have at least one unit test for each instruction of each SIMD
    type, but this is over 15000 unit tests, and a long way to go). If you find a bug in
    the software, please submit a ticket with the [BUG] tag in the title. If you can
    provide a list of:

      1. instruction set
      2. data type 
      3. member function 

    that cause the problem, or implement a unit test that exposes the bug, that would be great!

    // NOTE: bugs will be treated with higher priority than [PERFORMANCE REQUESTS] because
    // they signify structural or functional problems. While performance is the goal of
    // SIMD programming, the correctness is still more important.

    There are some things that we are not planning to do, such as:
     - supporting older versions of compilers
     - implementing the interface in other languages
     - adding support for CUDA
     - overloading C++ operators to perform some operations

    If you think it is necessary to have certain functionality and it is still not there,
    please feel free to submit a ticket or ask a direct question.
