This piece of code was developed as part of the ICE-DIP project at CERN.

"ICE-DIP is a European Industrial Doctorate project funded by the
European Community's 7th Framework programme Marie Curie Actions under grant
PITN-GA-2012-316596".

All questions should be submitted using the bug tracking system or by sending an e-mail to:

przemyslaw.karpinski@cern.ch

// ***************************************************************************
// TABLE OF CONTENTS
// ***************************************************************************

I. Introduction
II. Why to use UME::SIMD?
III. When not to use UME::SIMD?
IV. Performance
V. Compatibility
VI. Workflow - requesting help

// ***************************************************************************
// I. INTRODUCTION
// ***************************************************************************

UME::SIMD is an explicit SIMD vectorization library for modern CPUs.

The library is implemented in C++11 and therefore requires a compliant compiler.

Modern CPU architectures introduce the concept of 'SIMD vector registers'. These
registers can pack multiple data elements, and a single instruction operates on
all vector elements at the same time. Execution of SIMD code can bring a significant
speedup over 'scalar code', that is, code executing on one data element (a 'scalar'
element) at a time.

'Explicit' vectorization refers to the software development process in which the
programmer is aware of vectorization capabilities and writes the code so that it
utilises the underlying hardware. This approach contrasts with so-called
'auto-vectorization', in which the compiler is responsible for recognizing pieces of
code suitable for vectorization and then applying optimizations that generate
vector instructions. In the auto-vectorization model, the user doesn't have to be aware
of vectorization at the CPU instruction set level. Unfortunately, auto-vectorization is
not (yet!) very good, so there is a need for other, more direct means of
interacting with SIMD hardware.

Support for vectorization varies between CPUs, which causes multiple problems. A few
of them are:

- Explicit vector programming requires the user to write assembly or use 'vector
intrinsic functions'. Neither method is portable (across different CPUs or even
compilers). In practice this limits vectorization to short kernels, instead of
allowing it to be used on the same scale as regular scalar code.

- Not all SIMD vector types are supported in the form of CPU SIMD registers. Since
certain algorithms can only be executed using certain SIMD lengths, the user has to
create complicated workarounds.

- Not all operations a user would like to perform are supported for a given vector type.
This is clearly a design flaw or an engineering trade-off made during the CPU design
process. Regardless of the reason, users have to develop the same workarounds repeatedly.

- It is not easy to write code that works for both scalar and vector data types.
Because the set of operations available on scalar types differs from the set of
operations available on SIMD types, the same code cannot be written for both scalar
and SIMD types.

- It is not easy to write code in which the vector type can be changed easily.
Compiler intrinsic functions force the user to write code in a non-portable
way. Whenever a user wants to change the vector length or base element type, they are
forced to rewrite the whole piece of code. While it is not always possible to write code
that executes the same way with different SIMD lengths, there are many occasions
on which such a change is necessary.

- It is not easy to run vectorized code on a non-vectorizing CPU. Both inline assembly
and compiler intrinsics require compilation with specific architecture-dependent
compiler flags. This forces users to use compile-time directives
to either include or exclude specific fragments of code, which decreases code
maintainability.

- It is not easy to prepare vectorized code to be run on future vectorizing CPUs.
That means writing and debugging code that does exactly the same thing each time a
new vectorizing CPU arrives.
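
The maintainability problem with compile-time directives can be illustrated with a
short sketch. It does not use UME::SIMD at all; the function name add_arrays is made
up for this illustration. The AVX intrinsics and the __AVX__ macro are the standard
ones provided by the compiler, and the AVX branch is only compiled when AVX is enabled
(e.g. with -mavx for GCC or Clang).

```cpp
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of any library): out[i] = a[i] + b[i].
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
#if defined(__AVX__)
    // AVX path: processes 8 floats per iteration. This fragment only exists
    // in the binary when the compiler is invoked with AVX enabled.
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
#endif
    // Scalar path: handles the remaining elements, or the whole array when
    // the AVX fragment was excluded at compile time.
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

Every additional instruction set would need another #if branch with its own intrinsics,
and every branch has to be written and debugged separately.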

// ***************************************************************************
// II. WHY TO USE UME::SIMD?
// ***************************************************************************

UME::SIMD defines a set of hermetic data types that hide the underlying vectorizing
hardware from the user. While the library uses compiler intrinsics extensively, the
programmer no longer needs to understand how these intrinsics map to the
actual instruction sets. The user sees only UME::SIMD types and operates only on these
types. All types have a wide, well-defined interface, so there is no need (except for
some really extreme, low-level optimizations) to understand intrinsic code or
the detailed capabilities of the underlying hardware. While SIMD arithmetic
itself is a little different from what most programmers are used to writing, it
is no longer complicated by hardware complexity and by the varying level of support
in different architectures.

The library introduces SIMD1 data types to be used in regular code. A SIMD1 type
essentially runs on one scalar element at a time, with one exception: the
SIMD1 data container exposes the same interface as the other SIMD types! This makes
it possible to write code only once and run it either as scalar or as SIMD code. As some
of the included microbenchmarks indicate, the performance of SIMD1 is very similar
to the performance of equivalent scalar code. While this relation doesn't hold for all
algorithms, it still gives users an additional tool for analysing slowdowns
resulting from using SIMD code.
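
The idea of writing a kernel once and running it with different vector lengths can be
sketched as follows. Note that EmuVec and mul_add are hypothetical names invented for
this illustration; they are NOT the UME::SIMD types or interface, just a minimal
scalar-emulated stand-in that shows the principle.

```cpp
#include <array>
#include <cstddef>

// Hypothetical stand-in for a SIMD type (NOT the UME::SIMD API).
// Every operation is emulated with a loop over N scalar lanes.
template <typename T, std::size_t N>
struct EmuVec {
    std::array<T, N> lane{};

    EmuVec& load(const T* p) {
        for (std::size_t i = 0; i < N; ++i) lane[i] = p[i];
        return *this;
    }
    void store(T* p) const {
        for (std::size_t i = 0; i < N; ++i) p[i] = lane[i];
    }
    EmuVec add(const EmuVec& o) const {
        EmuVec r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = lane[i] + o.lane[i];
        return r;
    }
    EmuVec mul(const EmuVec& o) const {
        EmuVec r;
        for (std::size_t i = 0; i < N; ++i) r.lane[i] = lane[i] * o.lane[i];
        return r;
    }
};

// Kernel written once: out[i] = a[i] * b[i] + c[i].
// N = 1 gives scalar-like execution, N > 1 gives vector-like execution.
template <typename T, std::size_t N>
void mul_add(const T* a, const T* b, const T* c, T* out, std::size_t n) {
    using Vec = EmuVec<T, N>;
    std::size_t i = 0;
    for (; i + N <= n; i += N) {
        Vec va, vb, vc;
        va.load(a + i);
        vb.load(b + i);
        vc.load(c + i);
        va.mul(vb).add(vc).store(out + i);
    }
    // Remainder elements that do not fill a whole vector.
    for (; i < n; ++i) out[i] = a[i] * b[i] + c[i];
}
```

Calling mul_add<float, 1>(...) and mul_add<float, 4>(...) on the same input runs the
same source code with one or four lanes, which is the kind of comparison a SIMD1 type
makes possible.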

By creating an abstraction layer, it is possible to provide workarounds for multiple
problems such as:
- missing vector types,
- missing ISA instructions or hardware resources,
- missing intrinsic functions.

Missing operations or resources can impact performance. The library can give the user
compile-time hints about potential performance problems.

Because the interface is exactly the same for all data types, nothing prevents the
user from writing reusable (e.g. templated) code any more. As the code examples show,
the library is pretty handy for making code reusable.

The library is very simple to use. All users need to do is include the "UMESimd.h"
header in their project and enable C++11 functionality (-std=c++11) in their compiler.
The vectorization extension used depends on some additional compiler flags, but
the library will recognize them and select the proper implementation without any
additional modifications to the project. If the code is compiled without any
vectorization enabled, the library will execute all operations in emulated mode,
using arrays of scalar types to represent vectors. While emulation can be bad for
performance, compilers are also very good at optimizing scalar code, so the
performance should be similar to that of regular scalar code.
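
How a header can pick an implementation from the compiler flags alone can be sketched
with the predefined macros that compilers set when a vector extension is enabled (for
example, -mavx2 makes GCC and Clang define __AVX2__). The function selected_backend
below is a hypothetical helper written for this illustration; it is not part of
UME::SIMD, and the library's own selection logic may look different.

```cpp
#include <string>

// Hypothetical helper (not part of UME::SIMD): reports which implementation
// a header-only library could select, based only on predefined compiler macros.
inline std::string selected_backend() {
#if defined(__MIC__)
    return "IMCI";              // Intel compiler targeting Knights Corner
#elif defined(__AVX512F__)
    return "AVX512";
#elif defined(__AVX2__)
    return "AVX2";
#elif defined(__AVX__)
    return "AVX";
#else
    return "scalar emulation";  // no supported vector extension enabled
#endif
}
```

With such a scheme the user's source code stays the same; only the compiler command
line changes between builds.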

Different CPUs use different instruction sets. Because explicit programming requires
the user to write code for every type of CPU 'explicitly', UME::SIMD was designed
so that it is possible (and relatively easy) to add new CPUs to the supported list.
This can be done by writing a plugin that implements the whole interface for that
specific instruction set. While the number of operations to be overridden is
overwhelming at first sight, the existing interface classes reduce the amount of code
that has to be written before the plugin can be used to only a few hundred lines.
Thanks to that, further development can be done incrementally, and the minimum
necessary capabilities can be enabled in a matter of hours.
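
One way such an incremental plugin model can work is sketched below: a base interface
class provides a scalar-emulated default for every operation, and a plugin type
overrides only the operations it has optimized so far. The names VecInterface and
MyVec4f are hypothetical and the design is an assumption made for illustration; this
is not the actual UME::SIMD class hierarchy.

```cpp
#include <cstddef>

// Hypothetical base interface: every operation has a scalar-emulated default
// built on top of the element accessors get()/set() of the derived type.
template <typename Derived, typename T, std::size_t N>
struct VecInterface {
    Derived add(const Derived& o) const {
        Derived r;
        for (std::size_t i = 0; i < N; ++i) r.set(i, self().get(i) + o.get(i));
        return r;
    }
    Derived mul(const Derived& o) const {
        Derived r;
        for (std::size_t i = 0; i < N; ++i) r.set(i, self().get(i) * o.get(i));
        return r;
    }
    T hadd() const {  // horizontal add: sum of all lanes
        T s = T(0);
        for (std::size_t i = 0; i < N; ++i) s += self().get(i);
        return s;
    }

private:
    const Derived& self() const { return static_cast<const Derived&>(*this); }
};

// Hypothetical plugin type: supplies storage and element access, and
// overrides only add(). mul() and hadd() fall back to the emulated defaults.
struct MyVec4f : VecInterface<MyVec4f, float, 4> {
    float data[4] = {0.0f, 0.0f, 0.0f, 0.0f};

    float get(std::size_t i) const { return data[i]; }
    void set(std::size_t i, float v) { data[i] = v; }

    // Specialized version; a real plugin would use intrinsics here.
    MyVec4f add(const MyVec4f& o) const {
        MyVec4f r;
        r.data[0] = data[0] + o.data[0];
        r.data[1] = data[1] + o.data[1];
        r.data[2] = data[2] + o.data[2];
        r.data[3] = data[3] + o.data[3];
        return r;
    }
};
```

In this sketch the plugin is usable as soon as get() and set() exist, and each
further override only replaces one emulated operation with a faster one.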

One of the platforms this library targets in the first place is Intel Xeon Phi.
Because of that, support for it will be provided at a level similar to that for Xeon
processors.

The library is released under the MIT license. It is free for any type of application,
with the limitation of preserving the original authorship information. You can copy,
redistribute, modify, delete and do whatever you want with this code, for free. The
license was chosen for a few reasons:
1) I believe that introducing vector types is necessary for the future evolution of
computing and it shouldn't be blocked by licensing problems. Because the library
is "include like", it must not contaminate the license of any project
that is potentially using it.

2) This library can grow really large. The initial estimate is about 500000 lines of C++
code with heavy use of intrinsics and template metaprogramming. Because of that it
cannot be developed well without some community support. Without proper feedback,
no software component can ever achieve its full potential.

3) This code is low-level enough to be useful in many domains, in both commercial and
academic applications. Opening the source code of such a library is a great opportunity
to share some effort that would be beneficial for everyone.

4) The project was established with support and main funding from the European
Commission, using money coming from the public budget. For that reason the code created
for the project shouldn't be limited to CERN and should be accessible to everyone.

// ***************************************************************************
// III. WHEN NOT TO USE UME::SIMD?
// ***************************************************************************

There are a few situations when you don't need/have to use this library:

1) "I don't need more performance from my application."

It is usually easier to stick to regular scalar code and only optimize whatever is
performance critical. If your project only needs a speedup in one critical algorithm
on one specific platform, it might be faster to just write intrinsic code. Although
integrating UME::SIMD into a project is trivial, the compilation time will suffer due
to the extensive use of templates. And of course there might be some bugs...

2) "I want to program CUDA and other GPGPU devices."

These devices take completely different approaches to SIMD programming and to
overall hardware architecture. You can use UME::SIMD under some other abstraction
layer to hide different hardware, but there are no plans to implement a separate
plugin for CUDA in this library. If you are interested in both CPU and CUDA, there
is another library developed at CERN that has support for CUDA devices:

https://github.com/VcDevel/Vc

VC is pretty good in terms of performance on CPUs, but is limited
in terms of supported vector types.

3) "I want the performance RIGHT NOW!"

UME::SIMD is not yet mature in terms of performance, although it should reach the top
performance of other vectorization approaches in the not-so-far future. If you still
need some means to test your ideas, you can use existing explicit vectorization
libraries such as Vector Code library (VC), Vector Class Library (VCL) or boost::simd:

https://github.com/VcDevel/Vc
http://www.agner.org/optimize/#vectorclass
https://github.com/jfalcou/boost.simd

4) "I don't need portability, I just want to program KNC (Intel Knights Corner) with
fancy vector classes".

Well, there is a plugin for VCL, which I developed a while ago, that allows VCL code
to be used on KNC. The code is merged with the original VCL and available at:

https://bitbucket.org/edanor/vclknc_integrated

The code acts the same way and uses the same approach as VCL. The VCL
documentation from:

http://www.agner.org/optimize/#vectorclass

applies in general also for VCLKNC.

5) "I want to vectorize some standard containers without changing my code"

There is YET ANOTHER explicit vectorization library: boost::simd, available at:

https://github.com/jfalcou/boost.simd

This library targets improvements in boost and cooperation with existing boost
components. While it doesn't provide a solution for STL components, perhaps you could
just switch to boost.

// ***************************************************************************
// IV. PERFORMANCE
// ***************************************************************************

Vectorization concepts were introduced into CPUs for one reason, and one reason only:
performance. This library is designed to make it possible to extract as much
performance as possible from CPUs; however, it aims to address vectorization in general.
The number of supported data types is large, larger than the number of data
types supported as SIMD register types in most existing vector ISAs. The
reason is to give the user as much flexibility in software
development as possible and to prepare code for execution on new architectures that
might be available in 3-5 years.

Developing intrinsic code for all (over 60!!!) vector types and for all supported
instruction sets is a time-consuming task. Instead of ENABLING FUNCTIONALITY step by
step, this library provides the full programming interface from the start using scalar
emulation, and will ENABLE PERFORMANCE over time.

Using extensive scalar emulation makes it possible to compile UME::SIMD based code for
all types of SIMD to test result correctness and to be able to perform development even
on platforms that don't support vectorization! Since the interface is close to being
complete (starting with the first version of the library!), the functionality will not
change over time (hopefully...).

This will increase the portability of the code as well as code reusability and eliminate
some of the problems that HPC programmers have to spend time on over and over again.

// ***************************************************************************
// V. COMPATIBILITY
// ***************************************************************************

The code is compiled on a regular basis using the following compilers:
- MS Visual C++ compiler (CL v 17.00 and newer, available since MS Visual Studio 2012)
- GNU g++ compiler (4.8.2 and higher)
- INTEL C++ compiler (15.0 and higher)
- CLANG++ compiler (3.8 and higher)

In the current version the following instruction sets are supported and targeted for
full implementation:
- AVX + FMA
- AVX2 + FMA
- AVX512 (with planned support for Intel Xeon Phi Knights Landing (KNL) and Skylake
(SKX) processors)
- IMCI (Initial Many-Core Instructions, the instruction set of the Intel Xeon Phi
Knights Corner coprocessor)

// NOTE: Code compiled for SSExxx will compile to scalar emulation. This is because we
// don't plan to support SSE instruction set as a standalone plugin. The baseline was set
// on AVX. If you are desperate for SSEx support, please let us know.

// NOTE: While current support targets instruction sets developed for Intel
// processors, the interface can also be implemented for other ISAs (such as ARM NEON).
// If you think there is a visible need to develop support for other instruction sets
// and you are willing to spend your resources (time and/or money) on that, you are
// welcome to contribute.

// ***************************************************************************
// VI. WORKFLOW - REQUESTING HELP
// ***************************************************************************

The process of ENABLING PERFORMANCE will be performed over time. Unfortunately this
can only be done by first specializing every possible combination of:

1) SIMD vector type (over 60)
2) Instruction set (4 currently)
3) Operation (~250 operations per SIMD type)

This gives 60 x 4 x 250 = 60000 overloadable member functions in total! While not all
of them have to be overloaded (e.g. SIMD1 and SIMD2 can be handled pretty well using
scalar emulation), this still amounts to a really large code base.

Overloading a single member function can take as little as 1 line and as much as
100 lines of intrinsic code. Adding some testing code (the average unit test length
is ~5 lines of code), this gives a total on the order of 3,000,000 lines of code (most
of it using intrinsics).
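
The estimates above can be checked with a little compile-time arithmetic. The counts
are taken from the text; the average of 45 implementation lines per function is an
assumption, chosen here only because it reproduces the 3,000,000 figure. With the
stated range of 1 to 100 lines per function, the total lies between 360,000 and
6,300,000 lines.

```cpp
// Counts quoted in the text.
constexpr long vector_types     = 60;   // "over 60", used as a lower bound
constexpr long instruction_sets = 4;
constexpr long ops_per_type     = 250;  // "~250"
constexpr long test_lines       = 5;    // average unit test length

// Number of overloadable member functions.
constexpr long member_functions = vector_types * instruction_sets * ops_per_type;
static_assert(member_functions == 60000, "60 x 4 x 250");

// Total lines of code for a given average implementation length per function.
constexpr long total_lines(long impl_lines_per_function) {
    return member_functions * (impl_lines_per_function + test_lines);
}

static_assert(total_lines(1) == 360000, "shortest case: 1 line each");
static_assert(total_lines(100) == 6300000, "longest case: 100 lines each");
// ASSUMPTION: 45 implementation lines on average (not stated in the text).
static_assert(total_lines(45) == 3000000, "45 + 5 lines per function");
```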

Because of the limited resources of this project it is necessary to rely on community
support. If you are a user and you would like to improve the code base, but you don't
have time to do development, you can still help! All you have to do is submit an issue
in our tracking system with the [PERFORMANCE REQUEST] tag in the title. The issue
should state:

1. instruction set to be enabled
2. data type to be optimized
3. list of member functions that are required to be optimized
4. number of cores (order of magnitude) you are targeting with your application

// NOTE: as Amdahl's law states, we should only be focusing on things that give the best
// return. And so the tickets will be handled based on the above information,
// prioritizing very small and easy-to-do modifications, and requesters with large
// setups. Of course some exceptions still apply.
//
// If a similar issue already exists, please add a comment saying that you are also
// interested in such improvement. This would help in effective prioritizing.

The code might still be buggy. This comes from the number of combinations inside the
library. Most of the bugs should be pretty easy to resolve and have not been caught yet
because of the small number of unit tests developed (using the above estimates, the
first target would be to have at least one unit test for each operation of each SIMD
type, but this is over 15000 unit tests, and a long way to go). If you find a bug in
the software, please submit a ticket with the [BUG] tag in the title. If you can
provide a list of:

1. instruction set
2. data type
3. member function

that cause a problem or implement a unit test that exposes the bug - that would be great!
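
A unit test that exposes a bug can be very short: run one member function on known
input and compare every lane with the scalar result. The sketch below shows the shape
of such a test. Vec4i is a hypothetical 4-lane type defined only to keep the example
self-contained; in a real report it would be replaced by the UME::SIMD type and member
function that misbehave.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical 4-lane vector type under test (NOT a UME::SIMD type).
struct Vec4i {
    std::int32_t lane[4];

    Vec4i add(const Vec4i& o) const {
        Vec4i r;
        for (std::size_t i = 0; i < 4; ++i) r.lane[i] = lane[i] + o.lane[i];
        return r;
    }
};

// Returns true when every lane of the vector result matches the scalar
// reference computed element by element.
bool test_add_matches_scalar() {
    const Vec4i a = {{1, -2, 3, 1000}};
    const Vec4i b = {{10, 20, -30, 1}};
    const Vec4i r = a.add(b);
    for (std::size_t i = 0; i < 4; ++i) {
        if (r.lane[i] != a.lane[i] + b.lane[i]) return false;
    }
    return true;
}
```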

// NOTE: bugs will be treated with higher priority than [PERFORMANCE REQUEST] tickets
// because they signify structural or functional problems. While performance is the
// goal of SIMD programming, correctness is still more important.

There are some things that we are not planning to work on, such as:
- supporting older compiler versions
- implementing the interface in other languages
- adding support for CUDA
- overloading C++ operators to perform some operations

If you think it is necessary to have certain functionality and it is still not there,
please feel free to submit a ticket or ask a direct question.