This is a set of fast implementations of a simple Trotter-Suzuki solver.
(c) 2010-2012 by Carlos Bederián <email@example.com>
reference/ has a naive CPU implementation for testing purposes.
sse/ has a fast CPU implementation using SSE intrinsics and a red-black split of
the matrices. It is limited by memory bandwidth, so it performs poorly on large
systems that don't fit in cache.
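As an illustration of the red-black split mentioned above, here is a minimal
sketch, not the code in sse/. It assumes a row-major matrix of floats with an
even width; the names RedBlack and split_red_black are hypothetical. Cells are
colored like a checkerboard and each color is stored in its own contiguous
array, so that a SIMD load picks up only cells of one color.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the two checkerboard halves of a matrix.
struct RedBlack {
    std::size_t width;         // width of the original matrix (assumed even)
    std::size_t height;        // height of the original matrix
    std::vector<float> red;    // cells where (x + y) is even
    std::vector<float> black;  // cells where (x + y) is odd
};

// Split a row-major width x height matrix into its red and black cells.
// Each output row holds width / 2 cells, in their original left-to-right order.
RedBlack split_red_black(const std::vector<float>& m,
                         std::size_t width, std::size_t height) {
    const std::size_t half = width / 2;
    RedBlack out{width, height,
                 std::vector<float>(half * height),
                 std::vector<float>(half * height)};
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const float v = m[y * width + x];
            const std::size_t idx = y * half + x / 2;
            if ((x + y) % 2 == 0) {
                out.red[idx] = v;
            } else {
                out.black[idx] = v;
            }
        }
    }
    return out;
}
```

With this layout, every neighbor of a red cell is black and vice versa, so all
cells of one color can be updated in a pass that reads only the other color.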
block/ adds cache tiling on top of the red-black SSE code. This code isn't fully
optimized (some overhead can still be removed), but it scales better than the
simple SSE code for large systems.
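The idea behind cache tiling can be sketched as follows. This is a simplified
illustration, not the code in block/: update_cell and tiled_update are
hypothetical names, and the update here is purely cell-local (it just halves
each value). A real solver step reads neighboring cells, so each tile would
also need a halo of surrounding cells, which this sketch leaves out. The point
shown is only the loop structure: all steps are applied to one small tile
while it is still in cache before moving on to the next tile.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Hypothetical cell-local update standing in for a real kernel.
inline float update_cell(float v) { return 0.5f * v; }

// Apply `steps` updates to a row-major width x height grid, one tile at a
// time. `tile` is the tile edge length in cells and must be greater than zero.
void tiled_update(std::vector<float>& grid,
                  std::size_t width, std::size_t height,
                  std::size_t tile, std::size_t steps) {
    for (std::size_t ty = 0; ty < height; ty += tile) {
        for (std::size_t tx = 0; tx < width; tx += tile) {
            // Clamp the tile to the grid so edge tiles may be smaller.
            const std::size_t y_end = std::min(ty + tile, height);
            const std::size_t x_end = std::min(tx + tile, width);
            // Run every step on this tile before touching the next one.
            for (std::size_t s = 0; s < steps; ++s) {
                for (std::size_t y = ty; y < y_end; ++y) {
                    for (std::size_t x = tx; x < x_end; ++x) {
                        grid[y * width + x] = update_cell(grid[y * width + x]);
                    }
                }
            }
        }
    }
}
```

Without tiling, each step would sweep the whole grid, and on a large system
every sweep would stream the data from main memory again.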
cuda/ has a fast GPGPU implementation tuned for Fermi-class GPUs, also using a
Our paper on these implementations: