This is a set of fast implementations of a simple Trotter-Suzuki solver.
(c) 2010-2012 by Carlos Bederián <>

reference/ has a naive CPU implementation for testing purposes.

sse/ has a fast CPU implementation using SSE intrinsics and a red-black split of
the matrices. This implementation is limited by memory bandwidth, so it performs
poorly on large systems that don't fit in cache.
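
As a rough illustration of what a red-black (checkerboard) split means, the
sketch below separates a row-major matrix into its "red" cells ((x + y) even)
and "black" cells ((x + y) odd), each stored contiguously, and merges them back.
It is not the layout code from sse/; the names RedBlack, split_red_black and
merge_red_black are hypothetical, and an even width is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical container: one matrix split into its two checkerboard colours.
struct RedBlack {
    std::size_t width, height;  // dimensions of the original matrix
    std::vector<float> red;     // cells where (x + y) is even
    std::vector<float> black;   // cells where (x + y) is odd
};

// Split a row-major matrix into red and black halves.
// Assumption: width is even, so every row holds width / 2 cells of each colour.
RedBlack split_red_black(const std::vector<float>& m,
                         std::size_t width, std::size_t height) {
    assert(width % 2 == 0);
    assert(m.size() == width * height);
    const std::size_t half = width / 2;
    RedBlack out{width, height,
                 std::vector<float>(half * height),
                 std::vector<float>(half * height)};
    for (std::size_t y = 0; y < height; ++y) {
        for (std::size_t x = 0; x < width; ++x) {
            const std::size_t idx = y * half + x / 2;
            if ((x + y) % 2 == 0)
                out.red[idx] = m[y * width + x];
            else
                out.black[idx] = m[y * width + x];
        }
    }
    return out;
}

// Inverse of split_red_black: rebuild the row-major matrix.
std::vector<float> merge_red_black(const RedBlack& rb) {
    const std::size_t half = rb.width / 2;
    std::vector<float> m(rb.width * rb.height);
    for (std::size_t y = 0; y < rb.height; ++y) {
        for (std::size_t x = 0; x < rb.width; ++x) {
            const std::size_t idx = y * half + x / 2;
            m[y * rb.width + x] =
                ((x + y) % 2 == 0) ? rb.red[idx] : rb.black[idx];
        }
    }
    return m;
}
```

With this kind of layout, all cells of one colour in a row sit next to each
other in memory, which is what makes them convenient to load into vector
registers.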

block/ adds cache tiling on top of the red-black SSE code. This code isn't fully
optimized (some overhead can be removed), but it scales better than the simple
SSE code for large systems.
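
One common form of cache tiling for iterative nearest-neighbour updates is to
load a tile together with a halo, run several steps on that small buffer, and
write back only the tile interior. The sketch below shows the idea on a 1D
periodic array with a simple averaging stencil standing in for the real kernel.
It is an assumption-laden illustration, not the code in block/: evolve_naive,
evolve_tiled and stencil_step are hypothetical names, and the array length is
assumed to be a multiple of the tile size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One sweep of a 3-point averaging stencil over in[1 .. size-2].
// The two end cells of `out` are not written.
static void stencil_step(const std::vector<double>& in,
                         std::vector<double>& out) {
    for (std::size_t i = 1; i + 1 < in.size(); ++i)
        out[i] = 0.5 * in[i] + 0.25 * (in[i - 1] + in[i + 1]);
}

// Untiled reference: `steps` full sweeps over the periodic array.
std::vector<double> evolve_naive(std::vector<double> u, std::size_t steps) {
    const std::size_t n = u.size();
    std::vector<double> next(n);
    for (std::size_t s = 0; s < steps; ++s) {
        for (std::size_t i = 0; i < n; ++i)
            next[i] = 0.5 * u[i] + 0.25 * (u[(i + n - 1) % n] + u[(i + 1) % n]);
        u.swap(next);
    }
    return u;
}

// Tiled version: each tile is loaded with a halo of `steps` cells per side,
// evolved locally for `steps` sweeps, and only its interior is stored.
std::vector<double> evolve_tiled(const std::vector<double>& u,
                                 std::size_t steps, std::size_t tile) {
    const std::size_t n = u.size();
    assert(tile > 0 && n % tile == 0);
    std::vector<double> result(n);
    std::vector<double> a(tile + 2 * steps), b(tile + 2 * steps);
    for (std::size_t start = 0; start < n; start += tile) {
        // Load tile plus halo, wrapping around the periodic boundary.
        for (std::size_t j = 0; j < a.size(); ++j)
            a[j] = u[(start + j + n - steps % n) % n];
        // Each sweep invalidates one more cell at each end of the buffer;
        // after `steps` sweeps exactly the tile interior is still correct.
        for (std::size_t s = 0; s < steps; ++s) {
            stencil_step(a, b);
            a.swap(b);
        }
        for (std::size_t j = 0; j < tile; ++j)
            result[start + j] = a[steps + j];
    }
    return result;
}
```

The halo cells are recomputed by neighbouring tiles, which is one source of the
kind of overhead a tiled scheme pays in exchange for touching main memory less
often.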

cuda/ has a fast GPGPU implementation tuned for Fermi-class GPUs, also using a
tiling strategy.

Our paper on these implementations: