triNNity is a C++ library implementing a collection of CNN primitives. Supported platforms include x86_64 and aarch64.

How do I get set up?

All you need is a C++ compiler. The library has been tested with icpc, clang++, and g++.

A PKGBUILD file is included for Arch Linux. To build the package, run make package. You can then install the library system-wide with pacman -U, passing the generated .tar.xz file.


Please see the triNNity-demos project for examples of several popular CNNs.

Library Structure

Each module in the library exports both a low-level and a high-level interface. The low-level interface exposes the library operations as template functions, while the high-level interface exposes them as Layer objects. The principal difference is that the low-level interface performs no memory management at all, while the high-level interface manages your kernel buffers for you.
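As an illustration, the split might look like this (a hypothetical sketch only: the function and constructor names below are placeholders, not the library's actual API; consult the module headers for the real signatures):

```cpp
// Low-level interface: template functions, caller owns every buffer.
// conv2d<...> is a placeholder name, for illustration only.
float *input  = new float[channels * height * width];
float *kernel = new float[filters * channels * k * k];
float *output = new float[filters * height * width];
// triNNity::dense::cpu::conv2d<...>(input, kernel, output, ...);
// The caller must also free input, kernel, and output.

// High-level interface: a Layer object manages its kernel buffers.
// ConvLayer is likewise a placeholder name.
// triNNity::layer::ConvLayer conv(...);
```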

To use either interface, simply #include the relevant header file - for the low-level interface this is <triNNity/module/mcmk.h>, and for the high-level interface it is <triNNity/module/layer.h>. You can use more than one module at once - for example, if you are building a mixed dense-sparse network on CPU using the high-level interface, simply do:

#include <triNNity/dense/cpu/layer.h>
#include <triNNity/sparse/cpu/layer.h>

We use namespaces to logically collect all of the library operations. For example, to use the low-level interface for the dense/cpu module, say using namespace triNNity::dense::cpu. To use the high-level interface, say using namespace triNNity::layer. You can of course use both interfaces at once, if you would like to use high-level code to implement some parts of your network, but need the precision of the low-level interface in other parts.
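Putting the pieces together, a translation unit that mixes both interfaces for the dense/cpu module might begin like this (a sketch; the header paths are the module instantiations of the pattern above):

```cpp
#include <triNNity/dense/cpu/mcmk.h>   // low-level template functions
#include <triNNity/dense/cpu/layer.h>  // high-level Layer objects

using namespace triNNity::dense::cpu;  // bring low-level names into scope
using namespace triNNity::layer;       // bring high-level names into scope
```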

Available Modules

  • dense/cpu
  • dense/cuda
  • dense/opencl
  • sparse/cpu
  • spectral/cpu

Using a GPU/Accelerator

We support the use of GPUs and other accelerator devices. When using such devices, offload becomes an important concern: to maximize performance, the copying of data between the host and the device usually needs to be minimized.

We support two modes for offload of data to accelerators. The first mode simply pushes and pulls arrays to and from the device whenever necessary. This is useful when prototyping an application, because no changes to the code are necessary for the CPU version to make use of the accelerator. You can use this mode by creating Layer objects from a cpu module (e.g. dense/cpu), while specifying that the work should be done on the accelerator (e.g. with #define TRINNITY_USE_CUBLAS_GEMM).
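In this first mode the only source change is the macro definition before the include (the macro name is taken from the text above; the header path is the dense/cpu high-level header):

```cpp
// Mode 1: Layer objects come from a cpu module, but the GEMM work is
// offloaded; the library copies arrays to and from the device as needed.
#define TRINNITY_USE_CUBLAS_GEMM       // request cuBLAS-backed GEMM
#include <triNNity/dense/cpu/layer.h>
```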

The second mode assumes that your data lives on the accelerator device. For this mode, simply switch from using a cpu module to the corresponding module for your accelerator (e.g. from dense/cpu to dense/cuda). All Layer objects from accelerator modules presume that their inputs and outputs are stored in device memory, and allocate all their intermediate buffers in device memory.
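At the source level, then, migrating to the second mode is a one-line change of module (a sketch):

```cpp
// Mode 2: data lives in device memory; use the accelerator module directly.
// #include <triNNity/dense/cpu/layer.h>  // before: host-resident data
#include <triNNity/dense/cuda/layer.h>    // after: device-resident data
```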