magma_zhetrd_mgpu contains a hard-coded matrix size below which a CPU BLAS function is called instead of the GPU implementation. The threshold is n=3000, which we found to severely hurt performance in a case with n ~ 2500.
In general, I would suggest making this value configurable, ideally in a way controlled by the calling code (at its compile time or at run time) rather than fixed when MAGMA itself is compiled. The reasoning is that there is no one-size-fits-all value: a large value may work well when a single GPU is paired with a potent CPU and a parallel BLAS, but GPU compute nodes often come with less potent CPUs, for which a medium value is better. The case described above falls into the latter category (2x 12-core Skylake + 4x P100).
In some cases, however, one may intentionally use a non-parallel CPU BLAS (or disable nested OpenMP), for example when CPU threads are used for some precomputation and each of them then offloads BLAS to the GPU. In that case the cutoff should be very small, or 0.
A quick fix might be to set the value (and likewise the default in a configurable scenario) to n=512, which appears to be a chunk size used by MAGMA here.
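To make the suggestion concrete, here is a rough sketch of what a caller-controlled cutoff could look like. All names in it (hetrd_set_cpu_cutoff, hetrd_get_cpu_cutoff, hetrd_use_cpu_path, the HETRD_CPU_CUTOFF environment variable) are made up for illustration and are not existing MAGMA API. The default of 512 is just the value proposed above, and the lookup order (explicit setter, then environment, then default) is one possible design, not necessarily the right one for MAGMA.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical names throughout; none of this is existing MAGMA API. */
#define HETRD_DEFAULT_CPU_CUTOFF 512L

/* -1 means "not set by the calling code". */
static long g_cpu_cutoff = -1;

/* Run-time override by the calling code; a negative value resets it. */
void hetrd_set_cpu_cutoff(long n)
{
    g_cpu_cutoff = (n < 0) ? -1 : n;
}

/* Resolve the cutoff: explicit setter first, then the (hypothetical)
 * environment variable HETRD_CPU_CUTOFF, then the built-in default. */
long hetrd_get_cpu_cutoff(void)
{
    if (g_cpu_cutoff >= 0)
        return g_cpu_cutoff;

    const char *s = getenv("HETRD_CPU_CUTOFF");
    if (s != NULL && *s != '\0') {
        char *end = NULL;
        errno = 0;
        long v = strtol(s, &end, 10);
        /* Accept only a complete, non-negative integer. */
        if (errno == 0 && *end == '\0' && v >= 0)
            return v;
    }
    return HETRD_DEFAULT_CPU_CUTOFF;
}

/* Matrices strictly smaller than the cutoff take the CPU BLAS path,
 * so a cutoff of 0 disables the CPU path entirely. */
int hetrd_use_cpu_path(long n)
{
    return n < hetrd_get_cpu_cutoff();
}
```

With something like this, the use case with many CPU threads offloading to the GPU would call hetrd_set_cpu_cutoff(0) once at startup, while a node with a strong CPU and a parallel BLAS could raise the value without rebuilding MAGMA.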