This PR adds GlobalArray, which uses managed memory (cudaMallocManaged) and should replace GPUArray.
I typed a lot of information into this PR description but accidentally closed the browser tab. I will add it back as I move forward. For now, this PR serves as a progress tracker.
ChangeLog
Support multi-GPU execution on dense nodes using CUDA managed memory. Execute with the --gpu=0,1,..,n-1 command line option to run on the first n GPUs (Pascal and above).
Node-local acceleration is implemented for a subset of kernels. Performance improvements may vary.
Improvements are only expected with NVLINK hardware. Use MPI when NVLINK is not available.
Combine the --gpu=.. command line option with mpirun to execute on many dense nodes.
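
For reviewers less familiar with managed memory, here is a minimal standalone sketch of the idea behind node-local multi-GPU execution. It is not the GlobalArray implementation: the kernel name `scale`, the array size, and the even split of the array across devices are assumptions made for illustration only. It allocates one buffer with cudaMallocManaged and lets each visible GPU update its own slice.

```cuda
#include <algorithm>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: multiply the elements in [begin, end) by a factor.
__global__ void scale(float *data, size_t begin, size_t end, float factor)
{
    size_t i = begin + blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < end)
        data[i] *= factor;
}

int main()
{
    int n_gpus = 0;
    if (cudaGetDeviceCount(&n_gpus) != cudaSuccess || n_gpus == 0)
    {
        std::printf("no CUDA device found\n");
        return 1;
    }

    const size_t N = 1 << 20; // illustrative size
    float *data = nullptr;

    // One allocation that is addressable from the host and from every GPU.
    if (cudaMallocManaged(&data, N * sizeof(float)) != cudaSuccess)
    {
        std::printf("cudaMallocManaged failed\n");
        return 1;
    }

    // Initialize on the host through the same pointer.
    for (size_t i = 0; i < N; ++i)
        data[i] = 1.0f;

    // Assumption: split the array evenly, one contiguous slice per GPU.
    const size_t chunk = (N + n_gpus - 1) / n_gpus;
    const unsigned int block = 256;
    for (int dev = 0; dev < n_gpus; ++dev)
    {
        const size_t begin = (size_t)dev * chunk;
        const size_t end = std::min(begin + chunk, N);
        if (begin >= end)
            continue;
        cudaSetDevice(dev);
        const unsigned int grid = (unsigned int)((end - begin + block - 1) / block);
        scale<<<grid, block>>>(data, begin, end, 2.0f);
    }

    // Wait for every device before the host touches the data again.
    for (int dev = 0; dev < n_gpus; ++dev)
    {
        cudaSetDevice(dev);
        cudaDeviceSynchronize();
    }

    std::printf("first = %f, last = %f\n", data[0], data[N - 1]);
    cudaFree(data);
    return 0;
}
```

Each device writes only to its own slice here, so there is no page contention between GPUs. Kernels that read data owned by another GPU pull pages across the interconnect, which is why improvements are only expected with NVLINK.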