Error running HIP/ROCm examples: "hipErrorNoBinaryForGpu: Unable to find code object for all current devices!"

Over at least the past two days, we have seen the following error when running hip_vecadd (in upcxx:example/gpu_vecadd) and the Kokkos examples (in upcxx-extras:examples/kokkos_{3dhalo,montecarlo}) on a system with AMD GPUs:

hipErrorNoBinaryForGpu: Unable to find code object for all current devices!

This is an indication that hipcc did not generate GPU kernels for the GPU architecture(s) detected at run time.

Under "ideal conditions", hipcc either honors the environment variable HCC_AMDGPU_TARGET or uses a helper program to detect the GPUs on the compilation host. Either way, the proper code generation happens entirely transparently. However, the helper program fails consistently when the compilation host has no GPU (or the wrong GPU). What we saw recently was a transient condition in which a user monopolizing the GPU on a login node seems to have prevented the helper from probing the GPU. This paragraph is background only; it is information I don't want our end-users to be burdened with.

This issue is a three-part task:

1. Determine "best practices" for configuring an explicit AMD GPU architecture when building (at least) hip_vecadd and the two Kokkos examples.
2. Update the Makefiles as necessary to implement these best practices.
3. Update documentation to share these best practices with end-users.

My thoughts:

While there is still an issue to be worked out, I am hoping that the advice for the Kokkos examples will be the same for Nvidia and AMD GPUs: set KOKKOS_ARCH. An undesired linker interaction that results from doing so is currently being discussed in a Slack thread.

In the Makefile for gpu_vecadd, the Nvidia case uses the NVCCARCH and NVCCARCH_FLAGS environment variables. I imagine we will deploy something similar, such as HIPCCARCH, but this needs some discussion (in particular, do we need two variables?).
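To make the single-variable option concrete, here is a minimal shell sketch of how a HIPCCARCH value could be expanded into compiler flags. Everything in it is an assumption for discussion: HIPCCARCH is only the name floated above, the function name hipcc_arch_flags is hypothetical, the comma-separated list format is my own choice, and it presumes a hipcc recent enough to accept --offload-arch (older releases spelled this --amdgpu-target).

```shell
# Hypothetical helper: expand a comma-separated list of AMD GPU
# architectures (e.g. "gfx906,gfx90a") into hipcc flags.
# Takes the list from $1, falling back to the proposed HIPCCARCH variable.
# Assumes hipcc accepts --offload-arch=<arch>, one flag per architecture.
hipcc_arch_flags() {
  local archs="${1:-${HIPCCARCH:-}}"
  local flags="" arch
  local IFS=','            # split the list on commas only
  for arch in $archs; do
    if [ -n "$arch" ]; then
      flags="${flags:+$flags }--offload-arch=$arch"
    fi
  done
  printf '%s\n' "$flags"   # empty output means "let hipcc auto-detect"
}
```

A build rule could then place the output of hipcc_arch_flags on the hipcc command line. When HIPCCARCH is unset, the output is empty and hipcc falls back to its current auto-detection behavior, so a second variable would only be needed if we want users to pass arbitrary extra flags.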

I want to note that the environment approach for gpu_vecadd with nvcc is not documented in the corresponding README.md, but is documented with the cannon and jac3d examples in upcxx-extras. I think that documentation should be cloned to example/gpu_vecadd/README.md as part of this issue as well (unless there is a reason not to that I am missing).

Note that this task spans both the upcxx and upcxx-extras repos, but without any dependency between the two.
