Kokkos_3dhalo example crashes when using multiple HIP streams

Issue #577 resolved
Daniel Waters created an issue

The kokkos_3dhalo example uses a different Kokkos::ExecutionSpace for the computation kernel of each surface in the domain. Unless KOKKOS_ENABLE_DEBUG is defined, when the application is compiled for GPU execution it creates each of these spaces within a separate CUDA/HIP stream. On OLCF’s Crusher, we’ve noticed that the application crashes with the following call stack:

[1] #0  0x00007fffe6e5dbc9 in ?? () from /opt/rocm-5.1.0/lib/libhsa-runtime64.so.1
[1] #1  0x00007fffe6e5da9a in ?? () from /opt/rocm-5.1.0/lib/libhsa-runtime64.so.1
[1] #2  0x00007fffe6e51a69 in ?? () from /opt/rocm-5.1.0/lib/libhsa-runtime64.so.1
[1] #3  0x00007fffe955e03b in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[1] #4  0x00007fffe954d59a in ?? () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[1] #5  0x00007fffe939f419 in hipDeviceSynchronize () from /opt/rocm-5.1.0/lib/libamdhip64.so.5
[1] #6  0x000000000046f7ce in Kokkos::Experimental::HIP::impl_static_fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
[1] #7  0x000000000049cfbd in Kokkos::fence(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) ()
[1] #8  0x000000000040cd61 in System::timestep (this=0x7fffffff6740) at upcxx_heat_conduction.cpp:417
[1] #9  0x000000000040a47b in main (argc=1, argv=0x7fffffff6de8) at upcxx_heat_conduction.cpp:705

Currently, the plan is to resolve this by running all surface computation kernels in the same execution space (and hence the same stream) when UPC++ is configured with HIP enabled. All seven execution spaces still exist in the application, so that computation of the interior domain’s kernel can overlap with communication of the six surface cell data buffers.
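
To make the intended kernel-to-space mapping concrete, here is a minimal sketch in plain C++. It is not taken from the example's source: the names kNumKernels and assign_spaces, the slot numbering (slot 0 for the interior kernel, slots 1 to 6 for the surfaces), and the choice of space 1 as the shared surface space are all assumptions made for illustration. It only models which execution space instance each kernel would launch on, and does not call Kokkos or HIP.

```cpp
#include <array>
#include <cstddef>

// Hypothetical layout: slot 0 is the interior kernel,
// slots 1..6 are the six surface kernels.
const std::size_t kNumKernels = 7;

// For each kernel slot, return the index of the execution space
// instance (and hence the stream) that the kernel launches on.
// share_surface_space == false: one space per kernel (original behavior).
// share_surface_space == true:  all surface kernels use space 1
//                               (the workaround planned for HIP builds).
inline std::array<std::size_t, kNumKernels>
assign_spaces(bool share_surface_space) {
    std::array<std::size_t, kNumKernels> space{};
    for (std::size_t k = 0; k < kNumKernels; ++k) {
        const bool is_surface = (k > 0);
        space[k] = (share_surface_space && is_surface) ? 1 : k;
    }
    return space;
}
```

With share_surface_space set to true, the interior kernel keeps its own space while every surface kernel is serialized on a single one, so at most two streams carry kernels at any time. All seven instances can still be constructed, as described above.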
