Problem with rocm component within spack
I’m working to implement the rocm component within the papi spack package. I have found some odd behavior: if I set $HSA_TOOLS_LIB correctly to the librocprofiler64.so library, papi_component_avail fails to report the available rocm counters and the rocm component is disabled. If I set $HSA_TOOLS_LIB incorrectly (so that the library fails to load), papi_component_avail does report the rocm counters and activates the rocm component.
Failure when using correct library name:
$ HSA_TOOLS_LIB=librocprofiler64.so papi_component_avail
components/rocm/linux-rocm.c:501 error: function hsa_init failed with error 4096.
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version : 6.0.0.1
Operating system : Linux 3.10.0-1127.el7.x86_64
Vendor string and code : AuthenticAMD (2, 0x2)
Model string and code : AMD EPYC 7413 24-Core Processor (1, 0x1)
CPU revision : 1.000000
CPUID : Family/Model/Stepping 25/1/1, 0x19/0x01/0x01
CPU Max MHz : 2650
CPU Min MHz : 1500
Total cores : 96
SMT threads per core : 2
Cores per socket : 24
Sockets : 2
Cores per NUMA region : 48
NUMA regions : 2
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
\-> Disabled: Unknown libpfm4 related error
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
\-> Disabled: No uncore PMUs or events found
Name: example A simple example component
Name: rocm GPU events and metrics via AMD ROCm-PL API
\-> Disabled:
Active components:
Name: example A simple example component
Native: 4, Preset: 0, Counters: 3
--------------------------------------------------------------------------------
Success with incorrect library name:
$ HSA_TOOLS_LIB=fakename.so papi_component_avail
Tool lib "fakename.so" failed to load.
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version : 6.0.0.1
Operating system : Linux 3.10.0-1127.el7.x86_64
Vendor string and code : AuthenticAMD (2, 0x2)
Model string and code : AMD EPYC 7413 24-Core Processor (1, 0x1)
CPU revision : 1.000000
CPUID : Family/Model/Stepping 25/1/1, 0x19/0x01/0x01
CPU Max MHz : 2650
CPU Min MHz : 1500
Total cores : 96
SMT threads per core : 2
Cores per socket : 24
Sockets : 2
Cores per NUMA region : 48
NUMA regions : 2
Running in a VM : no
Number Hardware Counters : 0
Max Multiplex Counters : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
\-> Disabled: Unknown libpfm4 related error
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
\-> Disabled: No uncore PMUs or events found
Name: example A simple example component
Name: rocm GPU events and metrics via AMD ROCm-PL API
Active components:
Name: example A simple example component
Native: 4, Preset: 0, Counters: 3
Name: rocm GPU events and metrics via AMD ROCm-PL API
Native: 636, Preset: 0, Counters: 636
--------------------------------------------------------------------------------
I have also found (using strace) that in the failure example, papi_component_avail does indeed load the correct librocprofiler64.so library (probably one of the rocm libraries loads it), but then papi itself tries to load it again. I don’t know if this is a hint at the source of the problem.
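One way to quantify the double load is to count the successful open calls for the library in the strace output. The following is only a sketch: the helper name count_lib_opens is made up for illustration, and it assumes strace was run with -f and -e trace=open,openat so that each call appears on a single line.

```shell
# Hypothetical helper (not part of PAPI or ROCm): reads strace output on stdin
# and prints how many open()/openat() calls on the named library succeeded.
count_lib_opens() {
  # Keep only open/openat lines (optionally prefixed by "[pid N] " from strace -f),
  # keep those that mention the library, then count the ones that did not return -1.
  grep -E '^(\[pid +[0-9]+\] )?(open|openat)\(' \
    | grep -F -- "$1" \
    | grep -vc -E '= -1 ' || true
}

# Intended use (assumes strace and papi_component_avail are on PATH):
#   HSA_TOOLS_LIB=librocprofiler64.so \
#     strace -f -e trace=open,openat papi_component_avail 2>&1 \
#     | count_lib_opens librocprofiler64.so
```

A count above 1 in the failing run would be consistent with the library being opened a second time, although it does not by itself show that the second load causes the hsa_init failure.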
Comments (3)
reporter @Tony_ICL
- changed status to resolved
I think we can mark this as resolved because we now understand the issue, as described in detail in the following document:
https://docs.google.com/document/d/1oXppmZJjR77NDv-DqXvLl5AvqQVNCTnmamG5YKk3lDs/edit#
To reproduce my experiments, you would need to build papi within my spack release. It is on github at https://github.com/G-Ragghianti/spack in the papi_rocm branch. Once you check it out, you would build it like this: