Problem with rocm component within spack

Issue #86 resolved
Gerald Ragghianti created an issue

I’m working to implement the rocm component within the papi spack package, and I have discovered an odd behavior. If I set $HSA_TOOLS_LIB correctly to the librocprofiler64.so library, papi_component_avail reports the rocm component as disabled and lists no rocm counters. If I set $HSA_TOOLS_LIB incorrectly (thus preventing the library from loading), papi_component_avail does report the rocm counters and activates the rocm component.

Failure when using correct library name:

$ HSA_TOOLS_LIB=librocprofiler64.so papi_component_avail
components/rocm/linux-rocm.c:501 error: function hsa_init failed with error 4096.
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 6.0.0.1
Operating system         : Linux 3.10.0-1127.el7.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7413 24-Core Processor (1, 0x1)
CPU revision             : 1.000000
CPUID                    : Family/Model/Stepping 25/1/1, 0x19/0x01/0x01
CPU Max MHz              : 2650
CPU Min MHz              : 1500
Total cores              : 96
SMT threads per core     : 2
Cores per socket         : 24
Sockets                  : 2
Cores per NUMA region    : 48
NUMA regions             : 2
Running in a VM          : no
Number Hardware Counters : 0
Max Multiplex Counters   : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
   \-> Disabled: Unknown libpfm4 related error
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
Name:   example                 A simple example component
Name:   rocm                    GPU events and metrics via AMD ROCm-PL API
   \-> Disabled: 

Active components:
Name:   example                 A simple example component
                                Native: 4, Preset: 0, Counters: 3


--------------------------------------------------------------------------------

Success with incorrect library name:

$ HSA_TOOLS_LIB=fakename.so papi_component_avail
Tool lib "fakename.so" failed to load.
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 6.0.0.1
Operating system         : Linux 3.10.0-1127.el7.x86_64
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD EPYC 7413 24-Core Processor (1, 0x1)
CPU revision             : 1.000000
CPUID                    : Family/Model/Stepping 25/1/1, 0x19/0x01/0x01
CPU Max MHz              : 2650
CPU Min MHz              : 1500
Total cores              : 96
SMT threads per core     : 2
Cores per socket         : 24
Sockets                  : 2
Cores per NUMA region    : 48
NUMA regions             : 2
Running in a VM          : no
Number Hardware Counters : 0
Max Multiplex Counters   : 384
Fast counter read (rdpmc): no
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
   \-> Disabled: Unknown libpfm4 related error
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
   \-> Disabled: No uncore PMUs or events found
Name:   example                 A simple example component
Name:   rocm                    GPU events and metrics via AMD ROCm-PL API

Active components:
Name:   example                 A simple example component
                                Native: 4, Preset: 0, Counters: 3

Name:   rocm                    GPU events and metrics via AMD ROCm-PL API
                                Native: 636, Preset: 0, Counters: 636


--------------------------------------------------------------------------------
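The visible difference between the two runs above is whether rocm appears under "Active components". A minimal sketch of a scripted check for that, assuming the output layout shown in this report; the function name rocm_is_active is hypothetical and not part of PAPI:

```shell
#!/bin/sh
# rocm_is_active: hypothetical helper, not part of PAPI.
# Reads papi_component_avail output on stdin and succeeds (exit 0) only if
# a "Name:   rocm" line appears after the "Active components:" header.
rocm_is_active() {
    awk '
        /^Active components:/ { active = 1; next }
        active && $1 == "Name:" && $2 == "rocm" { found = 1 }
        END { exit(found ? 0 : 1) }
    '
}

# Example use (assumes papi_component_avail is on PATH):
#   HSA_TOOLS_LIB=librocprofiler64.so papi_component_avail 2>&1 | rocm_is_active \
#       && echo "rocm active" || echo "rocm not active"
```

Checking only the section after the "Active components:" header matters here, because rocm is listed under "Compiled-in components" in both the failing and the working run.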

I have also found (using strace) that in the failure example, papi_component_avail does indeed load the correct librocprofiler64.so library (probably one of the rocm libraries loads it), but then papi itself tries to load it again. I don’t know whether this points to the source of the problem.
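One way to repeat the strace observation is to trace the file-open calls and count how often the library is opened successfully. This is only a sketch, assuming strace is installed and that the trace uses the usual `open(...) = N` / `openat(...) = N` line format; the helper name count_lib_opens is hypothetical:

```shell
#!/bin/sh
# count_lib_opens: hypothetical helper for reading an strace log.
# Reads strace output on stdin and prints how many open/openat calls on a
# path containing the given library name did not fail (result is not -1).
count_lib_opens() {
    lib=$1
    grep -E 'open(at)?\(' | grep -F "$lib" | grep -vc -e '= -1 '
}

# Example use (assumes strace and papi_component_avail are on PATH):
#   HSA_TOOLS_LIB=librocprofiler64.so \
#       strace -f -e trace=open,openat papi_component_avail 2>&1 \
#       | count_lib_opens librocprofiler64.so
```

A count above one would be consistent with the library file being opened more than once, but it does not show which component requested each open.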

Comments (3)

  1. Gerald Ragghianti reporter

    To reproduce my experiments, you would need to build papi within my spack release. It is on GitHub at https://github.com/G-Ragghianti/spack in the papi_rocm branch. Once you have checked it out, build it like this:

    source spack/share/spack/setup-env.sh
    module load gcc/7.3.0
    spack compiler find
    spack install papi+rocm amdgpu_target=gfx900
    spack load papi
    papi_component_avail
    
