ROCm SMI component segfaults on AMD Laptop

Issue #134 new
Bert Wesarg created an issue

I’ve installed PAPI 7.0.1 on my AMD laptop with ROCm 5.5 RC5 (“5.5.0.0-50-9e3718c”), but running papi_component_avail segfaults:

Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version             : 7.0.1.0
Operating system         : Linux 5.15.0-60-generic
Vendor string and code   : AuthenticAMD (2, 0x2)
Model string and code    : AMD Ryzen 7 PRO 4750U with Radeon Graphics (96, 0x60)
CPU revision             : 1.000000
CPUID                    : Family/Model/Stepping 23/96/1, 0x17/0x60/0x01
CPU Max MHz              : 1700
CPU Min MHz              : 1400
Total cores              : 16
SMT threads per core     : 2
Cores per socket         : 8
Sockets                  : 1
Cores per NUMA region    : 16
NUMA regions             : 1
Running in a VM          : no
Number Hardware Counters : 5
Max Multiplex Counters   : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------

Compiled-in components:
Name:   perf_event              Linux perf_event CPU counters
Name:   perf_event_uncore       Linux perf_event CPU uncore and northbridge
Name:   rocm                    GPU events and metrics via AMD ROCm-PL API
   \-> Disabled: rocprofiler_iterate_info(), Translate(), ImportMetrics: bad block name 'GRBM', GFXIP is not supported(gfx90c)

Name:   rocm_smi                AMD GPU System Management Interface via rocm_smi_lib
Segmentation fault (core dumped)

Here is a backtrace:

#0  0x00007ffff763850d in rsmi_func_iter_value_get () from /opt/rocm/lib/librocm_smi64.so
#1  0x00005555556730d0 in get_ntv_events_count () at components/rocm_smi/rocs.c:1083
#2  init_event_table () at components/rocm_smi/rocs.c:910
#3  rocs_init () at components/rocm_smi/rocs.c:365
#4  0x000055555566af25 in _rocm_smi_init_private () at components/rocm_smi/linux-rocm-smi.c:106
#5  0x000055555566b089 in _rocm_smi_check_n_initialize () at components/rocm_smi/linux-rocm-smi.c:48
#6  _rocm_smi_ntv_enum_events (EventCode=0x7fffffffc114, modifier=1) at components/rocm_smi/linux-rocm-smi.c:353
#7  0x000055555564f1ba in PAPI_enum_cmp_event (EventCode=EventCode@entry=0x7fffffffc174, modifier=modifier@entry=1, cidx=cidx@entry=3) at papi.c:1957
#8  0x000055555564e41f in force_cmp_init (cid=3) at papi_component_avail.c:196
#9  main (argc=<optimized out>, argv=<optimized out>) at papi_component_avail.c:122

Extracting the code around rocs.c:1083 into its own file, I can see that the return code from rsmi_dev_supported_func_iterator_open is not RSMI_STATUS_SUCCESS, but it is never checked:

#include <stdio.h>
#include <stdlib.h>

#include <rocm_smi/rocm_smi.h>

int
main(int ac, char *av[])
{
    rsmi_status_t status = rsmi_init(0);
    if (status != RSMI_STATUS_SUCCESS) abort();

    uint32_t device_count;
    status = rsmi_num_monitor_devices(&device_count);
    if (status != RSMI_STATUS_SUCCESS) abort();

    uint32_t dev;
    for (dev = 0; dev < device_count; ++dev) {
        /* iterator over the rsmi functions supported by this device */
        rsmi_func_id_iter_handle_t iter;
        status = rsmi_dev_supported_func_iterator_open(dev, &iter);
        if (status != RSMI_STATUS_SUCCESS)
        { fprintf(stderr, "%d\n", status); continue; }

        while (1) {
            /* name of the current supported function */
            rsmi_func_id_value_t v_name;
            status = rsmi_func_iter_value_get(iter, &v_name);
            if (status != RSMI_STATUS_SUCCESS) abort();

            /* iterator over the variants of the current function,
             * mirroring what rocs.c does */
            rsmi_func_id_iter_handle_t var_iter;
            status = rsmi_dev_supported_variant_iterator_open(iter, &var_iter);
            if (status != RSMI_STATUS_SUCCESS) abort();

            status = rsmi_func_iter_next(iter);
            if (status == RSMI_STATUS_NO_DATA) {
                break;
            }
        }

        status = rsmi_dev_supported_func_iterator_close(&iter);
        if (status != RSMI_STATUS_SUCCESS) abort();
    }

    rsmi_shut_down();

    return 0;
}
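
The standalone test above can be built against rocm_smi_lib roughly like this (assuming the ROCm installation under /opt/rocm, as in the backtrace; the source file name is just a placeholder, and the header/library paths may differ between ROCm versions):

gcc -I/opt/rocm/include -o rsmi_iter_test rsmi_iter_test.c -L/opt/rocm/lib -lrocm_smi64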

The continue after the failed rsmi_dev_supported_func_iterator_open lets this code run successfully, but a similar patch to PAPI results in a different abort.

Patch:

diff --git i/src/components/rocm_smi/rocs.c w/src/components/rocm_smi/rocs.c
index 92f09d1cf..05a9d8bf1 100644
--- i/src/components/rocm_smi/rocs.c
+++ w/src/components/rocm_smi/rocs.c
@@ -1079,6 +1079,9 @@ get_ntv_events_count(void)
     int32_t dev;
     for (dev = 0; dev < device_count; ++dev) {
         status = rsmi_dev_supported_func_iterator_open_p(dev, &iter);
+        if (status == RSMI_STATUS_SUCCESS) {
+            continue;
+        }
         while (1) {
             status = rsmi_func_iter_value_get_p(iter, &v_name);
             status = rsmi_dev_supported_variant_iterator_open_p(iter, &var_iter);

Output:

Name:   rocm_smi                AMD GPU System Management Interface via rocm_smi_lib
malloc(): invalid size (unsorted)
Aborted (core dumped)

Backtrace:

#0  __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1  0x00007ffff7dc2859 in __GI_abort () at abort.c:79
#2  0x00007ffff7e2d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f57298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3  0x00007ffff7e352fc in malloc_printerr (str=str@entry=0x7ffff7f59a50 "malloc(): invalid size (unsorted)") at malloc.c:5347
#4  0x00007ffff7e380b4 in _int_malloc (av=av@entry=0x7ffff7f8cb80 <main_arena>, bytes=bytes@entry=24) at malloc.c:3736
#5  0x00007ffff7e3bb95 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3428
#6  0x000055555567446c in create_table_entry (entry=0x7fffffffc068, val=0x5555559a6bd8, key=0x5555559a6c40 "device_brand:device=0") at components/rocm_smi/htable.h:324
#7  htable_insert (in=0x5555559a6bd8, key=<optimized out>, handle=0x555555959f70) at components/rocm_smi/htable.h:106
#8  get_ntv_events (count=3, events=0x5555559a6920) at components/rocm_smi/rocs.c:1212
#9  init_event_table () at components/rocm_smi/rocs.c:919
#10 rocs_init () at components/rocm_smi/rocs.c:365
#11 0x000055555566af25 in _rocm_smi_init_private () at components/rocm_smi/linux-rocm-smi.c:106
#12 0x000055555566b089 in _rocm_smi_check_n_initialize () at components/rocm_smi/linux-rocm-smi.c:48
#13 _rocm_smi_ntv_enum_events (EventCode=0x7fffffffc124, modifier=1) at components/rocm_smi/linux-rocm-smi.c:353
#14 0x000055555564f1ba in PAPI_enum_cmp_event (EventCode=EventCode@entry=0x7fffffffc184, modifier=modifier@entry=1, cidx=cidx@entry=3) at papi.c:1957
#15 0x000055555564e41f in force_cmp_init (cid=3) at papi_component_avail.c:196
#16 main (argc=<optimized out>, argv=<optimized out>) at papi_component_avail.c:122

I stopped investigating here.

Comments (6)

  1. Giuseppe Congiu

    @Bert Wesarg could you try replacing the == with != in your patch? Also, it looks like you have an integrated Radeon GPU. I have never tested the rocm_smi component with integrated Radeon GPUs; if you also have a dedicated AMD GPU, that might confuse the component. As far as I can see, running on MI200 causes no problems. However, you are right that the code should check every rsmi call.

  2. Bert Wesarg reporter

    You are correct that the patch looks broken. You are also correct that this is an integrated AMD GPU. But the component should nevertheless not segfault on such a machine, right?

  3. Giuseppe Congiu

    @Bert Wesarg yes, you are right. I copied that code from an AMD example, which was obviously not intended for production use. My bad. I will create a patch; a rough sketch of the direction is below.
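
    A rough, untested sketch of that direction (a standalone counting loop, not the actual rocs.c patch; the helper name is made up, and the PAPI bookkeeping and the variant/sub-variant levels are left out). It checks every rsmi call and skips devices whose function iterator cannot be opened, e.g. integrated GPUs:

    #include <rocm_smi/rocm_smi.h>

    /* Count the rsmi functions a device reports as supported, checking the
     * status of every rsmi call instead of assuming success. Returns -1 if
     * the function iterator cannot be opened for this device. */
    static int count_supported_functions(uint32_t dev, uint32_t *count)
    {
        *count = 0;

        rsmi_func_id_iter_handle_t iter;
        rsmi_status_t status = rsmi_dev_supported_func_iterator_open(dev, &iter);
        if (status != RSMI_STATUS_SUCCESS) {
            /* e.g. an integrated GPU that does not support the iterator */
            return -1;
        }

        while (1) {
            rsmi_func_id_value_t v_name;
            status = rsmi_func_iter_value_get(iter, &v_name);
            if (status == RSMI_STATUS_SUCCESS) {
                ++(*count);
            }

            status = rsmi_func_iter_next(iter);
            if (status != RSMI_STATUS_SUCCESS) {
                /* RSMI_STATUS_NO_DATA signals the end of the iteration;
                 * treat any other error as the end as well */
                break;
            }
        }

        rsmi_dev_supported_func_iterator_close(&iter);
        return 0;
    }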
