ROCm SMI component segfaults on AMD Laptop
I’ve installed PAPI 7.0.1 on my AMD laptop with ROCm 5.5 RC5 (“5.5.0.0-50-9e3718c”), but running papi_component_avail segfaults:
Available components and hardware information.
--------------------------------------------------------------------------------
PAPI version : 7.0.1.0
Operating system : Linux 5.15.0-60-generic
Vendor string and code : AuthenticAMD (2, 0x2)
Model string and code : AMD Ryzen 7 PRO 4750U with Radeon Graphics (96, 0x60)
CPU revision : 1.000000
CPUID : Family/Model/Stepping 23/96/1, 0x17/0x60/0x01
CPU Max MHz : 1700
CPU Min MHz : 1400
Total cores : 16
SMT threads per core : 2
Cores per socket : 8
Sockets : 1
Cores per NUMA region : 16
NUMA regions : 1
Running in a VM : no
Number Hardware Counters : 5
Max Multiplex Counters : 384
Fast counter read (rdpmc): yes
--------------------------------------------------------------------------------
Compiled-in components:
Name: perf_event Linux perf_event CPU counters
Name: perf_event_uncore Linux perf_event CPU uncore and northbridge
Name: rocm GPU events and metrics via AMD ROCm-PL API
\-> Disabled: rocprofiler_iterate_info(), Translate(), ImportMetrics: bad block name 'GRBM', GFXIP is not supported(gfx90c)
Name: rocm_smi AMD GPU System Management Interface via rocm_smi_lib
Segmentation fault (core dumped)
Here is a backtrace:
#0 0x00007ffff763850d in rsmi_func_iter_value_get () from /opt/rocm/lib/librocm_smi64.so
#1 0x00005555556730d0 in get_ntv_events_count () at components/rocm_smi/rocs.c:1083
#2 init_event_table () at components/rocm_smi/rocs.c:910
#3 rocs_init () at components/rocm_smi/rocs.c:365
#4 0x000055555566af25 in _rocm_smi_init_private () at components/rocm_smi/linux-rocm-smi.c:106
#5 0x000055555566b089 in _rocm_smi_check_n_initialize () at components/rocm_smi/linux-rocm-smi.c:48
#6 _rocm_smi_ntv_enum_events (EventCode=0x7fffffffc114, modifier=1) at components/rocm_smi/linux-rocm-smi.c:353
#7 0x000055555564f1ba in PAPI_enum_cmp_event (EventCode=EventCode@entry=0x7fffffffc174, modifier=modifier@entry=1, cidx=cidx@entry=3) at papi.c:1957
#8 0x000055555564e41f in force_cmp_init (cid=3) at papi_component_avail.c:196
#9 main (argc=<optimized out>, argv=<optimized out>) at papi_component_avail.c:122
Extracting the code around rocs.c:1083 into its own file, I can see that the return code from rsmi_dev_supported_func_iterator_open is not SUCCESS but is never checked:
#include <stdio.h>
#include <stdlib.h>
#include <rocm_smi/rocm_smi.h>

int
main(int ac, char *av[])
{
    rsmi_status_t status = rsmi_init(0);
    if (status != RSMI_STATUS_SUCCESS) abort();

    uint32_t device_count;
    status = rsmi_num_monitor_devices(&device_count);
    if (status != RSMI_STATUS_SUCCESS) abort();

    uint32_t dev;
    for (dev = 0; dev < device_count; ++dev) {
        rsmi_func_id_iter_handle_t iter;
        status = rsmi_dev_supported_func_iterator_open(dev, &iter);
        /* This is the check that is missing in rocs.c: skip the device
         * if no function iterator can be opened for it. */
        if (status != RSMI_STATUS_SUCCESS)
            { fprintf(stderr, "%d\n", status); continue; }
        while (1) {
            rsmi_func_id_value_t v_name;
            status = rsmi_func_iter_value_get(iter, &v_name);
            if (status != RSMI_STATUS_SUCCESS) abort();
            rsmi_func_id_iter_handle_t var_iter;
            status = rsmi_dev_supported_variant_iterator_open(iter, &var_iter);
            if (status != RSMI_STATUS_SUCCESS) abort();
            status = rsmi_func_iter_next(iter);
            if (status == RSMI_STATUS_NO_DATA) {
                break;
            }
        }
        status = rsmi_dev_supported_func_iterator_close(&iter);
        if (status != RSMI_STATUS_SUCCESS) abort();
    }
    rsmi_shut_down();
    return 0;
}
The continue after the failed iterator open lets this code run successfully, but a similar patch to PAPI results in a different abort.
Patch:
diff --git i/src/components/rocm_smi/rocs.c w/src/components/rocm_smi/rocs.c
index 92f09d1cf..05a9d8bf1 100644
--- i/src/components/rocm_smi/rocs.c
+++ w/src/components/rocm_smi/rocs.c
@@ -1079,6 +1079,9 @@ get_ntv_events_count(void)
int32_t dev;
for (dev = 0; dev < device_count; ++dev) {
status = rsmi_dev_supported_func_iterator_open_p(dev, &iter);
+ if (status == RSMI_STATUS_SUCCESS) {
+ continue;
+ }
while (1) {
status = rsmi_func_iter_value_get_p(iter, &v_name);
status = rsmi_dev_supported_variant_iterator_open_p(iter, &var_iter);
Output:
Name: rocm_smi AMD GPU System Management Interface via rocm_smi_lib
malloc(): invalid size (unsorted)
Aborted (core dumped)
Backtrace:
#0 __GI_raise (sig=sig@entry=6) at ../sysdeps/unix/sysv/linux/raise.c:50
#1 0x00007ffff7dc2859 in __GI_abort () at abort.c:79
#2 0x00007ffff7e2d26e in __libc_message (action=action@entry=do_abort, fmt=fmt@entry=0x7ffff7f57298 "%s\n") at ../sysdeps/posix/libc_fatal.c:155
#3 0x00007ffff7e352fc in malloc_printerr (str=str@entry=0x7ffff7f59a50 "malloc(): invalid size (unsorted)") at malloc.c:5347
#4 0x00007ffff7e380b4 in _int_malloc (av=av@entry=0x7ffff7f8cb80 <main_arena>, bytes=bytes@entry=24) at malloc.c:3736
#5 0x00007ffff7e3bb95 in __libc_calloc (n=<optimized out>, elem_size=<optimized out>) at malloc.c:3428
#6 0x000055555567446c in create_table_entry (entry=0x7fffffffc068, val=0x5555559a6bd8, key=0x5555559a6c40 "device_brand:device=0") at components/rocm_smi/htable.h:324
#7 htable_insert (in=0x5555559a6bd8, key=<optimized out>, handle=0x555555959f70) at components/rocm_smi/htable.h:106
#8 get_ntv_events (count=3, events=0x5555559a6920) at components/rocm_smi/rocs.c:1212
#9 init_event_table () at components/rocm_smi/rocs.c:919
#10 rocs_init () at components/rocm_smi/rocs.c:365
#11 0x000055555566af25 in _rocm_smi_init_private () at components/rocm_smi/linux-rocm-smi.c:106
#12 0x000055555566b089 in _rocm_smi_check_n_initialize () at components/rocm_smi/linux-rocm-smi.c:48
#13 _rocm_smi_ntv_enum_events (EventCode=0x7fffffffc124, modifier=1) at components/rocm_smi/linux-rocm-smi.c:353
#14 0x000055555564f1ba in PAPI_enum_cmp_event (EventCode=EventCode@entry=0x7fffffffc184, modifier=modifier@entry=1, cidx=cidx@entry=3) at papi.c:1957
#15 0x000055555564e41f in force_cmp_init (cid=3) at papi_component_avail.c:196
#16 main (argc=<optimized out>, argv=<optimized out>) at papi_component_avail.c:122
I stopped investigating here.
Comments (6)

- assigned issue to
- assigned issue to
- @Bert Wesarg could you try replacing the `==` with `!=` in your patch? Also, it looks like you have an integrated Radeon GPU. I never tested the rocm_smi component with integrated Radeon GPUs. If you also have a dedicated AMD GPU, that might confuse the component. As far as I can see, running on an MI200 gives no problem. However, you are right that the code should check every rsmi call.
- reporter: You are correct that the patch looks broken. You are also correct that this is an integrated AMD GPU. But the component should nevertheless not segfault on such a machine, right?
- @Bert Wesarg yes, you are right. I copied that code from an AMD example which was obviously not intended for production code. My bad. I will create a patch.
- @Bert Wesarg could you please give this https://bitbucket.org/icl/papi/pull-requests/474 a try and see if it fixes the issue you are experiencing? Thank you for reporting this Bert.