Unfriendly ZE kinds support when lacking the hardware

Issue #636 new
Paul Hargrove created an issue

I see that running a CUDA- or HIP-enabled test-memory_kinds on a system w/o Nvidia or AMD GPUs has graceful behavior, which includes a non-fatal error message (among other output):

cuInit() failed: CUDA_ERROR_NO_DEVICE

or

hipInit() failed: hipErrorNoDevice

However, the same is not currently true (in my testing at least) with ze kinds, where lack of the h/w yields a fatal error:

zeInit() failed: err=0x78000001
ZE API VERSION: 1.9
Found 0 ZE devices:
*** FATAL ERROR (proc 0):
//////////////////////////////////////////////////////////////////////
UPC++ ZE call failed:
 on process 0 (cgpu-1)
 at /home/pagoda1/phargrov/upcxx/src/./ze.cpp:29

zeInit(0)
  error=ZE_RESULT_ERROR_UNINITIALIZED: [Validation] driver is not initialized

ZE info:
zeInit() failed: err=0x78000001
ZE API VERSION: 1.9
Found 0 ZE devices:


To have UPC++ freeze during these errors so you can attach a debugger,
rerun the program with GASNET_FREEZE_ON_ERROR=1 in the environment.
//////////////////////////////////////////////////////////////////////

The relevant stack frames show this is a result of the ze_init() call from enumerate_ze_devices():

[0] #11 0x00005e7276655e5f in upcxx::detail::ze::ze_failed (res=ZE_RESULT_ERROR_UNINITIALIZED, file=0x5e7276b5c8a8 "/[REDACTED]/upcxx/src/./ze.cpp", line=29, expr=0x5e7276b5c89d "zeInit(0)", report_verbose=true) at /[REDACTED]/upcxx/src/./ze.cpp:353
[0] #12 0x00005e7276652f8d in (anonymous namespace)::ze_init (errors_return=false) at /[REDACTED]/upcxx/src/./ze.cpp:29
[0] #13 0x00005e727665b846 in (anonymous namespace)::enumerate_ze_devices<upcxx::ze_device::device_n()::<lambda()>::<lambda(upcxx::gpu_device::id_type, ...)> >(struct {...} &&) (fn=...) at /[REDACTED]/upcxx/src/./ze.cpp:56
[0] #14 0x00005e727665691a in operator() (__closure=0x7fff6d1eab3f) at /[REDACTED]/upcxx/src/./ze.cpp:418
[0] #15 0x00005e72766569ec in upcxx::ze_device::device_n () at /[REDACTED]/upcxx/src/./ze.cpp:424
[0] #16 0x00005e7276559482 in main () at /[REDACTED]/upcxx/test/memory_kinds.cpp:406

I think that usability, especially on a heterogeneous cluster, argues for updating ze kinds support to match the behavior demonstrated by the cuda and hip support. It seems unfortunate that level_zero is not distinguishing "init ran but found no devices" from simply "not initialized".
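
To make the request concrete, here is a rough sketch of the kind of handling I have in mind. It is not the actual ze.cpp code: the names `classify_init_result`, `InitOutcome` and `device_count_after_init` are hypothetical, and the result codes are passed in as plain integers so the snippet builds without the Level Zero headers. The only values assumed are 0 for success and the 0x78000001 (ZE_RESULT_ERROR_UNINITIALIZED) seen in the output above.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>

// Result codes as plain integers so this sketch builds without ze_api.h.
// 0x78000001 is the ZE_RESULT_ERROR_UNINITIALIZED value reported above;
// 0 is assumed to be the success code.
constexpr std::uint32_t kZeSuccess = 0x0;
constexpr std::uint32_t kZeErrorUninitialized = 0x78000001;

// Hypothetical three-way classification of a zeInit() result.
enum class InitOutcome { Ready, NoDevices, Fatal };

// Hypothetical helper: treat "uninitialized" as "no usable ZE hardware"
// rather than as a fatal error, mirroring the CUDA/HIP behavior.
inline InitOutcome classify_init_result(std::uint32_t res) {
  if (res == kZeSuccess) return InitOutcome::Ready;
  if (res == kZeErrorUninitialized) return InitOutcome::NoDevices;
  return InitOutcome::Fatal;
}

// Hypothetical stand-in for the device-count path: report zero devices
// (with a non-fatal message) when init did not succeed.
// `devices_if_ready` stands in for the real enumeration result.
inline int device_count_after_init(std::uint32_t init_res, int devices_if_ready) {
  switch (classify_init_result(init_res)) {
    case InitOutcome::Ready:
      return devices_if_ready;
    case InitOutcome::NoDevices:
      std::fprintf(stderr, "zeInit() failed: err=0x%x\n",
                   static_cast<unsigned>(init_res));
      return 0;
    case InitOutcome::Fatal:
    default:
      // A real implementation would presumably keep the fatal path here.
      std::fprintf(stderr, "zeInit() failed (unexpected): err=0x%x\n",
                   static_cast<unsigned>(init_res));
      return 0;
  }
}
```

With something along these lines, `device_count_after_init(0x78000001, n)` would yield 0 and a one-line warning instead of the fatal error. The obvious downside is that a genuinely broken driver install would also be reported as "no devices", since level_zero returns the same code for both cases.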
