Some developer tests fail when lacking a GPU device

Issue #648 wontfix
Paul Hargrove created an issue

Following up on Issue #636 and discussion in the corresponding PR #508, this is a report of a number of developer-only tests which exhibit "bad" behavior when run on a system lacking a GPU for an enabled memory kind.

The following tests all fail "ungracefully", with an abort(), when UPC++ has been configured for a kind for which no device is detected. A kind appearing in parentheses indicates a test which is only run when that kind is enabled:

  • test-bad-segment-alloc
  • test-copy
  • test-gpu_microbenchmark
  • test-h-d-remote
  • test-rpc-ctor-trace
  • test-h-d (cuda)
  • test-cuda-context (cuda)
  • test-cuda_vecadd (cuda)
  • test-hip_vecadd (hip)
  • test-sycl_vecadd (ze)
  • test-ze_device (ze)

There is also an unfortunate case for test-memory_kinds when using the CUDA memory kind: it fails "gracefully" with an exit code of 0. However, since the output includes the string cuInit() failed: CUDA_ERROR_NO_DEVICE, a make dev-check (or similar, including make dev-run-tests) will report the test as FAILED due to a match on "ERROR".
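
The false alarm can be pictured with a minimal sketch of an output scan. The function name classify_run and its exact rule are assumptions made for illustration only; the actual pass/fail logic behind make dev-check may differ.

```shell
# Hypothetical stand-in for the test driver's pass/fail decision.
# Usage: classify_run EXIT_CODE OUTPUT_TEXT
classify_run() {
  exit_code=$1
  output=$2
  if [ "$exit_code" -ne 0 ]; then
    echo FAILED            # a non-zero exit always counts as a failure
  elif printf '%s\n' "$output" | grep -q 'ERROR'; then
    echo FAILED            # zero exit, but the output contains "ERROR"
  else
    echo PASSED
  fi
}

# A graceful exit still trips the scan because of the CUDA status name:
classify_run 0 'cuInit() failed: CUDA_ERROR_NO_DEVICE'
```

Under this assumed rule the last line prints FAILED even though the exit code is 0, which matches the behavior reported for test-memory_kinds.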

Comments (4)

  1. Paul Hargrove reporter

    If the crashes of eleven tests were the only issue, I would not hesitate to resolve this issue as WONTFIX, since these are developer-only tests in which we are willing to accept UB and/or lax error checking. In some cases, in fact, we probably want to see failures (such as a means to detect regressions in the non-trivial ZE device enumeration logic).

    @Dan Bonachea Your thoughts on addressing the "false alarm" attributable to CUDA_ERROR_NO_DEVICE in the output from test-memory_kinds?
    Options which come to my mind include, ordered from most to least effort:

    1. Add a mechanism analogous to WARNING_BANLIST to filter stdout and stderr from runs of tests
    2. Split off the false alarm on CUDA+test-memory_kinds as a distinct issue, to remain open, and then close this issue as WONTFIX
    3. Close this issue as WONTFIX despite this one case which might be addressed
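
Option 1 might look roughly like the sketch below. The variable OUTPUT_BANLIST and the function filter_output are hypothetical names chosen by analogy with WARNING_BANLIST; nothing with these names is known to exist in the UPC++ build system.

```shell
# Hypothetical: newline-separated extended regexes for known-benign output lines.
OUTPUT_BANLIST='cuInit\(\) failed: CUDA_ERROR_NO_DEVICE'

# Read test output on stdin, drop lines matching any banlist pattern,
# and pass every other line through unchanged.
filter_output() {
  if [ -n "$OUTPUT_BANLIST" ]; then
    # grep exits 1 when every line is filtered out; that is not an error here.
    grep -E -v -e "$OUTPUT_BANLIST" || true
  else
    cat
  fi
}

# The benign CUDA message is removed; a genuine error line survives.
printf '%s\n' 'cuInit() failed: CUDA_ERROR_NO_DEVICE' 'real ERROR here' | filter_output
```

Filtering whole lines before any scan for "ERROR" keeps genuine error messages visible while hiding only the patterns explicitly listed.
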
  2. Paul Hargrove reporter

    Notes on reproducing using systems located at U.Oregon:

    The "instinct" system has both CUDA and HIP GPUs.
    I simulated the no-device cases using CUDA_VISIBLE_DEVICES=3 and ROCR_VISIBLE_DEVICES=3 to name nonexistent GPUs.
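
A minimal sketch of that simulation, assuming a POSIX shell. The helper name with_gpu_index is hypothetical, and whether the variables actually reach the test processes depends on how the job launcher propagates the environment.

```shell
# Run a command with both the CUDA and ROCm runtimes restricted to one device index.
# Usage: with_gpu_index INDEX COMMAND [ARGS...]
with_gpu_index() {
  idx=$1
  shift
  env CUDA_VISIBLE_DEVICES="$idx" ROCR_VISIBLE_DEVICES="$idx" "$@"
}

# Show what the child process sees:
with_gpu_index 3 sh -c 'echo "$CUDA_VISIBLE_DEVICES $ROCR_VISIBLE_DEVICES"'

# Index 3 named a nonexistent GPU on the system described above; choose an
# index beyond the device count of your own machine, for example:
# with_gpu_index 3 make dev-check
```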

    The "headroom" system has a usable Intel GPU.
    However, I was not successful using ONEAPI_DEVICE_SELECTOR to simulate the no-device case (and was only willing to spend so much time on this). So, for the ze-kind/no-device case I used the "omnia" system which currently has the Intel kernel driver (i915) loaded, but no corresponding GPU hardware installed.

    I have notably not tested any of the three GPU memory kinds on systems where the corresponding driver is not installed.

  3. Dan Bonachea

    For the record, IMO configuring UPC++ and its tests to support a memory kind and then running GPU-relevant tests without a working GPU is pilot error.

    The tests listed in the OP all have the property that they are not written to gracefully recover in this scenario. Most of these tests (gpu_microbenchmark, h-d*, copy, cuda-context, *_vecadd) enforce their hard precondition using an assertion or other deliberate error message in the program, e.g.:

    UPCXX_ASSERT_ALWAYS(gpu_alloc.is_active(), "Failed to open GPU:\n" << kind_info);
    

    It's true this technically leads to an abort() on failure, but this is really just an error message that is clearly reporting to the user that they've erroneously requested a run that is incompatible with their available hardware. The fact the test driver interprets such runs as a test failure is a feature, because if all such failures were silently ignored we may never realize the GPU was not actually being exercised when running tests on a misconfigured system.

    A few others of these dev-tests (bad-segment-alloc, rpc-ctor-trace, ze_device) blindly assume the GPU is available when configured-in and instead assert inside libupcxx when violating a UPC++ interface precondition. This is admittedly a less friendly error behavior, and it technically relies on UB inside UPC++ (although most such GPU existence preconditions are enforced by the library using an assert_always). However, the root cause is still very much pilot error, and thus the test failure should not be silently suppressed.

    copy-cover and memory_kinds are both written "defensively" to tolerate the lack of a configured-in GPU at runtime with a non-fatal warning, mostly to demonstrate that's possible. However, both of these tests exist almost entirely for the purpose of testing GPU support, so IMO running them without a working GPU usually still amounts to pilot error.

    So my vote is to label this "working as intended" and close with WONTFIX.

  4. Paul Hargrove reporter

    Resolving as WONTFIX.

    As Dan says, the only way to encounter the eleven listed errors is via "pilot error", and the make target in question is for developers only (dev- prefix). So, anyone who does encounter the errors documented here should have the sense to recognize that the source of the errors is testing on an unsuitable platform.

    IMO, if the outlier for test-memory_kinds when missing a CUDA device is ever encountered by an end-user via make check, then we can consider a distinct issue explaining how that is the expected behavior when lacking a device corresponding to the support they configured for.
