Add `cuda_device::device_n()`

Issue #188 resolved
Dan Bonachea created an issue

Problem Statement

Constructing a functional upcxx::cuda_device in UPC++ requires a (0-based) integer ID identifying the device to open.

UPC++ currently provides no help in identifying the IDs valid for use in a given process. Any UPC++ process that wants to automatically make use of more than one CUDA GPU when available must first query how many are available.

Querying the number of CUDA devices available for use in UPC++ (with moderate/reasonable error checking) currently looks something like this:

#if UPCXX_KIND_CUDA
  #include <cuda.h>
#endif
#include <iostream>
#include <cstdlib>
...
int cuda_devices = 0;
#if UPCXX_KIND_CUDA
void print_error_and_abort(CUresult res) {
  const char *errname="", *errstr="";
  cuGetErrorName(res, &errname);
  cuGetErrorString(res, &errstr);
  std::cerr << "CUDA call failed! error="<<errname<<": "<<errstr<<std::endl;
  std::abort();
}
#endif
...
#if UPCXX_KIND_CUDA
  CUresult res;
  if ((res = cuInit(0)) != CUDA_SUCCESS) print_error_and_abort(res);
  if ((res = cuDeviceGetCount(&cuda_devices)) != CUDA_SUCCESS) print_error_and_abort(res);
#endif

This is tedious and nasty. We have variants of the code above embedded in several tests, and still more tests/examples that just hard-code single-GPU operation, partly because the query is tedious.

The same problem occurs (with different driver calls) for AMD and Intel GPUs, which means a multi-GPU-capable program written for all of those kinds may eventually end up embedding three distinct variants of the code above.

I propose to replace the code above with the following (for the case of cuda_device):

int cuda_devices = cuda_device::device_n(); 

and likely analogous calls for future GPU kinds.

Proposal

static cuda_device::id_type cuda_device::device_n(); 

Semantics

Returns the number of valid CUDA device IDs available to the calling process. Valid device IDs are integers in the range [0,device_n()).

This function may be called when UPC++ is in the uninitialized state.
UPC++ progress level: none

Comments (8)

  1. Paul Hargrove

    The "semantics" of the example code includes initializing the library to the point that its "GetCount" function may be called, but it (currently) does not perform any balancing "fini". No fini is required with CUDA, and calling cuInit(0) multiple times is harmless. I don't have confidence that either property is universal.

    What would our implementation approach be if for some future device library calling the library initializer multiple times was not permitted?
    What if calling it multiple times is permitted, but reference counted such that it required balancing finalization calls?
    In both cases my thought is that the user code could/should be permitted to perform their own initialization either before or after making the new call. Does that "fit" with this proposed call? (I think I know the answer, but want to hear it to be sure).

  2. Paul Hargrove

    I think I'd prefer to see the error case return a negative value derived from the library's error return (depending on CUDA vs other libs, the error code might be positive or negative, but I've seen no API in which both signs were used). I could make arguments for returning 0 on error, or even for throwing an exception. However, I think a device-specific negative error code provides more information and so is my preference. I would also accept returning -1 and dropping the specific error code, if that is more palatable.

    By making error cases non-fatal, we would hypothetically give the (advanced) user the option to proceed without a GPU if something is preventing initialization.

  3. Dan Bonachea reporter

    user code could/should be permitted to perform their own initialization either before or after making the new call. Does that "fit" with this proposed call

    Yes that is the (implied) intent.

    Note this property applies not only to this proposed function, but also to the existing cuda_device constructor, which also needs to init CUDA as part of opening the device; user code invoking the CUDA driver interface to perform other operations might similarly need to call cuInit() before constructing the first cuda_device.

    The current spec doesn't spell this out because (as you observe) 1) cuInit() is idempotent and 2) the CUDA runtime API (which is more commonly used in application code) performs implicit initialization (there is no init call). The UPC++ cuda_device memory kind emulates this "implicit init" behavior that most CUDA programmers are familiar with, so IMO it's not worth adding verbiage to clarify this.

    I think I'd prefer to see the error case return a negative value [...] By making error cases non-fatal, we would hypothetically give the (advanced) user ...

    I strongly disagree.

    cuda_device construction initializes the CUDA Driver API (when necessary) and currently any errors are immediately fatal (with a very nice/verbose error message), and I see no compelling reason to deviate from that (deliberately unspecified) behavioral policy here. Most CUDA errors arising elsewhere in our runtime are also treated as immediately fatal with a user-friendly message (at least in codemode debug). If you disagree with this current behavior feel free to open a NEW issue in the implementation repo to make your case; that policy is orthogonal to this proposal. The addition of this query allows apps to easily avoid the most important recoverable error (asking for a device that does not exist).

    IMO code that wants to be more paranoid can init CUDA themselves and ask it whatever they want. The point of this function is to make it easy to correctly open multiple devices under normal conditions. IOW this API is designed for programmability of the overwhelmingly common case, and is not intended to satisfy the "advanced user" who may want to issue CUDA queries for all sorts of information that are firmly outside the scope of UPC++.

  4. Amir Kamil

    However, I think a device-specific negative error code provides more information and so is my preference.

    It might make sense to define a static member constant as a device-specific error code, e.g. cuda_device::error.

  5. Paul Hargrove

    @Dan Bonachea You strongly disagreed with one of my two comments and failed to respond to the intended question in the other.

    I will drop the error-return suggestion. As you fall just short of stating explicitly: a user who wants my behavior is free to implement it on their own external to our library.

    In response to my concern over initialization semantics, you have stated the "(implied) intent" matches the goal I expressed, but not addressed how/why you believe your proposal is implementable given a library with an initializer with either or both of the properties I mention. That is the question for which "I think I know the answer, but want to hear it to be sure." In fact, rather than dispel my concerns, you have deepened them by reminding me that they also apply to our cuda_device construction. Sorry if I was not clear that I was asking more about how it would be made to fit than just "yes it fits" with a rationale that "it must because we assume elsewhere that it does."

  6. Dan Bonachea reporter

    how/why you believe your proposal is implementable given a library with an initializer with either or both of the properties I mention

    Apologies for not clarifying. The scope of this proposal is strictly for adding a method to the existing cuda_device class. We seem to be in agreement that the CUDA initialization semantics are simple/trivial to satisfy and have been a non-issue for cuda_device construction in practice since we first released the memory kind in 2019.3.0.

    When it comes time to discuss design proposals for a new device class (coming soon for HIP) we'll need to debate the specification and behavior of ALL the methods, where providing a method with functionality analogous to cuda_device::device_n() would just be one small part of that exercise. It's not impossible that driver behavior for some future memory kind would require us to include an explicit static initialization call in a new memory kind class, but that does not apply to cuda_device so is irrelevant to this issue (which I'd really like to keep focused).

    I'm already planning a discussion of "design plans" for future device memory kinds in our next meeting, but let's please stop co-opting this narrow enhancement issue for that purpose.

  7. Paul Hargrove

    When it comes time to discuss design proposals for a new device class (coming soon for HIP) we'll need to debate the specification and behavior of ALL the methods [....]

    [Possible behaviors of other device libraries are] irrelevant to this issue (which I'd really like to keep focused).

    Understood.
    I have no valid concerns regarding this proposal as a convenience for users of cuda_device.
    I also acknowledge that this issue is not a proper venue to discuss our ability to support other devices with (so far) hypothetical init/fini requirements.
