vec.setArray() for CUSP vectors only sets host-side array

Issue #39 resolved
Ashwin Srinath created an issue

Comments (31)

  1. Lisandro Dalcin

    You should first gpu_handle = v.getCUDAHandle(), then v.restoreCUDAHandle(gpu_handle). Could you please give it a try?
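
    For concreteness, the suggested sequence would look roughly like this in petsc4py (a sketch, assuming v is an existing VECCUSP vector whose host-side values were just set):

    v.setArray([1.0, 2.0, 3.0])       # updates the host-side copy
    gpu_handle = v.getCUDAHandle()    # intended to copy host -> device if the GPU copy is stale
    # ... use the raw CUDA pointer here (e.g. pass it to PyCUDA) ...
    v.restoreCUDAHandle(gpu_handle)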

  2. Ashwin Srinath reporter

    Thanks Lisandro - this doesn't work. And if I recall the implementation correctly, the input to restoreCUDAHandle isn't used anyway.

  3. Lisandro Dalcin

    Oh, you are right! Sorry for the confusion. Well, I guess I need to think a little more about it.

    Ultimately, this issue is not just in petsc4py, but in core PETSc. Maybe VecCUSPGetCUDAArray() should check the CPU/GPU status and copy CPU->GPU if needed before returning the CUDA array to the user.

    @BarryFSmith What do you think? Am I missing something?

  4. BarryFSmith

    @karlrupp @dmeiser

    Beats me. The documentation for VecCUSPGetCUDAArray() is so thin that it is not clear how it should behave. I note that VecCUSPGetArrayReadWrite() does do the copy down, but that VecCUSPGetArrayWrite() does not, since it assumes that the user will write into the space. I do not know why VecCUSPGetCUDAArray() calls VecCUSPGetArrayWrite() instead of, for example, VecCUSPGetArrayReadWrite().

  5. Lisandro Dalcin

    So basically PETSc is assuming that getting an array for write implies that the user will write ALL entries of the array. Mmm, I don't like such an assumption. Imagine you have a Vec, you set boundary conditions on the CPU, then you call the GPU to update values corresponding to non-boundary nodes; then the BCs are lost and you get garbage instead. Am I right?
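
    To make the hazard concrete, here is a hedged petsc4py-level sketch of that scenario (illustrative only; it assumes the pre-fix behaviour where getCUDAHandle() went through VecCUSPGetArrayWrite() and therefore skipped the host-to-device copy):

    n = v.getSize()
    # boundary conditions set on the CPU; the host-side copy is now the valid one
    v.setValues([0, n - 1], [1.0, 1.0])
    v.assemblyBegin(); v.assemblyEnd()

    # obtaining the device pointer through a write-only path does NOT copy host -> device;
    # it just marks the device copy as the valid one
    handle = v.getCUDAHandle()
    # ... launch a GPU kernel here that updates only the interior entries ...
    v.restoreCUDAHandle(handle)

    v.view()  # the boundary entries now hold stale device data, not 1.0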

  6. Dominic Meiser

    @BarryFSmith you are right, the fundamental problem is that the cache invalidation behaviour of functions like VecCUSPGetCUDAArray() is currently implementation-defined. It would be nice to have it specified somewhere.

    My interpretation has been that VecCUSPGetCUDAArray() just returns the pointer to the underlying cuda array without worrying about synchronization of data between host and device. If such synchronization is desired, the user needs to use VecCUSPGetArrayReadWrite(). As an implementation detail, VecCUSPGetCUDAArray() delegates to VecCUSPGetArrayWrite() because that handles allocation of the cuda buffer if it hasn't previously been allocated.

    @dalcinl Does VecCUSPGetArrayReadWrite() do what you want in this scenario? Perhaps this is an issue of poorly chosen names and lack of documentation?

  7. Lisandro Dalcin

    @dmeiser Yes, I think it does. Moreover, I agree with @BarryFSmith, that's the right thing to do. Would you all agree to push this change to petsc/maint?

    @ashwinsrinath Could you modify VecCUSPGetCUDAArray() to use VecCUSPGetArrayReadWrite() and let us know whether things work as expected? BTW, also remove the VecCUSPAllocateCheck() call; it does not seem to be necessary.

  8. Dominic Meiser

    @dalcinl I think that would be fine. It may cause a minor performance degradation in code that doesn't require the GPU data to be up to date but it shouldn't break any code.

    To summarize, the semantics that we're aiming for is:

    • VecCUSPGetArrayRead() and VecCUSPGetArrayReadWrite() synchronize with the host if needed before returning a device pointer (analogous to VecGetArray() and VecGetArrayRead()).
    • VecCUSPGetArrayReadWrite() invalidates the data on the host side while VecCUSPGetArrayRead() doesn't.

    Do we want to keep VecCUSPGetCUDAArray() with the current semantics (i.e. no synchronization with the host)? This would be a dangerous expert function. If we defuse this function by adding synchronization it would be redundant with VecCUSPGetArrayReadWrite(). In that case I would vote for removing VecCUSPGetCUDAArray(). Or am I misinterpreting the purpose of VecCUSPGetCUDAArray()?

  9. Ashwin Srinath reporter

    Dominic - VecCUSPGetCUDAArray is essential if you want to access the underlying CUDA array without the use of CUSP. Because CUSP has no Python bindings, it is essentially the only way to access it from Python.

  10. Ashwin Srinath reporter

    I may be wrong, but it looks like VecCUSPGetArrayReadWrite returns a pointer to the CUSP array. A thrust raw_pointer_cast must be applied to this to get the pointer to the raw array. This is what VecCUSPGetCUDAArray does, since that cast can't be done in Python.

    Have I misunderstood?

  11. Dominic Meiser

    @ashwinsrinath yes you are right, that makes sense. In that case I suppose we want VecCUSPGetCUDAArray() to also synchronize with the host if necessary.

    I'll open a pull request shortly.

  12. BarryFSmith

    Should we have the same three versions of VecCUSPGetCUDA that we have for VecCUSPGet? A read, a write, and a readwrite? Why is there only one "unclear what it does" VecCUSPGetCUDA without a read, write, or readwrite?

  13. Lisandro Dalcin

    @BarryFSmith IIRC, VecCUSPGetCUDAArray() was added to properly implement the Python wrappers. I think for C++ code VecCUSPGetArrayXXX() is nicer (as you can benefit from the thrust library immediately). So, I guess having just VecCUSPGetCUDAArray() with the right, safe semantics we agreed above should be enough for now. Of course, we can improve later, but only if really required.

  14. BarryFSmith

    So it is really not intended for users? Should it be marked as Level: developer?

    What about the documentation? Currently it says "Provides write access to the CUDA buffer inside a vector." So in fact there is nothing wrong with the current code in terms of correctness. If you want to read from the CUDA buffer, you cannot use this function since it is only for writing?

    Can someone make a pull request with any proposed changes?

  15. Ashwin Srinath reporter

    @dalcinl - I was able to test the PR submitted by @dmeiser (thanks!). While it fixes some problems, it still doesn't improve the situation with vec.setArray() in petsc4py. vec.setArray() only sets the host-side vector. Calling view() on a vector presumably copies data from the device to the host before printing. So doing, in sequence:

    v.setArray([1,2,3])
    v.view()
    

    will actually print the (outdated) device vector. To make it work, I have to do:

    v.setArray([1,2,3])
    h = v.getCUDAHandle() # this copies data from the host to the device if it is out of date
    v.view()
    

    Looking at the Cython code in petscvec.pxi, I think one approach is that vec_setarray needs to be able to handle the special case of a CUSP vector. I'm not sure if this is an elegant or idiomatic solution, but I'm struggling to think of a better one.

  16. Lisandro Dalcin

    Just pushed the change to use VecCUSP{Get|Restore}ArrayReadWrite() for CUDA handles.

    About your comments related to vec_setarray in the Cython code, I'm still not sure. Vec.setArray() is supposed to behave close to the C counterpart. Isn't v.setArray(...); v.restoreCUDAHandle(v.getCUDAHandle()) enough to ensure synchronization?
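
    I.e., something along these lines (a sketch of the idiom, assuming a VECCUSP vector v):

    v.setArray([1, 2, 3])
    v.restoreCUDAHandle(v.getCUDAHandle())  # getCUDAHandle() triggers the host -> device copy
    v.view()                                # both copies now agree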

  17. Ashwin Srinath reporter

    Yes - that will do the job, but it feels a bit awkward to me. Anyway, it works for now. Thanks so much Lisandro!

  18. Dominic Meiser

    @ashwinsrinath the example you give (v.setArray() followed by v.view()) is supposed to work. I've had a quick look at the code in petsc4py and petsc but can't see why it's not working. I'll try to build petsc4py (haven't done that in a while) and will try to reproduce this. It would be very helpful if you could post a complete but minimal python script that reproduces the issue. Thanks.

  19. Ashwin Srinath reporter

    @dmeiser - sorry for the confusion. Yes, v.setArray() followed by v.view() will work. But I wasn't sure if this was correctly setting the device-side vector, or only the host-side vector. To confirm my suspicion, I created a PyCUDA GPUArray that shares the buffer with the petsc4py device-side vec, and looked at that instead. I get some funny behaviour here. For example, see here:

    https://gist.github.com/ashwinsrnth/c80d429e240eeacac30a

    Without the call to getCUDAHandle, the GPUArray isn't updated when the vec is.
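
    Roughly what the gist does (a simplified sketch, not the gist itself; it assumes v is a double-precision VECCUSP vector, that PyCUDA's context setup is compatible with PETSc's, and that GPUArray accepts the raw integer device pointer returned by getCUDAHandle() as its gpudata argument):

    import numpy as np
    import pycuda.autoinit          # noqa: F401 -- context setup glossed over here
    import pycuda.gpuarray as gpuarray

    n = v.getLocalSize()
    handle = v.getCUDAHandle()      # raw device pointer; syncs host -> device if needed
    g = gpuarray.GPUArray((n,), np.float64, gpudata=handle)
    print(g.get())                  # device -> host copy; shows the device-side data
    v.restoreCUDAHandle(handle)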

  20. Dominic Meiser

    Ok, that makes sense. This is the expected behavior.

    To provide some background, the PETSc CUSP vectors cache data on the CPU and the GPU. Data is transferred from one memory space to the other lazily, i.e. only if necessary. In your case the data is up to date on the CPU after v.setArray() and it is stale on the GPU. If you were to request a handle to the GPU data via VecCUSPGetCUDAArrayRead() or similar, the data gets properly synchronized to the GPU before the CUDA device pointer is returned. If you obtain the CUDA pointer by some other mechanism (e.g. you store it from some previous call to VecCUSPGetCUDAArrayRead()) you're bypassing this caching mechanism.

    I'd recommend always using getCUDAHandle when you need a device pointer and releasing that pointer as soon as you don't need it anymore. These functions are relatively lightweight and automatically take care of the synchronization without copying data unnecessarily.

  21. Lisandro Dalcin

    Folks, maybe we should extend `{get|restore}CUDAHandle()` with an additional `mode` argument to handle the read/write/read-write intentions available in the C API?

  22. Ashwin Srinath reporter

    @dalcinl I found no problems. But - if I'm understanding correctly - all modes currently behave the same and allow both read and write access?

  23. Lisandro Dalcin

    Well, the different modes use the corresponding Vec{Get|Restore}CUDAArray{Read|Write|ReadWrite}(), and that has implications for the synchronization semantics of the GPU and CPU buffers. Of course, once you have the GPU handle, nothing prevents you from either reading or writing the GPU memory. In general, you should use a mode that matches your intention. Am I being clear enough?
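
    For example (a sketch; the exact spelling of the mode values 'r', 'w', 'rw' and passing the mode to restoreCUDAHandle are assumptions based on the discussion above):

    # read-only: host -> device sync if needed; the host copy stays valid
    h = v.getCUDAHandle('r')
    # ... read from the device buffer ...
    v.restoreCUDAHandle(h, 'r')

    # read-write: host -> device sync if needed; the host copy is marked stale
    h = v.getCUDAHandle('rw')
    # ... read and write on the device ...
    v.restoreCUDAHandle(h, 'rw')

    # write-only: no host -> device copy; the caller promises to overwrite every entry
    h = v.getCUDAHandle('w')
    # ... overwrite all entries on the device ...
    v.restoreCUDAHandle(h, 'w')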
