vec.setArray() for CUSP vectors only sets host-side array
How to reproduce this bug: https://gist.github.com/ashwinsrnth/dbf524466186c3c86fba
Comments (31)
-
reporter Thanks Lisandro - this doesn't work. And if I recall the implementation, the input to restoreCUDAHandle isn't used anyway. -
Oh, you are right! Sorry for the confusion. Well, I guess I need to think a little more about it.
Ultimately, this issue is not just in petsc4py, but in core PETSc. Maybe VecCUSPGetCUDAArray() should check the CPU/GPU status and copy CPU->GPU if needed before returning the CUDA array to the user. @BarryFSmith What do you think? Am I missing something?
-
@karlrupp @dmeiser
Beats me. The documentation for VecCUSPGetCUDAArray() is so thin that it is not clear how it should behave. I note that VecCUSPGetArrayReadWrite() does do the copy down, but VecCUSPGetArrayWrite() does not, since it assumes that the user will write into the space. I do not know why VecCUSPGetCUDAArray() calls VecCUSPGetArrayWrite() instead of, for example, VecCUSPGetArrayReadWrite().
-
So basically PETSc is assuming that getting an array for write implies that the user will write ALL entries of the array. Mmm, I don't like such an assumption. Imagine you have a Vec: you set boundary conditions on the CPU, then you call the GPU to update the values corresponding to non-boundary nodes; the BCs are lost and you get garbage instead. Am I right?
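[Editor's note] The hazard described above can be sketched in plain Python. This is a minimal mock of a lazily synchronized vector, not the real PETSc implementation; the class and method names are invented for illustration. The point is that a write-only access path that skips the host-to-device copy silently drops host-side edits such as boundary conditions.

```python
# Mock of a host/device vector with lazy synchronization (NOT real PETSc code).
class MockVec:
    def __init__(self, n):
        self.host = [0.0] * n
        self.device = [0.0] * n
        self.host_fresh = True  # True when the host copy holds the latest data

    def set_on_host(self, i, value):
        self.host[i] = value
        self.host_fresh = True

    def get_device_array_write(self):
        # Write-only access: assumes the caller overwrites EVERYTHING,
        # so no host->device copy is performed.
        self.host_fresh = False
        return self.device

    def get_device_array_readwrite(self):
        # Read-write access: copy down first if the host copy is newer.
        if self.host_fresh:
            self.device = list(self.host)
            self.host_fresh = False
        return self.device

v = MockVec(5)
v.set_on_host(0, 1.0)             # boundary condition set on the host
d = v.get_device_array_write()    # no copy down: the BC is silently dropped
d[2] = 42.0                       # "GPU kernel" touches interior entries only
print(d[0])                       # 0.0 -- the boundary value is gone
```

With get_device_array_readwrite() instead, the host value 1.0 would have been copied down before the kernel ran, which is exactly the behavior change discussed below.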
-
@BarryFSmith you are right, the fundamental problem is that the cache invalidation behaviour of functions like VecCUSPGetCUDAArray() is currently implementation defined. It would be nice to have it specified somewhere.
My interpretation has been that VecCUSPGetCUDAArray() just returns the pointer to the underlying cuda array without worrying about synchronization of data between host and device. If such synchronization is desired, the user needs to use VecCUSPGetArrayReadWrite(). As an implementation detail, VecCUSPGetCUDAArray() delegates to VecCUSPGetArrayWrite() because that handles allocation of the cuda buffer if it hasn't previously been allocated.
@dalcinl Does VecCUSPGetArrayReadWrite() do what you want in this scenario? Perhaps this is an issue of poorly chosen names and lack of documentation? -
@dmeiser Yes, I think it does. Moreover, I agree with @BarryFSmith, that's the right thing to do. Would you all agree to push this change to petsc/maint?
@ashwinsrinath Could you modify VecCUSPGetCUDAArray() to use VecCUSPGetArrayReadWrite() and let us know whether things work as expected? BTW, also remove the VecCUSPAllocateCheck() call; it seems to be unnecessary. -
reporter Lisandro - yes, I can try that
-
@dalcinl I think that would be fine. It may cause a minor performance degradation in code that doesn't require the GPU data to be up to date but it shouldn't break any code.
To summarize, the semantics we're aiming for are:
- VecCUSPGetArrayRead() and VecCUSPGetArrayReadWrite() synchronize with the host if needed before returning a device pointer (analogous to VecGetArrayRead() and VecGetArray(), respectively).
- VecCUSPGetArrayReadWrite() invalidates the data on the host side, while VecCUSPGetArrayRead() doesn't.
Do we want to keep VecCUSPGetCUDAArray() with the current semantics (i.e. no synchronization with the host)? This would be a dangerous expert function. If we defuse this function by adding synchronization, it would be redundant with VecCUSPGetArrayReadWrite(). In that case I would vote for removing VecCUSPGetCUDAArray(). Or am I misinterpreting the purpose of VecCUSPGetCUDAArray()? -
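[Editor's note] The two-point summary above can be sketched as a small cache model. This is a plain-Python illustration of the proposed semantics, not real PETSc code; the class and method names are invented. A "valid" set tracks which copies (host, device) are up to date; read access keeps both valid, read-write access invalidates the host copy, and write-only access skips the copy down entirely.

```python
# Sketch of the proposed access-mode semantics (NOT real PETSc code).
class CachedVec:
    def __init__(self, data):
        self.host = list(data)
        self.device = [0.0] * len(data)
        self.valid = {"host"}           # which copies are currently up to date

    def _sync_to_device(self):
        if "device" not in self.valid:
            self.device = list(self.host)
            self.valid.add("device")

    def get_array_read(self):
        self._sync_to_device()          # copy down if needed
        return self.device              # host copy stays valid

    def get_array_readwrite(self):
        self._sync_to_device()          # copy down if needed
        self.valid = {"device"}         # host copy is now stale
        return self.device

    def get_array_write(self):
        self.valid = {"device"}         # no copy: caller overwrites everything
        return self.device

v = CachedVec([1.0, 2.0, 3.0])
v.get_array_read()
print(sorted(v.valid))   # ['device', 'host'] -- read keeps both copies valid
v.get_array_readwrite()
print(sorted(v.valid))   # ['device'] -- host copy invalidated
```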
reporter Dominic - VecCUSPGetCUDAArray is essential if you want to access the underlying CUDA array without the use of CUSP. Because CUSP has no Python bindings, it is essentially the only way to access it from Python. -
@ashwinsrinath but doesn't VecCUSPGetArrayReadWrite() provide you with that pointer, too? -
reporter I may be wrong, but it looks like what VecCUSPGetArrayReadWrite is doing is returning the pointer to the CUSP array. A thrust raw_pointer_cast must be applied to this to get the pointer to the raw array. This is what VecCUSPGetCUDAArray is doing, since that can't be done from Python. Have I misunderstood?
-
@ashwinsrinath yes, you are right, that makes sense. In that case I suppose we want VecCUSPGetCUDAArray() to also synchronize with the host if necessary. I'll open a pull request shortly.
-
Should we have the same three versions of VecCUSPGetCUDAArray that we have for VecCUSPGetArray? A read, a write, and a readwrite? Why is there only one "unclear what it does" VecCUSPGetCUDAArray without a read, write, readwrite?
-
@BarryFSmith IIRC, VecCUSPGetCUDAArray() was added to properly implement the Python wrappers. I think for C++ code VecCUSPGetArrayXXX() is nicer (as you can benefit from the thrust library immediately). So, I guess having just VecCUSPGetCUDAArray() with the right, safe semantics we agreed on above should be enough for now. Of course, we can improve later, but only if really required. -
So it is really not intended for users? Should it be marked as Level: developer?
What about documentation? Currently it says "Provides write access to the CUDA buffer inside a vector." So in fact there is nothing wrong with the current code in terms of correctness: if you want to read from the CUDA buffer, you cannot use this function, since it is only for writing?
Can someone make a pull request with any proposed changes?
-
Sorry, took me a while to get the documentation to build correctly. PR is here:
https://bitbucket.org/petsc/petsc/pull-requests/395/add-veccuspgetcudaarrayread-write-and/diff
-
reporter @dalcinl - I was able to test the PR submitted by @dmeiser (thanks!). While it fixes some problems, it still doesn't improve the situation with vec.setArray() in petsc4py. vec.setArray() only sets the host-side vector. Calling view() on a vector presumably copies data from the device to the host before printing. So doing, in sequence:

    v.setArray([1, 2, 3])
    v.view()

will actually print the (outdated) device vector. To make it work, I have to do:

    v.setArray([1, 2, 3])
    h = v.getCUDAHandle()  # this copies data from the host to the device if it is out of date
    v.view()

Looking at the Cython code in petscvec.pxi, I think one approach is that vec_setarray needs to be able to handle the special case of a CUSP vector. I'm not sure if this is an elegant or idiomatic solution, but I'm struggling to think of a better one. -
Just pushed the change to use VecCUSP{Get|Restore}ArrayReadWrite() for CUDA handles.
About your comments related to vec_setarray in the Cython code, I'm still not sure. Vec.setArray() is supposed to behave close to the C counterpart. Isn't

    v.setArray(...)
    v.restoreCUDAHandle(v.getCUDAHandle())

enough to ensure synchronization? -
reporter Yes - that will do the job, but it feels a bit awkward to me. Anyway, it works for now. Thanks so much Lisandro!
-
- changed status to resolved
-
@ashwinsrinath the example you give (v.setArray() followed by v.view()) is supposed to work. I've had a quick look at the code in petsc4py and petsc but can't see why it's not working. I'll try to build petsc4py (haven't done that in a while) and will try to reproduce this. It would be very helpful if you could post a complete but minimal python script that reproduces the issue. Thanks. -
reporter @dmeiser - sorry for the confusion. Yes, v.setArray() followed by v.view() will work. But I wasn't sure whether this was correctly setting the device-side vector, or only the host-side vector. To confirm my suspicion, I created a PyCUDA GPUArray that shares the buffer with the petsc4py device-side vec, and looked at that instead. I get some funny behaviour here. For example, see here: https://gist.github.com/ashwinsrnth/c80d429e240eeacac30a
Without the call to getCUDAHandle, the GPUArray isn't updated when the vec is. -
Ok, that makes sense. This is the expected behavior.
To provide some background, the PETSc CUSP vectors cache data on the CPU and the GPU. Data is transferred from one memory space to the other lazily, i.e. only if necessary. In your case the data is up to date on the CPU after v.setArray() and stale on the GPU. If you request a handle to the GPU data via VecCUSPGetCUDAArrayRead() or similar, the data gets properly synchronized to the GPU before the cuda device pointer is returned. If you obtain the cuda pointer by some other mechanism (e.g. you store it from some previous call to VecCUSPGetCUDAArrayRead()), you're bypassing this caching mechanism.
I'd recommend always using getCUDAHandle when you need a device pointer, and releasing that pointer as soon as you don't need it anymore. These functions are relatively lightweight and automatically take care of the synchronization without copying data unnecessarily. -
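[Editor's note] Why a stored pointer bypasses the caching mechanism can be sketched in plain Python. This is an invented mock, not real PETSc or PyCUDA code: the handle request is the only place where synchronization happens, so a pointer taken earlier never learns about later host-side updates.

```python
# Mock showing why a stashed device pointer goes stale (NOT real PETSc code).
class LazyVec:
    def __init__(self, n):
        self.host = [0.0] * n
        self.device = [0.0] * n
        self.device_fresh = False

    def set_array(self, values):
        self.host = list(values)
        self.device_fresh = False       # the device copy is now stale

    def get_device_handle(self):
        # Synchronization happens ONLY here, on each handle request.
        if not self.device_fresh:
            self.device = list(self.host)
            self.device_fresh = True
        return self.device

v = LazyVec(3)
stale = v.get_device_handle()      # pointer taken early and stashed
v.set_array([1.0, 2.0, 3.0])       # host update: device copy is stale again
fresh = v.get_device_handle()      # a new request re-synchronizes
print(stale)                       # [0.0, 0.0, 0.0] -- never updated
print(fresh)                       # [1.0, 2.0, 3.0]
```

This mirrors the GPUArray observation above: a buffer captured outside getCUDAHandle sees only whatever the device held at capture time.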
reporter Cool, this makes sense - thanks!
-
Folks, maybe we should extend {get|restore}CUDAHandle() with an additional mode argument to handle the read/write/read-write intentions available in the C API? -
reporter I think that would be more Pythonic
-
@ashwinsrinath Could you please test this branch? https://bitbucket.org/petsc/petsc4py/branch/dalcinl/cuda-handle Do you like the implementation?
-
reporter @dalcinl I found no problems. But - if I'm understanding correctly - all modes currently behave the same and allow both read and write access?
-
Well, the different modes use the corresponding Vec{Get|Restore}CUDAArray{Read|Write|ReadWrite}(), and that has implications for the synchronization semantics of the GPU and CPU buffers. Of course, once you have the GPU handle, nothing prevents you from either reading or writing the GPU memory. In general, you should use a mode that matches your intention. Am I being clear enough? -
reporter @dalcinl Sorry for the late response - I understand now.
-
You should first gpu_handle = v.getCUDAHandle(), then v.restoreCUDAHandle(gpu_handle). Could you please give it a try?