Data race
I have a Jacobi 1D application, a very naive implementation, and I'm getting a data race that I cannot explain. The code is in my repo: (scott)/rb1d/upcxx/jac1d-ndist.cpp
I've attached the code so you can take a look. The code fails if you remove the barrier at line 86 (Comment: Why do we need this barrier?)
The problem may be a misunderstanding on my part about how storage for globals is defined. The race arises in RPCs that use global_ptrs which are themselves global variables.
Scott
Comments (5)
-
reporter -
What racy behavior happens when that barrier is removed?
-
Actually, your code is still wrong even with the barrier. The barrier needs to move to after the RPC waits have completed. Currently you are incorrectly assuming that, just because you've received acknowledgment that your outgoing data has landed, your inbound data has also arrived.
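The ordering argument above can be sketched with a small deterministic model. This is plain standard C++, not UPC++ code: `Rank`, `send_and_wait`, and `barrier_can_complete` are hypothetical names invented for illustration, and the two-rank setup is an assumption rather than the layout of the attached program.

```cpp
#include <array>
#include <cassert>

// Hypothetical model of one rank in a two-rank halo exchange.
struct Rank {
    double boundary = 0.0;    // value this rank sends to its neighbour
    double ghost = -1.0;      // ghost cell filled by the neighbour; -1.0 = not received
    bool send_acked = false;  // true once this rank's own outgoing write is acknowledged
};

using Ranks = std::array<Rank, 2>;

inline Ranks make_ranks() {
    Ranks r;
    r[0].boundary = 1.0;
    r[1].boundary = 2.0;
    return r;
}

// Models rank `src` writing into the ghost cell of rank `dst` and then
// returning from its wait, i.e. the outgoing data is known to have landed.
inline void send_and_wait(Ranks& r, int src, int dst) {
    r[dst].ghost = r[src].boundary;
    r[src].send_acked = true;
}

// A barrier can only complete once every rank has reached it, which in this
// model means every rank has finished its own send and wait.
inline bool barrier_can_complete(const Ranks& r) {
    return r[0].send_acked && r[1].send_acked;
}

// Rank 0 has its acknowledgment, but rank 1 has not sent anything yet.
// Rank 0 reading its ghost cell now sees stale data.
inline bool rank0_ghost_valid_without_barrier() {
    Ranks r = make_ranks();
    send_and_wait(r, 0, 1);
    return r[0].ghost == r[1].boundary;
}

// With a barrier placed after the waits, rank 0 cannot proceed until rank 1
// has also finished its send and wait, so rank 0's ghost cell is filled.
inline bool rank0_ghost_valid_with_barrier() {
    Ranks r = make_ranks();
    send_and_wait(r, 0, 1);
    send_and_wait(r, 1, 0);           // rank 1 must do this before it can enter the barrier
    assert(barrier_can_complete(r));  // only now may either rank leave the barrier
    return r[0].ghost == r[1].boundary;
}
```

In the model, rank 0's acknowledgment says nothing about rank 1's progress. Only the barrier placed after every rank's wait ties the two together.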
-
- changed status to resolved
-
reporter Correctemundo. The barrier appears at line 180. It is used for another purpose: timing. It's probably best if I add an extra barrier after the RPCs to avoid mishaps when the code is used out of context. Thanks.
NVM. It was a subtlety in converting code that uses a dist_obj to code that does not. But this did make me wonder: if I use an RPC to modify a global variable on another rank, but that rank hasn't started up yet, couldn't I modify that variable before its initialization occurs? The initialization would then take place and wipe out the modification made by the RPC. See line 86, where the barrier is commented out.
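The lost-update scenario in this question can be illustrated with a minimal model in plain standard C++. It is not UPC++ code: `ModelRank`, `inbox`, `progress`, and `initialize` are hypothetical names. The model assumes that an incoming RPC runs whenever the target rank makes progress, and that the "initialization" is an ordinary assignment the target performs at runtime. Whether the real program can hit this ordering depends on where the target actually makes progress relative to its initialization.

```cpp
#include <cassert>
#include <functional>
#include <queue>

// Hypothetical model of a target rank that owns one global variable.
struct ModelRank {
    int value = -1;  // stands in for the global; -1 = not yet initialized
    std::queue<std::function<void(ModelRank&)>> inbox;  // RPCs that have arrived

    // The rank's own runtime initialization of the global.
    void initialize() { value = 0; }

    // Run every pending RPC, as a progress call would in this model.
    void progress() {
        while (!inbox.empty()) {
            auto rpc = inbox.front();
            inbox.pop();
            rpc(*this);
        }
    }
};

// The RPC executes first and the initialization afterwards: the update is lost.
inline int value_when_rpc_runs_before_init() {
    ModelRank target;
    target.inbox.push([](ModelRank& self) { self.value = 42; });
    target.progress();    // RPC writes 42
    target.initialize();  // initialization overwrites it with 0
    return target.value;
}

// Initialization completes first (what a barrier between initialization and
// the first RPC would enforce), so the RPC's update survives.
inline int value_when_init_precedes_rpc() {
    ModelRank target;
    target.initialize();
    // A barrier here would keep any sender from issuing the RPC earlier.
    target.inbox.push([](ModelRank& self) { self.value = 42; });
    target.progress();
    return target.value;
}
```

In this model the first function returns the initialized value and the second returns the RPC's value, which matches the concern raised above: without something ordering initialization before the first remote modification, the modification can be silently overwritten.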