Data race

Issue #148 resolved
Scott Baden created an issue

I have a Jacobi 1D application, a very naive implementation, and I'm getting a data race that I cannot explain. The code is in my repo: (scott)/rb1d/upcxx/jac1d-ndist.cpp

I've attached the code so you can take a look. The code fails if you remove the barrier at line 86 (comment: "Why do we need this barrier?").

The problem may be a misunderstanding on my part about how storage is defined for globals. The race arises in RPCs that involve global_ptrs which are themselves global variables.

Scott

Comments (5)

  1. Scott Baden reporter

    NVM. It was a subtlety in converting code that uses a dist_obj to code that does not. But this did cause me to wonder: if I use an RPC to modify a global variable on another rank, but that rank hasn't started up yet, couldn't I modify that variable before its initialization occurs? The initialization would then take place afterwards and wipe out the modification made by the RPC. See line 86, where the barrier is commented out.

  2. john bachan

    Actually, your code is still wrong even with the barrier. The barrier needs to move to after the RPC waits have completed. Currently you are incorrectly assuming that, just because you've received acknowledgment that your outgoing data has landed, your inbound data has arrived.

  3. Scott Baden reporter

    Correctemundo. The barrier appears at line 180. It is used for another purpose: timing. It's probably best if I add an extra barrier after the RPCs to avoid mishaps when the code is used out of context. Thanks.
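
The two orderings discussed in this thread can be illustrated with a small, single-threaded toy model. This is a sketch under stated assumptions, not the code from jac1d-ndist.cpp and not UPC++ itself: `Rank`, `send_ghost`, and the four scenario functions are hypothetical names invented for illustration, and each rank's pending RPCs are modeled as a plain queue that runs when that rank makes progress. In real UPC++ code the points marked "barrier" would correspond to calls to `upcxx::barrier()`.

```cpp
#include <functional>
#include <queue>

// Hypothetical model of one rank: a "global" ghost cell plus a queue of
// incoming RPCs that run only when the rank makes progress.
struct Rank {
    double ghost = -1.0;
    std::queue<std::function<void()>> inbox;

    // Run every RPC that has arrived so far.
    void progress() {
        while (!inbox.empty()) {
            inbox.front()();
            inbox.pop();
        }
    }
};

// Model of an RPC that writes `value` into the target's ghost cell.
void send_ghost(Rank& target, double value) {
    target.inbox.push([&target, value] { target.ghost = value; });
}

// Comment 1: the RPC runs before the target has initialized its global,
// so the later initialization overwrites the update.
double init_without_barrier() {
    Rank r1;
    send_ghost(r1, 42.0);  // a neighbor is already sending
    r1.progress();         // the RPC runs during r1's startup
    r1.ghost = 0.0;        // r1's own initialization happens afterwards
    return r1.ghost;       // the RPC's update has been lost
}

// Fix: every rank initializes, then a barrier, then RPCs may be sent.
double init_with_barrier() {
    Rank r1;
    r1.ghost = 0.0;        // initialization first
    // barrier: no rank sends until all ranks have initialized
    send_ghost(r1, 42.0);
    r1.progress();
    return r1.ghost;       // the update survives
}

// Comment 2: rank 0's outgoing RPC has been acknowledged, but rank 1 has
// not sent its own data yet, so rank 0 reads a stale ghost cell.
double read_after_own_ack_only() {
    Rank r0, r1;
    r0.ghost = 0.0;
    r1.ghost = 0.0;
    send_ghost(r1, 1.0);
    r1.progress();            // rank 0's wait would complete here
    double seen = r0.ghost;   // inbound data has not arrived yet
    send_ghost(r0, 2.0);      // rank 1 sends late
    r0.progress();
    return seen;
}

// Fix: barrier after all RPC waits, and only then read the inbound data.
double read_after_barrier() {
    Rank r0, r1;
    r0.ghost = 0.0;
    r1.ghost = 0.0;
    send_ghost(r1, 1.0);
    send_ghost(r0, 2.0);
    r1.progress();            // rank 0's wait completes
    r0.progress();            // rank 1's wait completes
    // barrier: every rank has finished its waits, so every RPC has run
    return r0.ghost;
}
```

In this model, `init_without_barrier()` and `read_after_own_ack_only()` return the stale value 0.0, while `init_with_barrier()` returns 42.0 and `read_after_barrier()` returns 2.0. The model fixes one particular interleaving for each case; in a real run the failing orderings occur only some of the time, which is why the bug shows up as a race.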
