Performance slow-down on single node between SMP and IBV backends

Issue #216 resolved
Amin M. Khan created an issue

For single-node execution, I have noticed a significant slow-down when switching from the smp backend to the ibv backend. On a 16-core machine with more than 60 GB of RAM, the run took 2.5 seconds with smp but 22 seconds with ibv.

The code snippet follows on from the example mentioned in issue #215. My code only uses fetch() in the initialization phase, and afterwards works solely with raw C++ pointers obtained via upcxx::global_ptr::local(), without any rget or rpc. In this example, all processes write messages into a buffer (currently a upcxx::global_ptr<std::vector<Message>>) for every other process, and then each process reads the messages addressed to it. (std::vector isn't a good idea in this context and will be replaced with upcxx::new_array<Message>().)

The key question here follows on from the discussion in issue #211. On a single node, can UPC++ with raw C++ pointers obtained via upcxx::global_ptr::local() give us maximum performance, or is multi-threading (for instance OpenMP) needed? Is a performance penalty expected when accessing other processes' shared memory in UPC++, compared to multiple threads within a single process?

In this particular single-node case, the workaround is to always compile against smp. But this doesn't solve the issue in the two-node scenario, since we need to build against ibv for inter-node communication.

I don't have an MWE, and the source code is currently work-in-progress and not public yet. @bonachea you can check it here. The code that writes into the buffer is in worker.cpp, lines 174-180, and the code that reads from the buffer is in worker.cpp, lines 459-471. The example gets built from pagerank.cpp.

Here is the output when built against ibv backend:

$ upcxx-run -i pagerank-IBV
UPCXXLibraryVersion: 20180900
GASNetCoreLibraryName: IBV
GASNetCoreLibraryVersion: 2.0
GASNetAuxSeg_barr: 64*64
GASNetExtendedLibraryName: IBV
GASNetExtendedLibraryVersion: 2.0
GASNetGitHash: gex-2018.9.0
GASNetCompilerID: |COMPILER_FAMILY:GNU|COMPILER_VERSION:6.3.0|COMPILER_FAMILYID:1|STD:__STDC__,__STDC_VERSION__=201112L|misc:6.3.0|
GASNetSystemName: login-0-0.local
GASNetSystemTuple: x86_64-unknown-linux-gnu
GASNetConfigureArgs: '--disable-psm' '--disable-mxm' '--disable-portals4' '--disable-ofi' '--disable-parsync' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-sysv'
GASNetBuildId: Fri Feb 22 15:12:25 CET 2019 akhan
GASNetBuildTimestamp: Feb 22 2019 15:18:46
GASNetToolsConfig: RELEASE=2018.9.0,SPEC=1.12,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetToolsThreadModel: SEQ
GASNetVISMinPackBuffer: 8192
GASNetVISNPAM: 0
GASNetStridedDirectDims: 15
GASNetStridedLoopingDims: 8
GASNetStridedVersion: 2.0
GASNetAuxSeg_coll: GASNET_COLL_SCRATCH_SIZE:(2*(1024*1024))
GASNetConduitName: IBV
GASNetConfig: (libgasnet.a) RELEASE=2018.9.0,SPEC=0.6,CONDUIT=IBV(IBV-2.0/IBV-2.0),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetSegment: GASNET_SEGMENT_FAST
GASNetThreadModel: GASNET_SEQ
GASNetAPIVersion: 1
GASNetEXAPIVersion: 0.6
GASNetDefaultMaxSegsizeStr: 0.85/H
GASNetMPISpawner: 1
GASNetSSHSpawner: 1

Here is the output when built against smp backend:

$ upcxx-run -i pagerank-SMP
UPCXXLibraryVersion: 20180900
GASNetCoreLibraryName: SMP
GASNetCoreLibraryVersion: 2.0
GASNetAuxSeg_barr: 64*64
GASNetExtendedLibraryName: SMP
GASNetExtendedLibraryVersion: 2.0
GASNetGitHash: gex-2018.9.0
GASNetCompilerID: |COMPILER_FAMILY:GNU|COMPILER_VERSION:6.3.0|COMPILER_FAMILYID:1|STD:__STDC__,__STDC_VERSION__=201112L|misc:6.3.0|
GASNetSystemName: login-0-0.local
GASNetSystemTuple: x86_64-unknown-linux-gnu
GASNetConfigureArgs: '--disable-psm' '--disable-mxm' '--disable-portals4' '--disable-ofi' '--disable-parsync' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-sysv'
GASNetBuildId: Fri Feb 22 15:12:25 CET 2019 akhan
GASNetBuildTimestamp: Feb 22 2019 15:15:14
GASNetToolsConfig: RELEASE=2018.9.0,SPEC=1.12,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetToolsThreadModel: SEQ
GASNetVISMinPackBuffer: 8192
GASNetStridedDirectDims: 15
GASNetStridedLoopingDims: 8
GASNetStridedVersion: 2.0
GASNetAuxSeg_coll: GASNET_COLL_SCRATCH_SIZE:(2*(1024*1024))
GASNetConduitName: SMP
GASNetConfig: (libgasnet.a) RELEASE=2018.9.0,SPEC=0.6,CONDUIT=SMP(SMP-2.0/SMP-2.0),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetSegment: GASNET_SEGMENT_FAST
GASNetThreadModel: GASNET_SEQ
GASNetAPIVersion: 1
GASNetEXAPIVersion: 0.6
GASNetDefaultMaxSegsizeStr: 0.85/H

Comments (4)

  1. Dan Bonachea

    @aminmkhan : I don't see anything obvious in the ident output to explain a performance difference between conduits.

    However, I'm not sure the std::vector representation issue is separable from the performance observation. In particular, I question whether the code is reliably correct with the current data structure setup. If the behavioral correctness is off, then the performance is probably irrelevant at this stage.

    This code in particular makes me "nervous":

        void create_buffers() {
            my_push_buffers_g = upcxx::new_array<std::vector<Message<TableKey_T, ItemKey_T, Msg_T>>>(total_workers);
            my_push_buffers_dist = new upcxx::dist_object<upcxx::global_ptr<std::vector<Message<TableKey_T, ItemKey_T, Msg_T>>>>(my_push_buffers_g);
    
            their_push_buffers_g.reserve(total_workers);
            their_local_push_buffers.reserve(total_workers);
            fetch_futures.reserve(total_workers);
    
            assert(my_push_buffers_g.is_local());
            this->my_push_buffers = my_push_buffers_g.local();
        }
    

    This places std::vector object headers in shared memory, so other processes (notably co-located processes) can access those vector object headers (and for example, query size()). However the hidden data() pointer inside the vector object still references vector elements on the private malloc heap (since these vectors are using the std::allocator), so any attempt by a non-local process to access the vector elements will almost certainly do The Wrong Thing (because the private malloc heap is not shared between processes, and local pointers into that heap are just garbage to other processes). It's possible I'm misreading your code, but apply_push_incoming_local and process_push_buffer appear to be attempting this problematic form of access. Note that on a system with non-randomized VM areas such problematic accesses may happen to avoid a seg fault and simply give silently incorrect behavior.
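
    To make the hazard concrete, here is a rough, hypothetical sketch (not your actual code) of the kind of access I'm describing, assuming ranks 0 and 1 are co-located:

        #include <upcxx/upcxx.hpp>
        #include <vector>

        // Hypothetical sketch: the std::vector *header* lives in the shared
        // segment, but its elements live on the owner's private malloc heap,
        // so a co-located peer must not dereference the embedded data() pointer.
        int main() {
            upcxx::init();
            upcxx::global_ptr<std::vector<int>> gvec;
            if (upcxx::rank_me() == 0) {
                gvec = upcxx::new_<std::vector<int>>(); // header in rank 0's shared segment
                gvec.local()->assign(100, 42);          // elements on rank 0's private heap
            }
            gvec = upcxx::broadcast(gvec, 0).wait();
            if (upcxx::rank_me() == 1 && gvec.is_local()) {
                std::vector<int> *v = gvec.local();     // reading the header is fine,
                auto n = v->size();                     // e.g. size() gives the right answer,
                (void)n;
                // but v->data() points into rank 0's private heap: dereferencing it
                // here is undefined behavior (it may crash or silently read garbage).
            }
            upcxx::barrier();
            if (upcxx::rank_me() == 0) upcxx::delete_(gvec);
            upcxx::finalize();
        }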

    I previously linked you to our std::allocator replacement that induces std::vector to use the shared heap for vector elements; you might need to drop that in to get correct behavior from this vector-based version.
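
    The linked allocator isn't reproduced here; the general idea is roughly the following hypothetical, simplified sketch of an allocator that draws element storage from the calling process's shared segment:

        #include <upcxx/upcxx.hpp>
        #include <cstddef>
        #include <new>
        #include <vector>

        // Hypothetical, simplified sketch (not the actual linked allocator):
        // a std::allocator-compatible allocator whose storage comes from the
        // UPC++ shared segment of the calling process.
        template<typename T>
        struct shared_seg_allocator {
            using value_type = T;
            shared_seg_allocator() = default;
            template<typename U>
            shared_seg_allocator(const shared_seg_allocator<U>&) noexcept {}

            T* allocate(std::size_t n) {
                // upcxx::allocate returns uninitialized storage in the shared segment
                upcxx::global_ptr<T> gp = upcxx::allocate<T>(n);
                if (!gp) throw std::bad_alloc();
                return gp.local();
            }
            void deallocate(T* p, std::size_t) noexcept {
                // recover the global_ptr for a pointer we know is in our own segment
                upcxx::deallocate(upcxx::to_global_ptr(p));
            }

            template<typename U>
            bool operator==(const shared_seg_allocator<U>&) const noexcept { return true; }
            template<typename U>
            bool operator!=(const shared_seg_allocator<U>&) const noexcept { return false; }
        };

        // usage: std::vector<Message, shared_seg_allocator<Message>> buf;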

  2. Dan Bonachea

    "our std::allocator replacement that induces std::vector to use the shared heap for vector elements"

    As a further note, the placement of the elements in shared memory doesn't magically make std::vector or other containers aware of global_ptr and memory mappings - for example, the local pointer to the elements embedded in the std::vector should not be directly used by a different process in local_team(), because the shared heap may be mapped into a different base address in virtual memory on the non-allocating process. This means that naively using the element-accessors on a std::vector created by a different process is still likely to generate incorrect results. The correct way to handle this in the case of a vector is demonstrated in the example program - namely, the owning process can up-cast the vector's data() pointer to a global_ptr using try_global_ptr() (generating a portable, universal pointer). That global_ptr can then be communicated to other processes who can use the global_ptr for RMA on the elements, or (in the case of co-located processes in local_team) downcast it using global_ptr::local() to a raw pointer to the shared elements that is meaningful to that process.
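
    A minimal sketch of that pattern (hypothetical buffer of doubles; with a vector whose elements already live in the shared segment, the owner would obtain the global_ptr via upcxx::try_global_ptr(vec.data()) instead of new_array):

        #include <upcxx/upcxx.hpp>
        #include <cstddef>

        int main() {
            upcxx::init();
            const std::size_t N = 8;

            // Owner side: element storage in the shared segment, published as a
            // portable global_ptr through a dist_object.
            upcxx::global_ptr<double> mine = upcxx::new_array<double>(N);
            double *p = mine.local();
            for (std::size_t i = 0; i < N; i++) p[i] = upcxx::rank_me();
            upcxx::dist_object<upcxx::global_ptr<double>> dobj(mine);
            upcxx::barrier();

            // Peer side: fetch the neighbor's global_ptr, then either down-cast it
            // (co-located in local_team) or use it for RMA (remote).
            int peer = (upcxx::rank_me() + 1) % upcxx::rank_n();
            upcxx::global_ptr<double> theirs = dobj.fetch(peer).wait();
            double first;
            if (theirs.is_local()) {
                double *q = theirs.local();         // raw pointer valid in *this* process
                first = q[0];
            } else {
                first = upcxx::rget(theirs).wait(); // portable RMA path
            }
            (void)first;

            upcxx::barrier();
            upcxx::delete_array(mine);
            upcxx::finalize();
        }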

  3. Amin M. Khan reporter

    Thanks @bonachea for such a detailed explanation. Yes, I agree with you that the performance seems to be affected by incorrect behaviour in this particular case.

    I will keep this issue open to follow up, and test the same logic without std::vector, relying directly on upcxx::global_ptr<Message> with new_array (and then with the std::allocator replacement).

    I will report back if I still notice any performance difference; otherwise I will mark this resolved.

  4. Amin M. Khan reporter

    I tried std::vector with the UPC++ replacement for std::allocator, and initially ran into similar issues, but didn't spend much time on it (since std::vector can't be used with RMA anyway).

    I was able to get a working version using the upcxx::new_array<Message<>>(BUFFER_MAX_SIZE) approach, and the running times were similar whether using the ibv or the smp backend.
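
    For reference, the working pattern is roughly the following sketch (the Message fields and BUFFER_MAX_SIZE are simplified placeholders, not the actual code): each rank owns one fixed-capacity slot per peer in its shared segment, the slots' global_ptrs are exchanged once, and writers push whole batches with rput; the same code path works for both smp and ibv.

        #include <upcxx/upcxx.hpp>
        #include <cstddef>
        #include <vector>

        struct Message { int src; int key; double value; };    // trivially copyable

        int main() {
            upcxx::init();
            const std::size_t BUFFER_MAX_SIZE = 1024;           // assumed capacity per writer
            const int n = upcxx::rank_n(), me = upcxx::rank_me();

            // Each rank's inbox: one fixed-size slot per writer, in the shared segment.
            upcxx::global_ptr<Message> inbox =
                upcxx::new_array<Message>(std::size_t(n) * BUFFER_MAX_SIZE);
            upcxx::dist_object<upcxx::global_ptr<Message>> dobj(inbox);
            upcxx::barrier();

            // Write phase: push this rank's batch into its slot of every peer's inbox.
            std::vector<Message> batch(1, Message{me, 0, 1.0});
            for (int peer = 0; peer < n; peer++) {
                upcxx::global_ptr<Message> slot =
                    dobj.fetch(peer).wait() + std::size_t(me) * BUFFER_MAX_SIZE;
                upcxx::rput(batch.data(), slot, batch.size()).wait();
            }
            upcxx::barrier();

            // Read phase: drain the local inbox through a raw C++ pointer.
            Message *local_inbox = inbox.local();
            Message first = local_inbox[0];   // first message from writer rank 0
            (void)first;

            upcxx::barrier();
            upcxx::delete_array(inbox);
            upcxx::finalize();
        }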
