Performance slowdown on a single node between SMP and IBV backends
For single-node execution, I have noticed a significant slowdown when switching between the smp and ibv backends. On a 16-core machine with more than 60 GB RAM, a run took 2.5 seconds with smp but 22 seconds with ibv.
The code snippet follows on from the example mentioned in issue #215. My code only uses fetch() in the initialization phase, and afterwards works solely with raw C++ pointers obtained via global_ptr::local(), without using any rget or rpc. In this example, all processes basically write messages to a buffer (currently using upcxx::global_ptr<std::vector<Message>>) for all the other processes, and then all processes read the messages addressed to them from the buffer. (std::vector isn't a good idea in this context and will be replaced with upcxx::new_array<Message>().)
The key question here follows on from the discussion in issue #211. On a single node, can UPC++ with raw C++ pointers obtained via global_ptr::local() give us maximum performance, or is there a need for multi-threading, for instance OpenMP? Is a performance penalty expected when accessing other processes' shared memory in UPC++, as compared to multiple threads in a single process?
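To make the pattern concrete, here is a minimal sketch of the scheme described above: every rank allocates an inbox of one slot per peer in its shared segment, peers write through raw C++ pointers when co-located, and fall back to rput() off-node. The Message type and the buffer layout are hypothetical stand-ins, not the actual pagerank code.

```cpp
#include <upcxx/upcxx.hpp>

struct Message { int src; double payload; };  // hypothetical stand-in type

int main() {
  upcxx::init();
  const int n  = upcxx::rank_n();
  const int me = upcxx::rank_me();

  // Each rank allocates one inbox slot per peer in its shared segment.
  upcxx::global_ptr<Message> inbox = upcxx::new_array<Message>(n);
  upcxx::dist_object<upcxx::global_ptr<Message>> dinbox(inbox);

  // Write one message into every peer's inbox. For a co-located peer the
  // global_ptr can be downcast once (e.g. during initialization) and then
  // used for ordinary C++ stores; off-node peers need RMA.
  for (int peer = 0; peer < n; ++peer) {
    upcxx::global_ptr<Message> slot = dinbox.fetch(peer).wait() + me;
    if (slot.is_local())
      *slot.local() = Message{me, 1.0};            // raw-pointer write
    else
      upcxx::rput(Message{me, 1.0}, slot).wait();  // needed off-node (ibv)
  }
  upcxx::barrier();

  // Read the messages addressed to me directly through a raw pointer.
  Message *mine = inbox.local();
  for (int peer = 0; peer < n; ++peer)
    (void)mine[peer].payload;

  upcxx::barrier();  // ensure all readers are done before freeing
  upcxx::delete_array(inbox);
  upcxx::finalize();
}
```

Note the downcast is only valid for ranks in local_team(); the is_local() check keeps the same code path correct when running across nodes.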
In this particular case, for a single node, the workaround can be to always compile with smp. But this doesn't solve the issue for the two-node scenario, as we need to build against ibv for inter-node communication.
I don't have an MWE, but the source code is currently work-in-progress and not public yet. @bonachea, you can check it here. The code that writes into the buffer is in worker.cpp, lines 174-180, while the code that reads from the buffer is in worker.cpp, lines 459-471. The example gets built from pagerank.cpp.
Here is the output when built against the ibv backend:
$ upcxx-run -i pagerank-IBV
UPCXXLibraryVersion: 20180900
GASNetCoreLibraryName: IBV
GASNetCoreLibraryVersion: 2.0
GASNetAuxSeg_barr: 64*64
GASNetExtendedLibraryName: IBV
GASNetExtendedLibraryVersion: 2.0
GASNetGitHash: gex-2018.9.0
GASNetCompilerID: |COMPILER_FAMILY:GNU|COMPILER_VERSION:6.3.0|COMPILER_FAMILYID:1|STD:__STDC__,__STDC_VERSION__=201112L|misc:6.3.0|
GASNetSystemName: login-0-0.local
GASNetSystemTuple: x86_64-unknown-linux-gnu
GASNetConfigureArgs: '--disable-psm' '--disable-mxm' '--disable-portals4' '--disable-ofi' '--disable-parsync' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-sysv'
GASNetBuildId: Fri Feb 22 15:12:25 CET 2019 akhan
GASNetBuildTimestamp: Feb 22 2019 15:18:46
GASNetToolsConfig: RELEASE=2018.9.0,SPEC=1.12,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetToolsThreadModel: SEQ
GASNetVISMinPackBuffer: 8192
GASNetVISNPAM: 0
GASNetStridedDirectDims: 15
GASNetStridedLoopingDims: 8
GASNetStridedVersion: 2.0
GASNetAuxSeg_coll: GASNET_COLL_SCRATCH_SIZE:(2*(1024*1024))
GASNetConduitName: IBV
GASNetConfig: (libgasnet.a) RELEASE=2018.9.0,SPEC=0.6,CONDUIT=IBV(IBV-2.0/IBV-2.0),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetSegment: GASNET_SEGMENT_FAST
GASNetThreadModel: GASNET_SEQ
GASNetAPIVersion: 1
GASNetEXAPIVersion: 0.6
GASNetDefaultMaxSegsizeStr: 0.85/H
GASNetMPISpawner: 1
GASNetSSHSpawner: 1
Here is the output when built against the smp backend:
$ upcxx-run -i pagerank-SMP
UPCXXLibraryVersion: 20180900
GASNetCoreLibraryName: SMP
GASNetCoreLibraryVersion: 2.0
GASNetAuxSeg_barr: 64*64
GASNetExtendedLibraryName: SMP
GASNetExtendedLibraryVersion: 2.0
GASNetGitHash: gex-2018.9.0
GASNetCompilerID: |COMPILER_FAMILY:GNU|COMPILER_VERSION:6.3.0|COMPILER_FAMILYID:1|STD:__STDC__,__STDC_VERSION__=201112L|misc:6.3.0|
GASNetSystemName: login-0-0.local
GASNetSystemTuple: x86_64-unknown-linux-gnu
GASNetConfigureArgs: '--disable-psm' '--disable-mxm' '--disable-portals4' '--disable-ofi' '--disable-parsync' '--enable-pshm' '--disable-pshm-posix' '--enable-pshm-sysv'
GASNetBuildId: Fri Feb 22 15:12:25 CET 2019 akhan
GASNetBuildTimestamp: Feb 22 2019 15:15:14
GASNetToolsConfig: RELEASE=2018.9.0,SPEC=1.12,PTR=64bit,nodebug,SEQ,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetToolsThreadModel: SEQ
GASNetVISMinPackBuffer: 8192
GASNetStridedDirectDims: 15
GASNetStridedLoopingDims: 8
GASNetStridedVersion: 2.0
GASNetAuxSeg_coll: GASNET_COLL_SCRATCH_SIZE:(2*(1024*1024))
GASNetConduitName: SMP
GASNetConfig: (libgasnet.a) RELEASE=2018.9.0,SPEC=0.6,CONDUIT=SMP(SMP-2.0/SMP-2.0),THREADMODEL=SEQ,SEGMENT=FAST,PTR=64bit,noalign,pshm,nodebug,notrace,nostats,nodebugmalloc,nosrclines,timers_native,membars_native,atomics_native,atomic32_native,atomic64_native
GASNetSegment: GASNET_SEGMENT_FAST
GASNetThreadModel: GASNET_SEQ
GASNetAPIVersion: 1
GASNetEXAPIVersion: 0.6
GASNetDefaultMaxSegsizeStr: 0.85/H
Comments (4)

@bonachea:
@aminmkhan: I don't see anything obvious in the ident output to explain a performance difference between conduits.

However, I'm not sure the std::vector representation issue is separable from the performance observation. In particular, I question whether the code is reliably correct with the current data-structure setup. If the behavioral correctness is off, then the performance is probably irrelevant at this stage.

This code in particular makes me "nervous": it places std::vector object headers in shared memory, so other processes (notably co-located processes) can access those vector object headers (and, for example, query size()). However, the hidden data() pointer inside the vector object still references vector elements on the private malloc heap (since these vectors are using std::allocator), so any attempt by a non-local process to access the vector elements will almost certainly do The Wrong Thing (because the private malloc heap is not shared between processes, and local pointers into that heap are just garbage to other processes). It's possible I'm misreading your code, but apply_push_incoming_local and process_push_buffer appear to be attempting this problematic form of access. Note that on a system with non-randomized VM areas, such problematic accesses may happen to avoid a seg fault and simply give silently incorrect behavior.

I previously linked you to our std::allocator replacement that induces std::vector to use the shared heap for vector elements; you might need to drop that in to get correct behavior from this version using vectors.

@bonachea:
As a further note, the placement of the elements in shared memory doesn't magically make std::vector or other containers aware of global_ptr and memory mappings. For example, the local pointer to the elements embedded in the std::vector should not be used directly by a different process in local_team(), because the shared heap may be mapped at a different base address in virtual memory on the non-allocating process. This means that naively using the element accessors on a std::vector created by a different process is still likely to generate incorrect results. The correct way to handle this in the case of a vector is demonstrated in the example program: the owning process can up-cast the vector's data() pointer to a global_ptr using try_global_ptr() (generating a portable, universal pointer). That global_ptr can then be communicated to other processes, which can use it for RMA on the elements, or (in the case of co-located processes in local_team()) downcast it using global_ptr::local() to a raw pointer to the shared elements that is meaningful to that process.

reporter:
Thanks @bonachea for such a detailed explanation. Yes, I agree with you that the performance seems to be affected by incorrect behaviour in this particular case.

I will keep this issue open to follow up, and will test the same logic without std::vector, instead relying directly on g_ptr<Message> with new_array (and then with the std::allocator replacement). I will get back if I still notice any performance difference; otherwise I will mark this resolved.

reporter (changed status to resolved):
I tried std::vector using the UPC++ replacement for std::allocator, and ran into similar issues initially, but didn't spend much time on it (since std::vector can't be used with RMA anyway).

I was able to get a working version using the upcxx::new_array<Message<>>(BUFFER_MAX_SIZE) approach, and the running times were similar whether using the ibv or smp backend.