TBB malloc?
Especially @cbeall3 and @richardroberts. I read "Automatically Replacing malloc and Other C/C++ Functions for Dynamic Memory Allocation" and other pages maintaining that TBB malloc is much faster than the built-in malloc. When we link/compile with TBB, are we then also using it? I think not, right?
BTW, I tried a drop-in replacement with DYLD_INSERT_LIBRARIES as described here, but:
gtsam_unstable/timing>setenv DYLD_INSERT_LIBRARIES /opt/intel/tbb/lib/libtbbmalloc_proxy.dylib
gtsam_unstable/timing>./timeCameraExpression
Segmentation fault
Comments (16)
-
reporter But that is only for STL collections, no?
-
Yup
-
reporter So, what about new/delete of objects?
-
You probably want to try replacing make_shared with allocate_shared, which lets you specify your own allocator. http://www.boost.org/doc/libs/1_56_0/libs/smart_ptr/make_shared.html
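To illustrate the suggestion above, here is a minimal sketch of the allocate_shared pattern. It uses std::allocate_shared, which mirrors the boost::allocate_shared API linked above; the Node type and makeNode helper are hypothetical stand-ins, and with TBB one would pass a TBB allocator instead of std::allocator:

```cpp
#include <memory>

// Hypothetical payload standing in for an expression-trace record.
struct Node { double jacobian[4]; };

// allocate_shared obtains memory for the object AND the shared_ptr
// control block in one allocation from the supplied allocator, so a
// pool or TBB allocator would serve both.
template <typename Alloc>
std::shared_ptr<Node> makeNode(const Alloc& alloc) {
  return std::allocate_shared<Node>(alloc);
}

// Convenience wrapper using the default allocator (a placeholder for
// whatever allocator the build actually selects).
inline std::shared_ptr<Node> makeNodeDefault() {
  return makeNode(std::allocator<Node>());
}
```

With TBB enabled, swapping std::allocator for a TBB scalable allocator at this one call site would route these allocations through the scalable pool without touching the rest of the code.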
-
reporter So, no, what I'm getting at, can we not simply do some cmake magic to replace malloc with tbb's malloc, instead of requiring the user to do it with DYLD_INSERT_LIBRARIES?
-
According to this page it's not directly supported, but could be done indirectly by running a shell script which modifies the environment. In any case, I ran your timing script with LD_PRELOAD=libtbbmalloc_proxy.so.2, and saw no real difference:
with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.32 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.83 musecs/call
GenericProjectionFactor<P,P>: 1.18 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.72 musecs/call
Binary(Leaf,Leaf) : 1.49 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.28 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.09 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.18 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
Binary(Leaf,Leaf) : 1.54 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.92 musecs/call
GenericProjectionFactor<P,P>: 1.19 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.78 musecs/call
Binary(Leaf,Leaf) : 1.55 musecs/call
without:
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.17 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.71 musecs/call
Binary(Leaf,Leaf) : 1.56 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.22 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 1.94 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.75 musecs/call
GenericProjectionFactor<P,P>: 1.09 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.6 musecs/call
Binary(Leaf,Leaf) : 1.46 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.26 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.04 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.11 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
Binary(Leaf,Leaf) : 1.52 musecs/call
-
reporter Chris, I just checked in a multi-threaded version of timeCameraExpression. Shockingly, it runs much slower than the single-threaded version. The new/delete calls are very slow and are the cause. More complicated expressions are punished more heavily, as they allocate a larger trace:
GeneralSFMFactor2<Cal3_S2> : 10.6 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 24.9 musecs/call
Ternary(Leaf,Leaf,Leaf) : 13.7 musecs/call
GenericProjectionFactor<P,P>: 4.6 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 6.73 musecs/call
Binary(Leaf,Leaf) : 5.71 musecs/call
Could you see whether the libtbbmalloc_proxy makes a big difference now?
-
No difference. Makes me wonder if the preloading is even having any effect. (Or whether the implementation of malloc on this system isn't that horrible after all.) Your timings also have a lot more deviation.
with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):
GeneralSFMFactor2<Cal3_S2> : 4.46 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.65 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.48 musecs/call
GenericProjectionFactor<P,P>: 2.73 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.84 musecs/call
Binary(Leaf,Leaf) : 3.54 musecs/call
without:
GeneralSFMFactor2<Cal3_S2> : 4.4 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.7 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.53 musecs/call
GenericProjectionFactor<P,P>: 2.71 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.85 musecs/call
Binary(Leaf,Leaf) : 3.56 musecs/call
-
reporter Thanks!!! Hmmmmm. Note also that the multi-threaded version (basically, the TBB-parallelized linearize) is still three times slower on your machine (how many cores?). So, indeed, new/delete seem to be fine for you, but TBB multi-threading still incurs a considerable penalty. Any theories, @richardroberts ?
-
reporter BTW, I came across this page on the poor performance of malloc on Mac OS. Is linking with a (thread-safe) version of Doug Lea's malloc something we should consider?
-
reporter @cbeall3 Can you run the timing again on the same machine? I basically implemented my own memory allocation for a crucial part. By now the AD is only a small part of linearize; the rest is new/malloc calls in the GTSAM part. My timings now are:
GeneralSFMFactor2<Cal3_S2> : 11.4 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 10.2 musecs/call
Ternary(Leaf,Leaf,Leaf) : 10.5 musecs/call
GenericProjectionFactor<P,P>: 4.47 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 5.39 musecs/call
Binary(Leaf,Leaf) : 5.16 musecs/call
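For context, "my own memory allocation" for a trace typically means a bump-pointer arena: grab one block up front, hand out slices, and release everything at once. The sketch below is a minimal illustration of that idea, not GTSAM's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Minimal bump-pointer arena (illustrative sketch only). All slices
// come from one contiguous buffer, so per-object new/delete overhead
// and allocator lock contention disappear.
class Arena {
 public:
  explicit Arena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

  // Hand out an aligned slice, or nullptr when the arena is exhausted.
  void* allocate(std::size_t bytes,
                 std::size_t align = alignof(std::max_align_t)) {
    std::size_t aligned = (offset_ + align - 1) / align * align;
    if (aligned + bytes > buffer_.size()) return nullptr;
    offset_ = aligned + bytes;
    return buffer_.data() + aligned;
  }

  // Free everything at once, in O(1) -- the point of the scheme.
  void reset() { offset_ = 0; }

  std::size_t used() const { return offset_; }

 private:
  std::vector<unsigned char> buffer_;
  std::size_t offset_;
};
```

One arena per thread sidesteps the cross-thread contention that the multi-threaded timings above suffer from, since no locking is needed on the hot path.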
-
Updated: Copied/pasted the wrong thing:
GeneralSFMFactor2<Cal3_S2> : 4.49 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.3 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.36 musecs/call
GenericProjectionFactor<P,P>: 2.76 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.56 musecs/call
Binary(Leaf,Leaf) : 3.55 musecs/call
When I restrict the number of threads, it actually gets better (used 3 here):
GeneralSFMFactor2<Cal3_S2> : 2.05 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.8 musecs/call
Ternary(Leaf,Leaf,Leaf) : 2.71 musecs/call
GenericProjectionFactor<P,P>: 1.7 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 2.34 musecs/call
Binary(Leaf,Leaf) : 2.23 musecs/call
That's on an Intel i7 with 8 cores. I'm unsure about side effects of custom allocators with respect to Eigen and TBB, but we can try it.
-
reporter Aha! Just saw your edit. Basically the same. Yes, multi-threading seems to hurt linearization. [[Thanks! But is that on the same machine? The performance of the old factors, like GeneralSFMFactor2<Cal3_S2> should be the same as before, as I did not touch those :-)]]
-
I accidentally copied/pasted the wrong timings from using only 3 threads. I just updated my previous post to reflect the correct timings.
-
reporter - changed status to wontfix
I think if perftools' tcmalloc does the job, this could be resolved via issue #133.
-
We actually are using the TBB allocator by default when TBB is enabled. As part of configuring GTSAM an allocator is selected, and this is used in FastList, FastVector, etc. There's also an advanced CMake flag, GTSAM_DEFAULT_ALLOCATOR, to make your own selection. When building without TBB you have a choice of STL and BoostPool allocators; when TBB is enabled, the allocator must be TBB. Other objects use the default allocator, and Eigen does its own thing because word alignment is important there. IIRC, the TBB/BoostPool allocators are not compatible with Eigen's alignment requirements.
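The pattern described above can be sketched as a container alias parameterized on a configure-time allocator. This is only an illustration of the mechanism, not the actual GTSAM headers; std::allocator stands in for the CMake-selected allocator, which would be a TBB scalable allocator (e.g. tbb::tbb_allocator) when TBB is enabled:

```cpp
#include <cstddef>
#include <list>
#include <memory>

// Placeholder for the allocator chosen by GTSAM_DEFAULT_ALLOCATOR at
// configure time; with TBB enabled this would be a TBB allocator.
template <typename T>
using DefaultAllocator = std::allocator<T>;

// FastList-style alias: same interface as std::list, but every node
// allocation goes through the selected allocator.
template <typename T>
using FastListSketch = std::list<T, DefaultAllocator<T>>;

// Small demonstration that the alias behaves like a normal list.
inline std::size_t demoSize() {
  FastListSketch<int> values = {1, 2, 3};
  return values.size();
}
```

Note this only covers the Fast* containers; plain new/delete of other objects still goes through whatever global allocator is in effect, which is what the LD_PRELOAD experiment above was probing.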