TBB malloc?
Especially @cbeall3 and @richardroberts. I read "Automatically Replacing malloc and Other C/C++ Functions for Dynamic Memory Allocation" and other pages maintaining that TBB malloc is much faster than the built-in malloc. When we link/compile with TBB, are we then also using it? I think not, right?
BTW, I tried a drop-in replacement with DYLD_INSERT_LIBRARIES as described here, but:
gtsam_unstable/timing>setenv DYLD_INSERT_LIBRARIES /opt/intel/tbb/lib/libtbbmalloc_proxy.dylib
gtsam_unstable/timing>./timeCameraExpression
Segmentation fault
Comments (16)
-
reporter But that is only for STL collections, no?
-
Yup
-
reporter So, what about new/delete of objects?
-
You probably want to try replacing make_shared with allocate_shared, which lets you specify your own allocator. http://www.boost.org/doc/libs/1_56_0/libs/smart_ptr/make_shared.html
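To illustrate the suggestion above, here is a minimal sketch of the allocate_shared pattern. It uses std::allocate_shared, which mirrors the boost::allocate_shared API linked above; the Node type and makeNode helper are hypothetical stand-ins, and with TBB one would pass a TBB allocator instead of std::allocator:

```cpp
#include <memory>

// Hypothetical payload standing in for an expression-trace record.
struct Node { double jacobian[4]; };

// allocate_shared obtains memory for the object AND the shared_ptr
// control block in one allocation from the supplied allocator, so a
// pool or TBB allocator would serve both.
template <typename Alloc>
std::shared_ptr<Node> makeNode(const Alloc& alloc) {
  return std::allocate_shared<Node>(alloc);
}

// Convenience wrapper using the default allocator (a placeholder for
// whatever allocator the build actually selects).
inline std::shared_ptr<Node> makeNodeDefault() {
  return makeNode(std::allocator<Node>());
}
```

With TBB enabled, swapping std::allocator for a TBB scalable allocator at this one call site would route these allocations through the scalable pool without touching the rest of the code.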
-
reporter So, no, what I'm getting at, can we not simply do some cmake magic to replace malloc with tbb's malloc, instead of requiring the user to do it with DYLD_INSERT_LIBRARIES?
-
According to this page it's not directly supported, but could be done indirectly by running a shell script which modifies the environment. In any case, I ran your timing script with LD_PRELOAD=libtbbmalloc_proxy.so.2, and saw no real difference:
with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.32 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.83 musecs/call
GenericProjectionFactor<P,P>: 1.18 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.72 musecs/call
Binary(Leaf,Leaf) : 1.49 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.28 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.09 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.18 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
Binary(Leaf,Leaf) : 1.54 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.92 musecs/call
GenericProjectionFactor<P,P>: 1.19 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.78 musecs/call
Binary(Leaf,Leaf) : 1.55 musecs/call
without:
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.17 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.71 musecs/call
Binary(Leaf,Leaf) : 1.56 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.22 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 1.94 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.75 musecs/call
GenericProjectionFactor<P,P>: 1.09 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.6 musecs/call
Binary(Leaf,Leaf) : 1.46 musecs/call
cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression
GeneralSFMFactor2<Cal3_S2> : 1.26 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.04 musecs/call
Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
GenericProjectionFactor<P,P>: 1.11 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
Binary(Leaf,Leaf) : 1.52 musecs/call
-
reporter Chris, I just checked in a multi-threaded version of timeCameraExpression. Shockingly, it runs much slower than the single-threaded version. The new/delete calls are very slow and are the cause. More complicated expressions are punished more heavily, as they allocate a larger trace:
GeneralSFMFactor2<Cal3_S2> : 10.6 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 24.9 musecs/call
Ternary(Leaf,Leaf,Leaf) : 13.7 musecs/call
GenericProjectionFactor<P,P>: 4.6 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 6.73 musecs/call
Binary(Leaf,Leaf) : 5.71 musecs/call
Could you see whether the libtbbmalloc_proxy makes a big difference now?
-
No difference. Makes me wonder if the preloading is even having any effect. (Or whether the implementation of malloc on this system isn't that horrible after all.) Your timings also have a lot more deviation.
with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):
GeneralSFMFactor2<Cal3_S2> : 4.46 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.65 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.48 musecs/call
GenericProjectionFactor<P,P>: 2.73 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.84 musecs/call
Binary(Leaf,Leaf) : 3.54 musecs/call
without:
GeneralSFMFactor2<Cal3_S2> : 4.4 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.7 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.53 musecs/call
GenericProjectionFactor<P,P>: 2.71 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.85 musecs/call
Binary(Leaf,Leaf) : 3.56 musecs/call
-
reporter Thanks!!! Hmmmmm. Note also that the multi-threaded version (basically, the TBB-parallelized linearize) is still three times slower on your machine (how many cores?). So, indeed, new/delete seem to be fine for you, but TBB multi-threading still incurs a considerable penalty. Any theories, @richardroberts ?
-
reporter BTW, I came across this page on the poor performance of malloc on Mac OS. Is linking with a (thread-safe) version of Doug Lea's malloc something we should consider?
-
reporter @cbeall3 Can you run the timing again on the same machine? I basically implemented my own memory allocation for a crucial part. By now the AD is only a small part of linearize; the rest is new/malloc calls in the GTSAM part. My timings now are:
GeneralSFMFactor2<Cal3_S2> : 11.4 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 10.2 musecs/call
Ternary(Leaf,Leaf,Leaf) : 10.5 musecs/call
GenericProjectionFactor<P,P>: 4.47 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 5.39 musecs/call
Binary(Leaf,Leaf) : 5.16 musecs/call
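For context, "my own memory allocation" for a trace typically means a bump-pointer arena: grab one block up front, hand out slices, and release everything at once. The sketch below is a minimal illustration of that idea, not GTSAM's actual implementation:

```cpp
#include <cstddef>
#include <vector>

// Minimal bump-pointer arena (illustrative sketch only). All slices
// come from one contiguous buffer, so per-object new/delete overhead
// and allocator lock contention disappear.
class Arena {
 public:
  explicit Arena(std::size_t bytes) : buffer_(bytes), offset_(0) {}

  // Hand out an aligned slice, or nullptr when the arena is exhausted.
  void* allocate(std::size_t bytes,
                 std::size_t align = alignof(std::max_align_t)) {
    std::size_t aligned = (offset_ + align - 1) / align * align;
    if (aligned + bytes > buffer_.size()) return nullptr;
    offset_ = aligned + bytes;
    return buffer_.data() + aligned;
  }

  // Free everything at once, in O(1) -- the point of the scheme.
  void reset() { offset_ = 0; }

  std::size_t used() const { return offset_; }

 private:
  std::vector<unsigned char> buffer_;
  std::size_t offset_;
};
```

One arena per thread sidesteps the cross-thread contention that the multi-threaded timings above suffer from, since no locking is needed on the hot path.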
-
Updated: Copied/pasted the wrong thing:
GeneralSFMFactor2<Cal3_S2> : 4.49 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.3 musecs/call
Ternary(Leaf,Leaf,Leaf) : 4.36 musecs/call
GenericProjectionFactor<P,P>: 2.76 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.56 musecs/call
Binary(Leaf,Leaf) : 3.55 musecs/call
When I restrict the number of threads, it actually gets better (used 3 here):
GeneralSFMFactor2<Cal3_S2> : 2.05 musecs/call
Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.8 musecs/call
Ternary(Leaf,Leaf,Leaf) : 2.71 musecs/call
GenericProjectionFactor<P,P>: 1.7 musecs/call
Bin(Cnst,Un(Bin(Leaf,Leaf))): 2.34 musecs/call
Binary(Leaf,Leaf) : 2.23 musecs/call
That's on an Intel i7 with 8 cores. I'm unsure about side effects of custom allocators with respect to Eigen and TBB, but we can try it.
-
reporter Aha! Just saw your edit. Basically the same. Yes, multi-threading seems to hurt linearization. [[Thanks! But is that on the same machine? The performance of the old factors, like GeneralSFMFactor2<Cal3_S2> should be the same as before, as I did not touch those :-)]]
-
I accidentally copied/pasted the wrong timings from using only 3 threads. I just updated my previous post to reflect the correct timings.
-
reporter - changed status to wontfix
I think if perftools' tcmalloc does the job, this could be resolved via issue #133.
-
We actually are using the TBB allocator by default when TBB is enabled. As part of configuring GTSAM an allocator is selected, and this is used in FastList, FastVector, etc. There's also an advanced CMake flag, GTSAM_DEFAULT_ALLOCATOR, to make your own selection. When building without TBB you have a choice of STL and BoostPool allocators; when TBB is enabled, the allocator must be TBB. Other objects use the default allocator, and Eigen does its own thing because word alignment is important there. IIRC, the TBB/BoostPool allocators are not compatible with Eigen's alignment requirements.
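The pattern described above can be sketched as a container alias parameterized on a configure-time allocator. This is only an illustration of the mechanism, not the actual GTSAM headers; std::allocator stands in for the CMake-selected allocator, which would be a TBB scalable allocator (e.g. tbb::tbb_allocator) when TBB is enabled:

```cpp
#include <cstddef>
#include <list>
#include <memory>

// Placeholder for the allocator chosen by GTSAM_DEFAULT_ALLOCATOR at
// configure time; with TBB enabled this would be a TBB allocator.
template <typename T>
using DefaultAllocator = std::allocator<T>;

// FastList-style alias: same interface as std::list, but every node
// allocation goes through the selected allocator.
template <typename T>
using FastListSketch = std::list<T, DefaultAllocator<T>>;

// Small demonstration that the alias behaves like a normal list.
inline std::size_t demoSize() {
  FastListSketch<int> values = {1, 2, 3};
  return values.size();
}
```

Note this only covers the Fast* containers; plain new/delete of other objects still goes through whatever global allocator is in effect, which is what the LD_PRELOAD experiment above was probing.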