TBB malloc?

Issue #125 wontfix
Frank Dellaert created an issue

This question is especially for @cbeall3 and @richardroberts. I read "Automatically Replacing malloc and Other C/C++ Functions for Dynamic Memory Allocation" and other pages that maintain that TBB malloc is way faster than the built-in malloc. When we link/compile with TBB, are we then also using it? I think not, right?

BTW, I tried a drop-in replacement with DYLD_INSERT_LIBRARIES as described here, but:

gtsam_unstable/timing>setenv DYLD_INSERT_LIBRARIES /opt/intel/tbb/lib/libtbbmalloc_proxy.dylib
gtsam_unstable/timing>./timeCameraExpression
Segmentation fault

Comments (16)

  1. Chris Beall

    We actually are using the TBB allocator by default when TBB is enabled. As part of configuring GTSAM an allocator is selected, and this is used in FastList, FastVector, etc. There's also an advanced CMake flag to make your own selection: GTSAM_DEFAULT_ALLOCATOR. When building without TBB you have a choice of STL and BoostPool allocators; when TBB is enabled, the allocator must be TBB. Other objects use the default allocator, and Eigen also does its own thing because word alignment is important there. IIRC, the TBB/BoostPool allocators are not compatible with Eigen's alignment requirements.
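
    For illustration, the compile-time selection boils down to something like the sketch below; the macro and alias names here are made up for the example, not GTSAM's actual headers:

    #include <list>

    #if defined(GTSAM_ALLOCATOR_TBB)
      #include <tbb/tbb_allocator.h>
      template <typename T> using DefaultAllocator = tbb::tbb_allocator<T>;
    #elif defined(GTSAM_ALLOCATOR_BOOSTPOOL)
      #include <boost/pool/pool_alloc.hpp>
      template <typename T> using DefaultAllocator = boost::fast_pool_allocator<T>;
    #else
      #include <memory>
      template <typename T> using DefaultAllocator = std::allocator<T>;
    #endif

    // A FastList-style alias then picks up whichever allocator was selected:
    template <typename T>
    using FastList = std::list<T, DefaultAllocator<T>>;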

  2. Frank Dellaert reporter

    So, no; what I'm getting at is: can we not simply do some CMake magic to replace malloc with TBB's malloc, instead of requiring the user to do it with DYLD_INSERT_LIBRARIES?
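
    For what it's worth, Intel's documentation on malloc replacement also describes a link-time route that avoids the environment variable entirely: include the proxy header in one source file and link against the tbbmalloc_proxy library. A minimal sketch:

    // Including this header (and linking tbbmalloc_proxy) replaces
    // malloc/free and global new/delete for the whole program.
    #include <tbb/tbbmalloc_proxy.h>

    int main() {
      int* p = new int[1000];  // now served by the TBB scalable allocator
      delete[] p;
      return 0;
    }

    On Linux, linking with -ltbbmalloc_proxy alone should suffice; the header include is the documented route on Windows.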

  3. Chris Beall

    According to this page, it's not directly supported, but it could be done indirectly by running a shell script that modifies the environment. In any case, I ran your timing script with LD_PRELOAD=libtbbmalloc_proxy.so.2, and saw no real difference:

    with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):

    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.32 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.83 musecs/call
    GenericProjectionFactor<P,P>: 1.18 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.72 musecs/call
    Binary(Leaf,Leaf) : 1.49 musecs/call
    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.28 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.09 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
    GenericProjectionFactor<P,P>: 1.18 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
    Binary(Leaf,Leaf) : 1.54 musecs/call
    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.92 musecs/call
    GenericProjectionFactor<P,P>: 1.19 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.78 musecs/call
    Binary(Leaf,Leaf) : 1.55 musecs/call
    

    without:

    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.3 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.05 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
    GenericProjectionFactor<P,P>: 1.17 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.71 musecs/call
    Binary(Leaf,Leaf) : 1.56 musecs/call
    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.22 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 1.94 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.75 musecs/call
    GenericProjectionFactor<P,P>: 1.09 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.6 musecs/call
    Binary(Leaf,Leaf) : 1.46 musecs/call
    cbeall3@panther:~/git/gtsam/build/gtsam_unstable/timing$ ./timeCameraExpression 
    GeneralSFMFactor2<Cal3_S2> : 1.26 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.04 musecs/call
    Ternary(Leaf,Leaf,Leaf) : 1.85 musecs/call
    GenericProjectionFactor<P,P>: 1.11 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 1.74 musecs/call
    Binary(Leaf,Leaf) : 1.52 musecs/call
    
  4. Frank Dellaert reporter

    Chris, I just checked in a multi-threaded version of timeCameraExpression. Shockingly, it runs much slower than the single-threaded version. The new/delete calls are very slow and are the cause, and more complicated expressions are punished more heavily, as they allocate a larger trace (the pattern is sketched at the end of this comment):

    GeneralSFMFactor2<Cal3_S2>  : 10.6 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 24.9 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 13.7 musecs/call
    GenericProjectionFactor<P,P>: 4.6 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 6.73 musecs/call
    Binary(Leaf,Leaf)           : 5.71 musecs/call
    

    Could you see whether the libtbbmalloc_proxy makes a big difference now?
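
    For concreteness, the hot pattern is roughly this (an illustrative sketch, not the actual timeCameraExpression code): every parallel linearize call allocates its own trace on the heap, so all worker threads contend on malloc.

    #include <cstddef>
    #include <vector>
    #include <tbb/parallel_for.h>

    // Illustrative: each parallel linearize call news its own trace buffer.
    void linearizeAll(std::size_t numFactors) {
      tbb::parallel_for(std::size_t(0), numFactors, [](std::size_t i) {
        // Per-call heap allocation: the new/delete hotspot described above.
        // Bigger expressions allocate a bigger trace, so they pay more.
        std::vector<double> trace(256);
        trace[i % trace.size()] = 1.0;  // stand-in for recording derivatives
      });
    }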

  5. Chris Beall

    No difference. Makes me wonder if the preloading is even having any effect (a crude check is sketched after the timings). Or maybe the implementation of malloc on this system isn't that horrible after all. Your timings also have a lot more deviation.

    with TBB replacement (LD_PRELOAD=libtbbmalloc_proxy.so.2):

    GeneralSFMFactor2<Cal3_S2>  : 4.46 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.65 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 4.48 musecs/call
    GenericProjectionFactor<P,P>: 2.73 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.84 musecs/call
    Binary(Leaf,Leaf)           : 3.54 musecs/call
    

    without:

    GeneralSFMFactor2<Cal3_S2>  : 4.4 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.7 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 4.53 musecs/call
    GenericProjectionFactor<P,P>: 2.71 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.85 musecs/call
    Binary(Leaf,Leaf)           : 3.56 musecs/call
    
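    One crude way to rule out a silently failing preload on Linux is to scan the process's own memory map for the proxy library; a throwaway sketch:

    #include <fstream>
    #include <iostream>
    #include <string>

    int main() {
      // If LD_PRELOAD took effect, libtbbmalloc_proxy appears in /proc/self/maps.
      std::ifstream maps("/proc/self/maps");
      std::string line;
      bool loaded = false;
      while (std::getline(maps, line))
        if (line.find("tbbmalloc") != std::string::npos) loaded = true;
      std::cout << (loaded ? "tbbmalloc loaded\n" : "tbbmalloc NOT loaded\n");
      return 0;
    }
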
  6. Frank Dellaert reporter

    Thanks!!! Hmmmmm. Note also that the multi-threaded version (basically, the TBB-parallelized linearize) is still three times slower on your machine (how many cores?). So, indeed, new/delete seem to be fine for you, but TBB multi-threading still incurs a considerable penalty. Any theories, @richardroberts ?

  7. Frank Dellaert reporter

    @cbeall3 Can you run the timing again on the same machine? I basically implemented my own memory allocation for a crucial part (the idea is sketched after the timings). By now the AD is only a small part of linearize; the rest is new/malloc in the GTSAM part. My timing now is:

    GeneralSFMFactor2<Cal3_S2>  : 11.4 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 10.2 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 10.5 musecs/call
    GenericProjectionFactor<P,P>: 4.47 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 5.39 musecs/call
    Binary(Leaf,Leaf)           : 5.16 musecs/call
    
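    The idea behind the custom allocation is roughly the following (an illustrative sketch, not the actual implementation): hand out trace storage from one pre-allocated block per call, so the inner loop never touches new/delete.

    #include <cstddef>
    #include <vector>

    // Bump allocator for trace records; bounds checking omitted for brevity.
    class TraceArena {
      std::vector<char> buffer_;
      std::size_t offset_ = 0;
     public:
      explicit TraceArena(std::size_t bytes) : buffer_(bytes) {}
      void* allocate(std::size_t n) {
        void* p = buffer_.data() + offset_;
        offset_ += n;  // no lock, no heap call
        return p;
      }
      void reset() { offset_ = 0; }  // reuse the block for the next factor
    };
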
  8. Chris Beall

    Updated: I copied/pasted the wrong thing earlier; these are the corrected timings:

    GeneralSFMFactor2<Cal3_S2>  : 4.49 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 4.3 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 4.36 musecs/call
    GenericProjectionFactor<P,P>: 2.76 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 3.56 musecs/call
    Binary(Leaf,Leaf)           : 3.55 musecs/call
    

    When I restrict the number of threads, it actually gets better (I used 3 here; the snippet at the end of this comment shows how):

    GeneralSFMFactor2<Cal3_S2>  : 2.05 musecs/call
    Bin(Leaf,Un(Bin(Leaf,Leaf))): 2.8 musecs/call
    Ternary(Leaf,Leaf,Leaf)     : 2.71 musecs/call
    GenericProjectionFactor<P,P>: 1.7 musecs/call
    Bin(Cnst,Un(Bin(Leaf,Leaf))): 2.34 musecs/call
    Binary(Leaf,Leaf)           : 2.23 musecs/call
    

    That's on an Intel i7 with 8 cores. I'm unsure about side effects of custom allocators with respect to Eigen and TBB, but we can try it.
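
    For reference, with the TBB of that era the thread count can be restricted like this (a minimal sketch; oneTBB later replaced this API with tbb::global_control):

    #include <tbb/task_scheduler_init.h>

    int main() {
      tbb::task_scheduler_init init(3);  // limit TBB to 3 worker threads
      // ... run the timing workload here ...
      return 0;
    }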

  9. Frank Dellaert reporter

    Aha! Just saw your edit. Basically the same. Yes, multi-threading seems to hurt linearization. [Earlier: Thanks! But is that on the same machine? The performance of the old factors, like GeneralSFMFactor2<Cal3_S2>, should be the same as before, as I did not touch those :-)]

  10. Chris Beall

    I accidentally copied/pasted the wrong timings from using only 3 threads. I just updated my previous post to reflect the correct timings.
