assert failure when mixing our Eigen with the release Eigen

Issue #1 open
Nathan Michael created an issue

The issue priority is major as it breaks builds when using the system version of eigen.

Does the 3rdparty eigen source include changes to the tagged 3.2.0 version distributed by eigen?

Macros.h suggests 3.2.0 (e.g. the following two files are the same): gtsam: Eigen/src/Core/util/Macros.h https://bitbucket.org/gtborg/gtsam/src/191d18290d3a6ab106362db0adf50e795c65d39d/gtsam/3rdparty/Eigen/Eigen/src/Core/util/Macros.h?at=develop

eigen: Eigen/src/Core/util/Macros.h https://bitbucket.org/eigen/eigen/src/ffa86ffb557094721ca71dcea6aed2651b9fd610/Eigen/src/Core/util/Macros.h?at=3.2.0

However, there is an implementation difference in the HouseHolderQR interface:

gtsam: Eigen/src/QR/HouseholderQR.h https://bitbucket.org/gtborg/gtsam/src/191d18290d3a6ab106362db0adf50e795c65d39d/gtsam/3rdparty/Eigen/Eigen/src/QR/HouseholderQR.h?at=develop

eigen: Eigen/src/QR/HouseholderQR.h https://bitbucket.org/eigen/eigen/src/ffa86ffb557094721ca71dcea6aed2651b9fd610/Eigen/src/QR/HouseholderQR.h?at=3.2.0

The attached patch will modify the HouseHolderQR.h file in gtsam to that of the equivalent tagged version Eigen (3.2.0).

This difference results in a build error when enabling the system Eigen vs the 3rdparty Eigen distributed with gtsam as the interface changes.

Comments (37)

  1. Nathan Michael reporter

    It looks like recent changes were made to leverage MKL except that they do not compile on my system. Based on the remarks in the commit log, I suspect the issue is that I'm building with clang and libc++ versus libstdc++. There appears to be something amiss with the template specialization but I haven't investigated thoroughly.

    Can anyone else build the current pull of the developer branch against the homebrew eigen 3.2.0 on OSX Mavericks?

  2. Nathan Michael reporter

    Yes. I should have reviewed the commit history for the HouseHolderQR.h file prior to asking. It is clearly changed in order to support the Intel MKL.

  3. Frank Dellaert

    This is Richard's change. Can you resolve it for now by compiling with the Eigen provided in gtsam?

  4. Richard Roberts

    Currently GTSAM won't build against the latest Eigen release - I have 3 patches that I submitted to Eigen but are not yet incorporated into the Eigen release:

    2 are only required when compiling with MKL (GTSAM_WITH_EIGEN_MKL) - fixing an incorrect return code (http://eigen.tuxfamily.org/bz/show_bug.cgi?id=705) - and an adjustment to compile-time MKL selection in Householder QR (http://eigen.tuxfamily.org/bz/show_bug.cgi?id=704).

    The other is an improvement in comma initialization (http://eigen.tuxfamily.org/bz/show_bug.cgi?id=716)

    The good point is that our patched Eigen should be binary-compatible with the 3.2.0 release. So can you compile GTSAM against its built-in Eigen while compiling the rest of your code against a system-wide Eigen (to avoid having to compile the rest of your code against the GTSAM-bundled Eigen).

  5. Nathan Michael reporter

    The iSAM incremental solver does not work when linking against gtsam if gtsam includes the 3rdparty eigen headers while the referencing library includes the system eigen headers. I believe this issue arose when moving from 3.1.2 to 3.2.0. This holds true on my system even if the eigen versions are the same. The solution I found that fixes the issue is to use only the system version. In the interim, I can locally revert the file in question as I don't require MKL to get past the issue.

  6. Frank Dellaert

    I can confirm the same behavior on my mac. So, Richard, confirm then that the GTSAM_USE_SYSTEM_EIGEN cannot be set to ON until Eigen releases at least the comma initialization?

  7. Richard Roberts

    Nate, I don't know why this would happen but I'm going to try to duplicate it. What do you mean that the solver doesn't work? Is there an error message you can paste?

    Frank, yes, GTSAM_USE_SYSTEM_EIGEN should now be removed until Eigen includes our patches, I'll do that. Though, unless I'm misunderstanding, GTSAM_USE_SYSTEM_EIGEN is not related to the problem Nate described in his last comment?

  8. Nathan Michael reporter

    Richard, I agree, "doesn't work" is not helpful. However, this issue appears to have drifted to two separate issues: (1) GTSAM_USE_SYSTEM_EIGEN cannot be used at this time as local interface changes to the 3rdparty Eigen are not consistent with the equivalent system version; and (2) the incremental iSAM solver fails due to an assert issue when building gtsam using the 3rdparty Eigen instance while the referencing instance includes the system Eigen.

    I suggest we close this issue and create two new issues focused on these two topics.

  9. Richard Roberts
    • changed status to open

    Reopening since this includes the info for the assert failure when mixing Eigen versions.

  10. Richard Roberts

    I'm trying to reproduce the error that occurs when mixing Eigen versions. Attached is a simple test that does not reproduce the error - it works properly. Do you know what I should modify in order to reproduce the error?

  11. Nathan Michael reporter

    Here is a step-by-step to reproduce the error, as well as the resulting error:

    mkdir gtsam_incremental_example && cd gtsam_incremental_example

    wget https://research.cc.gatech.edu/borg/sites/edu.borg/files/downloads/gtsam-2.3.1.tgz

    tar zxf gtsam-2.3.1.tgz

    mkdir gtsam-2.3.1/build && cd gtsam-2.3.1/build

    cmake .. -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=pwd/.. -DGTSAM_USE_SYSTEM_EIGEN=OFF -DGTSAM_BUILD_EXAMPLES=OFF -DGTSAM_BUILD_WRAP=OFF -DGTSAM_INSTALL_WRAP=OFF -DGTSAM_INSTALL_MATLAB_TOOLBOX=OFF -DCMAKE_CXX_FLAGS="-ftemplate-depth=1024"

    make -j6 -l6 install

    cd ../..

    cp gtsam-2.3.1/examples/VisualISAM2Example.cpp .

    c++ -Igtsam-2.3.1/include -c VisualISAM2Example.cpp -o VisualISAM2Example.o

    c++ -Lgtsam-2.3.1/lib -o VisualISAM2Example VisualISAM2Example.o -lgtsam -lboost_system

    ./VisualISAM2Example

    Output:

    Assertion failed: (((!PanelMode) && stride==0 && offset==0) || (PanelMode && stride>=depth && offset<=stride)), function operator(), file gtsam-2.3.1/include/gtsam/3rdparty/Eigen/Eigen/src/Core/products/GeneralBlockPanelKernel.h, line 1218. Abort trap: 6

  12. Richard Roberts

    Note that the commands you pasted use gtsam 2.3.1, which does not use a modified version of Eigen. So as far as I can tell, this error is unrelated to using a system-installed Eigen or not.

    I'm unable to reproduce this error - the commands you pasted work properly on my system, but I have some thoughts on what could be causing it on your system:

    One possibility is runtime linking against a gtsam library of a different version, or different build type, than the compile-time library. Sometimes you will get errors if you build against a Release library and run against a Debug one - the STL iterators are different.

    Just so I understand, you ultimately want to use gtsam 3.0 (which does have a modifed Eigen), right?

  13. Richard Roberts

    Ok, it doesn't seem like this asset failure is related to the GTSAM version, but let me know if you're still having this problem, if so the best way to investigate is to Skype about it I think.

  14. Nathan Michael reporter

    I believe the issue is due to clang optimization.

    The debug process, starting with the error, follows:

    Assertion failed: (((!PanelMode) && stride==0 && offset==0) || (PanelMode && stride>=depth && offset<=stride)), function operator(), file gtsam-2.3.1/include/gtsam/3rdparty/Eigen/Eigen/src/Core/products/GeneralBlockPanelKernel.h, line 1218.
    

    Note that the call is:

    Eigen::internal::gemm_pack_rhs<double, long, 4, 0, false, false>::operator() (this=0x100747e00, blockB=0x100a00790, rhs=0x100b02720, rhsStride=2, depth=2, cols=10, stride=4399662, offset=35596697357744) at GeneralBlockPanelKernel.h:1218
    

    The function declaration is:

    template<typename Scalar, typename Index, int nr, bool Conjugate, bool PanelMode>
    EIGEN_DONT_INLINE void gemm_pack_rhs<Scalar, Index, nr, ColMajor, Conjugate, PanelMode>
      ::operator()(Scalar* blockB, const Scalar* rhs, Index rhsStride, Index depth, Index cols, Index stride, Index offset)
    

    Therefore, the assert failing is:

    (!PanelMode) && stride==0 && offset==0
    

    with

    PanelMode = false
    stride = 4399662
    offset = 35596697357744
    

    The stride and offset numbers look bogus given the problem description. Tracing the call function:

    Eigen::internal::general_matrix_matrix_product<long, double, 1, false, double, 0, false, 0>::run ()
    

    L171, GeneralMatrixMatrix.h, without OpenMP:

    pack_rhs(blockB, &rhs(k2,0), rhsStride, actual_kc, cols);
    

    Which uses the function declaration in GeneralBlockPanelKernel.h:

    template<typename Scalar, typename Index, int nr, bool Conjugate, bool PanelMode>
    struct gemm_pack_rhs<Scalar, Index, nr, ColMajor, Conjugate, PanelMode>
    {
      typedef typename packet_traits<Scalar>::type Packet;
      enum { PacketSize = packet_traits<Scalar>::size };
      EIGEN_DONT_INLINE void operator()(Scalar* blockB, const Scalar* rhs, Index rhsStride, Index depth, Index cols, Index stride=0, Index offset=0);
    };
    

    So, clearly the call to:

      EIGEN_DONT_INLINE void operator()(Scalar* blockB, const Scalar* rhs, Index rhsStride, Index depth, Index cols, Index stride=0, Index offset=0);
    

    is not setting stride and offset to zero as expected by the default values.

    Why would the default parameters be ignored? I believe clang is trying to be overly intelligent. Note that stride and offset are only required if PanelMode = true. It is likely that clang chooses to not initialize the memory to zero, even if the default value is set in the declaration.

    The easiest way to test the hypothesis is to introduce an arbitrary usage to force the logic initialization but does nothing.

    Modify the definition of gemm_pack_rhs (L1221) as follows:

    From:

      Index count = 0;
      for(Index j2=0; j2<packet_cols; j2+=nr)
      {
        // skip what we have before
        if(PanelMode) count += nr * offset;
    

    To:

      Index count = 0;
      Index tmp = 0;
      for(Index j2=0; j2<packet_cols; j2+=nr)
      {
        // skip what we have before
        if(PanelMode)
          count += nr * offset;
        else
          tmp += stride + offset;
    

    It works! Why clang ignores the usage in the assert case is unclear, perhaps it is a bug with the compiler.

    Note that the logic check:

    (!PanelMode) && stride==0 && offset==0
    

    is pointless as both stride and offset are only used if PanelMode = true. Therefore, the best fix is to change the assert condition to:

      eigen_assert((!PanelMode) || (PanelMode && stride>=depth && offset<=stride));
    

    at L1218 and L1269.

    Passing -DNDEBUG would also silence the assert but at the cost of not being able to use debug assert elsewhere.

    I'm closing the issue as it is now answered.

  15. Richard Roberts

    Nate, do you have time to provide a backtrace from gdb on this assert failure so I can see if there's something we can do to work around it instead of modifying Eigen?

    A long shot idea - do you think it's possible that this is occurring because the Eigen function is compiled into a library (e.g. GTSAM) in debug mode, enabling the assert, but the code calling the function is complied in release mode? I'm wondering if this would make clang assume the arguments would be ignored when in fact they are not.

  16. Nathan Michael reporter

    I agree; moving the fix to outside of Eigen is ideal. Unfortunately, it is an internal issue. I have not narrowed down the list of clang optimization features to identify which is the culprit.

    For now, the options appear to be:

    • Apply the patch.
    • Compile gtsam without aggressive optimizations which is accomplished by setting build type to debug.
    • Compile gtsam with release build type and compile calling software with NDEBUG.
  17. Frank Dellaert

    There is a fourth option, maybe, as they are just templates? If this bug appears only at one or two places in the code, could you compile just that call with NDEBUG ? Maybe that is what you meant...

  18. Richard Roberts

    Another option could be to specifically disable the clang optimization that's causing the problem - we could do this in GTSAM if that would fix the problem?

  19. Nathan Michael reporter

    @fdellaert Yes, adding:

    #define NDEBUG
    

    is equivalent to passing it at compile time for the calling routine compilation.

    @richardroberts Yes, one could identify the clang optimization feature but I have not done so.

  20. Richard Roberts

    @fdellaert and @nmichael, I'm also wondering if it would be good to guard all of our header files with macros to set NDEBUG to whatever state it was when GTSAM was compiled - because some of our headers contain inline code that would be compiled with the calling project's NDEBUG state instead of GTSAM's.

    @nmichael, when you have time, could you please let me know which GTSAM functions you're calling that are causing trouble? I'm going to try to understand exactly what's going on and whether there's something we can or should do about it inside of GTSAM versus in a calling project.

  21. Richard Roberts

    I could not find any other libraries that do that, no. Though our case of a mixed header/compiled library is somewhat unusual, and it is essentially that aspect that is causing the problem, I believe.

  22. Richard Roberts

    Hi @nmichael, I'm looking at disabling the specific compiler optimization that causes this assert failure (due to uninitialized default arguments). Since I'm unable to reproduce it, are you available/able to test a potential fix?

  23. Log in to comment