Segfault when running on data sets with large number of poses

Issue #369 resolved
Wenqiang Zhou created an issue

The issue was found when trying to run gtsam to optimize a large pose graph with prior and between factors.

A simple test case as attached in testGtsam.cc:

  • Create a pose chain with N 3D poses (N > 58200), add some random noise to each pose's initial values

  • Add a prior factor for the first pose

  • Add a same between factor for each consecutive poses

  • Create a NonLinearFactorGraph to optimize the pose chain using LM and GN algorithm, both got segfault

The backtrace is

gtsam::GaussianFactorGraph::optimize(boost::optional<gtsam::Ordering const&>, boost::function<std::pair<boost::shared_ptr<gtsam::GaussianConditional>, boost::shared_ptr<gtsam::GaussianFactor> > (gtsam::GaussianFactorGraph const&, gtsam::Ordering const&)> const&) const () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3
gtsam::EliminateableFactorGraph<gtsam::GaussianFactorGraph>::eliminateMultifrontal(boost::optional<gtsam::Ordering const&>, boost::function<std::pair<boost::shared_ptr<gtsam::GaussianConditional>, boost::shared_ptr<gtsam::GaussianFactor> > (gtsam::GaussianFactorGraph const&, gtsam::Ordering const&)> const&, boost::optional<gtsam::VariableIndex const&>) const () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3
gtsam::EliminateableFactorGraph<gtsam::GaussianFactorGraph>::eliminateMultifrontal(boost::optional<gtsam::Ordering const&>, boost::function<std::pair<boost::shared_ptr<gtsam::GaussianConditional>, boost::shared_ptr<gtsam::GaussianFactor> > (gtsam::GaussianFactorGraph const&, gtsam::Ordering const&)> const&, boost::optional<gtsam::VariableIndex const&>) const () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3
gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::eliminate(boost::function<std::pair<boost::shared_ptr<gtsam::GaussianConditional>, boost::shared_ptr<gtsam::GaussianFactor> > (gtsam::GaussianFactorGraph const&, gtsam::Ordering const&)> const&) const () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3
?? () from /usr/lib/libtbb.so.2
?? () from /usr/lib/libtbb.so.2
gtsam::treeTraversal::internal::RootTask<gtsam::FastVector<boost::shared_ptr<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::Cluster> >, gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::Cluster, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> >, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> > (boost::shared_ptr<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::Cluster> const&, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> >&), gtsam::(anonymous namespace)::EliminationPostOrderVisitor<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> > >::execute() () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3
?? () from /usr/lib/libtbb.so.2
gtsam::treeTraversal::internal::PreOrderTask<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::Cluster, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> >, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> > (boost::shared_ptr<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph>::Cluster> const&, gtsam::(anonymous namespace)::EliminationData<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> >&), gtsam::(anonymous namespace)::EliminationPostOrderVisitor<gtsam::ClusterTree<gtsam::GaussianBayesTree, gtsam::GaussianFactorGraph> > >::execute() () from /home/wenqiang/mvs/software/build/lib/libgtsam.so.3

Comments (7)

  1. Wenqiang Zhou reporter

    I haven't figured out why it segfault. It happens in the Elimination step when trying to refer a null pointer. Since it only happened for large number of nodes, could it be a precision related issue?

  2. Frank Dellaert

    I hope you're OK with me adding your example as a stress test in PR #363. I got the same segmentation fault, but only after N>=110,000, indicating an out of memory problem. Switching to METIS ordering allowed for much larger N. I tested up to 1,000,000, which took 46 seconds per iteration on my (old) Mac.

  3. Wenqiang Zhou reporter

    Thanks for your reply. I got the segfault error on a machine with 32G RAM. I switched to a machine with 64G RAM and got the same segfault error with about the same N. I also used valgrind to check the memory usage for the test program, looks like the peak memory usage is only around 220 MB with N = 58000. So I don't think it's the memory problem or I did something wrong.

    I haven't tried the METIS ordering as I was using the old version, I will try that to see if it helps.

  4. Wenqiang Zhou reporter

    I switched to METIS ordering, and it allowed me to run up to 5800000 with a 64G RAM machine. The METIS ordering did help, but I still don't think it's memory related issue as using a machine with larger RAM did not help with my old test code.

  5. russell

    Just wanted to add that I had the same issue with N > 62,000 and 15GB of RAM with 2 GB of swap. I didn’t see any major spikes in memory usage but switching to METIS ordering did fix the issue. However, it was a bit difficult to understand what METIS is and how I was supposed to use it. So for any future people with many nodes I recommend the example here https://github.com/borglab/gtsam/blob/develop/examples/METISOrderingExample.cpp And documentation here: http://glaros.dtc.umn.edu/gkhome/metis/metis/overview

    It might be worth in general recommending METIS ordering for large factor graphs.

  6. Log in to comment