Possible race condition in Event.hh causing Segmentation Fault

Issue #2332 new
Jørgen Nordmoen created an issue

This problem relates to a question on Gazebo Answers, which basically boils down to Gazebo segfaulting.

I have tried to debug the problem to the best of my abilities and have found that it occurs in Event.hh. See the stack trace below:

(gdb) bt
#0  std::atomic<bool>::operator bool (this=0x0) at /home/jorgehn/gazebo/gazebo/physics/World.cc:740
#1  gazebo::event::EventT<void (gazebo::common::UpdateInfo const&)>::Signal<gazebo::common::UpdateInfo>(gazebo::common::UpdateInfo const&) (_p=..., this=<optimized out>)
    at /home/jorgehn/gazebo/gazebo/common/Event.hh:381
#2  gazebo::event::EventT<void (gazebo::common::UpdateInfo const&)>::operator()<gazebo::common::UpdateInfo>(gazebo::common::UpdateInfo const&) (_p=..., this=<optimized out>)
    at /home/jorgehn/gazebo/gazebo/common/Event.hh:216
#3  gazebo::physics::World::Update (this=this@entry=0xbd7950) at /home/jorgehn/gazebo/gazebo/physics/World.cc:740
#4  0x00007ffff62bf6cf in gazebo::physics::World::Step (this=this@entry=0xbd7950) at /home/jorgehn/gazebo/gazebo/physics/World.cc:672
#5  0x00007ffff62bfb45 in gazebo::physics::World::RunLoop (this=0xbd7950) at /home/jorgehn/gazebo/gazebo/physics/World.cc:481
#6  0x00007ffff40745d5 in ?? () from /usr/lib/x86_64-linux-gnu/libboost_thread.so.1.58.0
#7  0x00007ffff65dd6ba in start_thread (arg=0x7fff6e3fa700) at pthread_create.c:333
#8  0x00007ffff6be63dd in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
(gdb) whatis iter.second->on
Attempt to take address of value not located in memory.

I do not know the cause of this null pointer. My suspicion is that I reset the world quite often, which causes a race condition that eventually manifests as a null pointer dereference.
The reason I suspect this is that a mutex is used in the EventT<T>::Cleanup method of Event.hh. The mutex seems to be designed to protect connections, but it is not used in any other method. This led me to believe that the race condition happens when connections is altered during a world reset. To test this I put a std::lock_guard in every method in Event.hh and ran my program again. This time the race condition never happened (I ran the program for several hours). To partially verify this, I removed the std::lock_guards again and the program crashed within 20 minutes of running.
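
To make the suspected interleaving concrete, here is a minimal sketch of the situation as I understand it. It is a toy stand-in, not Gazebo's actual Event.hh; ToyEvent and its members are made-up names for illustration only.

#include <functional>
#include <map>
#include <mutex>

// Toy stand-in for EventT<T>: a mutex exists, but (as in the pre-patch
// Event.hh) nothing except a Cleanup-like path would use it.
class ToyEvent
{
  public:
  // World-reset path: mutates the map with no lock held.
  int Connect(std::function<void()> _cb)
  {
    int id = this->nextId++;
    this->connections[id] = std::move(_cb);  // races with Signal()
    return id;
  }

  void Disconnect(int _id)
  {
    this->connections.erase(_id);            // races with Signal()
  }

  // Physics-update path: iterates the map with no lock held. If another
  // thread runs Connect()/Disconnect() at the same time, iterators can be
  // invalidated or a half-built entry can be read, which is consistent
  // with the null dereference in the backtrace. The patch below adds
  //   std::lock_guard<std::mutex> lock(this->mutex);
  // here and in the mutating methods.
  void Signal()
  {
    for (auto &iter : this->connections)
      iter.second();
  }

  private:
  std::mutex mutex;
  std::map<int, std::function<void()>> connections;
  int nextId = 0;
};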

The mutex in Event.hh has been present since Gazebo 7.0.0 (which is the version I first tried) and is still present on master in this repository. However, no other method in EventT<T> uses this mutex.

Because Gazebo is such a large and complicated piece of software, I have not been able to verify that this absolutely solves the problem. I hope someone more knowledgeable about Gazebo can take a look and judge whether the problem is realistic (could there ever be a race condition with Signal?) and how it should be solved (putting std::lock_guards all over the place is error prone, as this problem demonstrates, if that is indeed the cause).
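
One shape that might avoid sprinkling locks everywhere, sketched against the ToyEvent stand-in above rather than Gazebo's real API, so this is only an illustration: have Signal copy the callbacks while holding the mutex and invoke them with the mutex released, so the locking lives in one place and a subscriber that connects or disconnects from inside its callback cannot deadlock on the same mutex.

#include <vector>

// Replacement for the Signal() in the sketch above (drop the inline body
// there and keep the declaration): copy the callbacks while holding the
// mutex, then invoke them with the mutex released. Connect()/Disconnect()
// would take the same lock, as in the patch below.
void ToyEvent::Signal()
{
  std::vector<std::function<void()>> snapshot;
  {
    std::lock_guard<std::mutex> lock(this->mutex);
    snapshot.reserve(this->connections.size());
    for (auto &iter : this->connections)
      snapshot.push_back(iter.second);
  }

  for (auto &cb : snapshot)
    cb();  // runs without the mutex held
}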

The "fix" I tested is based on parent 28214:a29d8dabed9a gazebo7_7.0.0:

diff -r a29d8dabed9a gazebo/common/Event.hh
--- a/gazebo/common/Event.hh    Tue Jan 26 01:05:38 2016 +0100
+++ b/gazebo/common/Event.hh    Thu Aug 17 10:49:33 2017 +0200
@@ -358,6 +358,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -373,6 +374,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -389,6 +391,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -406,6 +409,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -425,6 +429,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -446,6 +451,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -468,6 +474,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -491,6 +498,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -516,6 +524,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -545,6 +554,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -576,6 +586,7 @@
       {
         this->Cleanup();

+        std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
         this->myDataPtr->signaled = true;
         for (auto iter: this->myDataPtr->connections)
         {
@@ -616,6 +627,7 @@
     template<typename T>
     ConnectionPtr EventT<T>::Connect(const boost::function<T> &_subscriber)
     {
+      std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
       int index = 0;
       if (!this->myDataPtr->connections.empty())
       {
@@ -645,6 +657,7 @@
     template<typename T>
     unsigned int EventT<T>::ConnectionCount() const
     {
+      std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
       return this->myDataPtr->connections.size();
     }

@@ -653,6 +666,7 @@
     template<typename T>
     void EventT<T>::Disconnect(int _id)
     {
+      std::lock_guard<std::mutex> lock(this->myDataPtr->mutex);
       // Find the connection
       auto const &it = this->myDataPtr->connections.find(_id);

I have not been able to test this on master or with Gazebo8 because of an unrelated bug which stops my setup from running.
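
If it helps anyone trying to reproduce this outside a full simulation, here is a hypothetical stress test written against the ToyEvent sketch above (not a Gazebo test case): one thread signals in a tight loop while the other churns connections, mimicking repeated world resets. Without a lock in Signal this tends to crash quickly or report a data race under ThreadSanitizer (compile with -fsanitize=thread).

#include <atomic>
#include <thread>

// Hypothetical stress test for the ToyEvent sketch above (not a Gazebo
// test case). One thread signals continuously while the main thread
// churns the connection list, mimicking repeated world resets.
int main()
{
  ToyEvent event;
  std::atomic<bool> done{false};

  // Stand-in for the physics update loop calling Signal().
  std::thread signaler([&]
  {
    while (!done)
      event.Signal();
  });

  // Stand-in for repeated world resets adding/removing subscribers.
  for (int i = 0; i < 100000; ++i)
  {
    int id = event.Connect([] { /* no-op subscriber */ });
    event.Disconnect(id);
  }

  done = true;
  signaler.join();
  return 0;
}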

Comments (2)

  1. Jørgen Nordmoen reporter

    After some more testing I think this solves the crashing that I and the other poster of the Gazebo Answers question were seeing. I have now run several experiments with Gazebo (5 repetitions with a parallel runtime of 6+ hours) with no crashing, something I had been unable to do before due to the random crashes. I tested the patch on the latest Gazebo 7.8.1 and it works there as well.
