Failing shutdown

Issue #21 resolved
Joakim Stien created an issue

During shutdown when no window is active, the application fails to shut down, and instead becomes unresponsive. This happens very frequently. None of the shutdown messages are displayed. The program most frequently producing this issue is 03-updating.

By just looking at the profiler output (produced with callgrind), this may seem like a probability issue, and that the problem may arise when closing a NOX-application of any kind. When running "03-updating", the executing spent 84% of the time inside Manager::updateProcesses (which you can see from the stack trace below is where it was when it received the SIGINT), while it spends 0.01% of its time in the same method when running sample "10-windows".

I produced the following stack trace multiple times.

Program received signal SIGSEGV, Segmentation fault.
__lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
95  in ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S
(gdb) where
#0  __lll_lock_wait_private () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:95
#1  0x00007ffff705395a in _L_lock_12779 () at malloc.c:5206
#2  0x00007ffff7051335 in __GI___libc_malloc (bytes=272) at malloc.c:2887
#3  0x00007ffff7b1c3c2 in SDL_AddEvent (event=0x7fffffffd130) at /build/buildd/libsdl2-2.0.2+dfsg1/src/events/SDL_events.c:205
#4  SDL_PeepEvents_REAL (events=<optimized out>, numevents=<optimized out>, action=<optimized out>, minType=0, maxType=<optimized out>) at /build/buildd/libsdl2-2.0.2+dfsg1/src/events/SDL_events.c:283
#5  0x00007ffff7b1c6a6 in SDL_PushEvent_REAL (event=event@entry=0x7fffffffd130) at /build/buildd/libsdl2-2.0.2+dfsg1/src/events/SDL_events.c:471
#6  0x00007ffff7b1ca3b in SDL_SendAppEvent (eventType=SDL_QUIT) at /build/buildd/libsdl2-2.0.2+dfsg1/src/events/SDL_events.c:629
#7  <signal handler called>
#8  malloc_consolidate (av=av@entry=0x7ffff738d760 <main_arena>) at malloc.c:4143
#9  0x00007ffff704e0fd in _int_free (av=0x7ffff738d760 <main_arena>, p=<optimized out>, have_lock=0) at malloc.c:4057
#10 0x0000000000433b78 in __gnu_cxx::new_allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > >::deallocate (this=0x7fffffffd9f0, __p=0x640740) at /usr/include/c++/4.9/ext/new_allocator.h:110
#11 0x000000000043359e in std::_Deque_base<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > > >::_M_deallocate_node (this=0x7fffffffd9f0, __p=0x640740) at /usr/include/c++/4.9/bits/stl_deque.h:544
#12 0x0000000000432bea in std::_Deque_base<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > > >::_M_destroy_nodes (this=0x7fffffffd9f0, __nstart=0x6400a8, __nfinish=0x6400b0) at /usr/include/c++/4.9/bits/stl_deque.h:647
#13 0x0000000000432203 in std::_Deque_base<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > > >::~_Deque_base (this=0x7fffffffd9f0, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/stl_deque.h:570
#14 0x00000000004317a1 in std::deque<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > > >::~deque (
this=0x7fffffffd9f0, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/stl_deque.h:923
#15 0x0000000000431522 in std::queue<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::deque<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> >, std::allocator<std::unique_ptr<nox::process::Process, std::default_delete<nox::process::Process> > > > >::~queue (this=0x7fffffffd9f0, __in_chrg=<optimized out>) at /usr/include/c++/4.9/bits/stl_queue.h:96
#16 0x0000000000430a47 in nox::process::Manager::updateProcesses (this=0x7fffffffdbd8, deltaTime=...) at /home/pimms/code/c++/samplenox/nox-engine/src/util/process/Manager.cpp:153
#17 0x00000000004292d7 in nox::app::Application::execute (this=0x7fffffffdb90) at /home/pimms/code/c++/samplenox/nox-engine/src/app/Application.cpp:188
#18 0x0000000000427819 in main (argc=1, argv=0x7fffffffdd58) at /home/pimms/code/c++/samplenox/03-updating/src/main.cpp:45

Files:

  • callgrind.03-updating - quick trace of 03-updating

  • callgrind.10-window - quick trace of 10-window

Environment:

  • Ubuntu 14.04

  • Gnome Terminal

  • NOX-Version 457ba1c

Reproduction:

Run 03-updating a couple of times, kill it using CTRL+C. To debug it properly under GDB, pass SIGINT through, but stop SIGSEGV by issuing the commands below. Kill it normally using CTRL+C under GDB - if it hangs, issue a SIGSEGV using "kill -s SIGSEGV <pid>" so you can analyze the stack further.

(gdb) handle SIGINT nostop 
(gdb) handle SIGSEGV stop nopass

Comments (14)

  1. Magnus Bjerke Vik

    I can confirm this bug.

    I don't know enough about malloc() to say this for sure, but it seems to me that this is a deadlock in malloc(). The SIGINT is received while deallocating (for me it also happened when allocating) the std::queue. The std::queue allocation calls malloc() which locks its lock, then SDL tries to push the SDL_QUIT event from the SIGINT causing malloc() to be called again which locks the lock a second time.

    I have no idea how to fix this. Are we handling SIGINT wrong? I thought SDL would handle it properly. Should do some research. Help appreciated.

  2. Magnus Bjerke Vik

    The reason that 10-window spends much less time in Manager::updateProcesses is that with a window it will vsync making it use most of its time waiting for the GPU. And because it spends much less time in Manager::updateProcesses, the probability for the bug is much lower.

  3. Joakim Stien reporter

    I would assume that using your own signal handler for SIGINT would be harmless - that sounds like a valid workaround to me.

    It may seem that malloc is misbehaving in some way, based on this article.

    ...when it comes time to free a block, the heap implementation simply attempts to acquire the relevant arena mutex.
    

    and:

    In malloc(), a test is made to see if the mutex for current target arena for the current thread is free (trylock). If so then the arena is now locked and the allocation proceeds. If the mutex is busy then each remaining arena is tried in turn and used if the mutex is not busy. In the event that no arena can be locked without blocking,  a fresh new arena is created.
    
  4. Joakim Stien reporter

    The following signals produce the locking:

    • SIGINT

    • SIGTERM

    I'll write a patch and send a PR.

    I have OS X available, but haven't set up NOX or any of it's dependencies, do you know if this is an issue under OS X as well?

  5. Joakim Stien reporter

    Apparently you create a pull request from the forked repository... ;) Created. No problem.

  6. Log in to comment