Race condition when two SIGHUPs occur back to back

Issue #928 resolved
Andy Spitzer created an issue

Two SIGHUPs (monit reload) in quick succession can hang the main thread.

On SIGHUP, the main thread sets the http/engine.c:stopped flag, waits for the http thread to exit, then reloads and starts a new http thread.

If another SIGHUP arrives before the new http thread reaches the point where she clears the stopped flag, the main thread sets the http/engine.c:stopped flag again and waits for the thread to exit. However, because the http thread is still starting up, it then CLEARS the stopped flag and never exits, leaving the main thread stuck waiting for her in pthread_join().
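The interleaving above can be replayed deterministically in a few lines of C. This is a hypothetical, single-threaded model, not monit's actual http/engine.c code: stopped, main_request_stop(), http_thread_startup() and the two replay functions are illustrative names, and each call stands for one step taken by the main or http thread in the order the report describes.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the shared flag; names are illustrative only. */
static bool stopped = false;

/* Main thread, on each SIGHUP: ask the http thread to stop. */
static void main_request_stop(void) { stopped = true; }

/* New http thread, at startup (the racy pattern): reset the flag. */
static void http_thread_startup(void) { stopped = false; }

/* 2nd SIGHUP arrives before the new http thread has started.
 * Returns whether the new thread would still see a stop request. */
static bool replay_back_to_back_sighup(void) {
    stopped = false;
    main_request_stop();   /* 1st SIGHUP: old thread sees the flag, exits */
    /* main joins the old thread, reloads, creates a new thread ... */
    main_request_stop();   /* 2nd SIGHUP lands before the new thread runs */
    http_thread_startup(); /* new thread finally starts, clears the flag */
    return stopped;        /* false: the stop request was lost */
}

/* 2nd SIGHUP arrives after the new http thread has started. */
static bool replay_late_sighup(void) {
    stopped = false;
    main_request_stop();   /* 1st SIGHUP */
    http_thread_startup(); /* new thread starts and clears the flag */
    main_request_stop();   /* 2nd SIGHUP */
    return stopped;        /* true: the thread exits, pthread_join() returns */
}
```

replay_back_to_back_sighup() returns false (the request is lost, so pthread_join() would wait forever), while replay_late_sighup() returns true, matching the two outcomes the report describes.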

If the 2nd SIGHUP comes after the http thread starts, it works fine.
If the 2nd SIGHUP comes before the http thread starts...hang.

This bash command often reproduces the issue (although, this being a race condition, it won't trigger every time!)

# for i in $(seq 1 1000); do monit reload; done

The symptom: the main monit thread is stuck in pthread_join(), yet the http thread keeps polling away every second.

# strace -fp 13474

strace: Process 13474 attached with 2 threads
[pid  2831] restart_syscall(<... resuming interrupted poll ...> <unfinished ...>
[pid 13474] futex(0x7f7263c789d0, FUTEX_WAIT, 2831, NULL <unfinished ...>
[pid  2831] <... restart_syscall resumed>) = 0
[pid  2831] poll([{fd=5, events=POLLIN}], 1, 1000) = 0 (Timeout)
[pid  2831] poll([{fd=5, events=POLLIN}], 1, 1000) = 0 (Timeout)

Here we see the main thread (pid 13474) stuck in futex() (i.e. pthread_join()), yet the http thread (pid 2831) is happily doing its thing and not stopping.

Attached is a patch against 5.27.0 that fixes the issue by having the main thread, rather than the http thread, clear the ‘stopped’ flag.
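A minimal pthreads sketch of the approach the patch takes, assuming a single http thread polling a shared flag: the main thread clears stopped before creating the thread, and the thread itself never writes it. The names (http_start(), http_stop(), reload(), http_thread_main()) are hypothetical, and the 1 ms sleep stands in for the real 1000 ms poll(); this is an illustration, not the attached patch.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <time.h>

/* Hypothetical names; not monit's actual http/engine.c code. */
static atomic_bool stopped = false;
static pthread_t http_thread;

static void *http_thread_main(void *arg) {
    (void)arg;
    /* No flag reset here: a stop requested before this thread was
     * scheduled stays visible, so the loop exits on its first check. */
    while (!atomic_load(&stopped)) {
        /* Stand-in for poll() with a timeout; sleeps 1 ms. */
        struct timespec ts = {0, 1000000};
        nanosleep(&ts, NULL);
    }
    return NULL;
}

static int http_start(void) {
    atomic_store(&stopped, false); /* cleared by the main thread only */
    return pthread_create(&http_thread, NULL, http_thread_main, NULL);
}

static int http_stop(void) {
    atomic_store(&stopped, true);
    return pthread_join(http_thread, NULL);
}

/* One SIGHUP: stop and join the old thread, reload, start a new one. */
static int reload(void) {
    int rc = http_stop();
    if (rc != 0)
        return rc;
    /* ... reload the configuration here ... */
    return http_start();
}
```

Because only the main thread ever resets the flag, a stop request that lands before the new thread is scheduled is still seen when the thread first checks it, so calling reload() back to back (like the monit reload loop above) no longer hangs in pthread_join().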

--Andy “Woof!” Spitzer
