9 changes: 7 additions & 2 deletions src/threadutils.cpp
@@ -860,8 +860,13 @@ class ThreadPool_c final : public Worker_i
 ScopedMutex_t dLock { m_dMutex };
 m_bStop = true;
 m_dWork.reset ();
-if ( sphIsDied() )
-	m_tService.stop();
+// Always signal the service to stop so every idle worker is woken and exits. This used to be
+// guarded by sphIsDied(): on a graceful shutdown the workers were only released if m_dWork.reset()
+// happened to drive m_iOutstandingWork to 0. If anything else still held outstanding work (e.g. a
+// galera service thread that outlived replication teardown), the count never reached 0, stop() was
+// never called, the idle workers stayed parked in Wait(), and the Join() below deadlocked the
+// daemon forever during `searchd --stopwait`.
+m_tService.stop();
 dLock.Unlock ();
 LOG ( DEBUG, TP ) << "stopping thread pool";
 LOGINFO ( TPLIFE, TP ) << "stopping thread pool";
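
The mechanism described in the code comment above can be reproduced in miniature. The sketch below is not Manticore's code: `MiniService` and `ShutdownCompletes` are hypothetical names, and the service is reduced to a counter plus a condition variable, on the assumption that the real service only stops on its own when its outstanding-work count reaches zero. It shows that when a second holder never releases its work, workers are only woken if shutdown calls `Stop()` unconditionally.

```cpp
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <future>
#include <mutex>
#include <thread>
#include <vector>

// Hypothetical stand-in for the pool's service; not the real implementation.
class MiniService
{
	std::mutex m_tLock;
	std::condition_variable m_tCond;
	int m_iOutstandingWork = 0;
	bool m_bStopped = false;

public:
	void WorkStarted()
	{
		std::lock_guard<std::mutex> tGuard ( m_tLock );
		++m_iOutstandingWork;
	}

	// Releasing work stops the service only when the count drops to zero.
	void WorkFinished()
	{
		std::lock_guard<std::mutex> tGuard ( m_tLock );
		assert ( m_iOutstandingWork>0 );
		if ( --m_iOutstandingWork==0 )
		{
			m_bStopped = true;
			m_tCond.notify_all();
		}
	}

	// Unconditional stop: wakes every parked worker regardless of the count.
	void Stop()
	{
		std::lock_guard<std::mutex> tGuard ( m_tLock );
		m_bStopped = true;
		m_tCond.notify_all();
	}

	// Worker body: park until the service is stopped.
	void Run()
	{
		std::unique_lock<std::mutex> tGuard ( m_tLock );
		m_tCond.wait ( tGuard, [this] { return m_bStopped; } );
	}
};

// Returns true if all workers could be joined within tTimeout.
bool ShutdownCompletes ( bool bAlwaysStop, std::chrono::milliseconds tTimeout )
{
	MiniService tService;
	tService.WorkStarted(); // the pool's own work guard (the role of m_dWork)
	tService.WorkStarted(); // a stray holder that never releases its work

	std::vector<std::thread> dWorkers;
	for ( int i = 0; i<2; ++i )
		dWorkers.emplace_back ( [&tService] { tService.Run(); } );

	// Shutdown sequence: drop the pool's guard, then optionally force a stop.
	tService.WorkFinished();
	if ( bAlwaysStop )
		tService.Stop();

	// Join on a helper thread so a hang can be observed instead of suffered.
	auto tJoined = std::async ( std::launch::async, [&dWorkers] {
		for ( auto & tWorker : dWorkers )
			tWorker.join();
	} );
	bool bJoined = tJoined.wait_for ( tTimeout )==std::future_status::ready;

	if ( !bJoined )
		tService.Stop(); // release the parked workers so the sketch can clean up
	tJoined.wait();
	return bJoined;
}
```

Calling `ShutdownCompletes ( true, ... )` models the patched path and `ShutdownCompletes ( false, ... )` models the old graceful-shutdown path with a leaked work holder. The join runs on a helper thread only so the hang can be detected with a timeout.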
70 changes: 70 additions & 0 deletions test/clt-tests/sharding/rollback/daemon-shutdown-deadlock.rec
@@ -0,0 +1,70 @@
Daemon shutdown deadlock: graceful `searchd --stopwait` of a node hangs in ThreadPool_c::StopAll() during a sharded-table rebalance on rejoin. Pure daemon bug (no Buddy involved). Expected: graceful stop returns 0; buggy daemon never returns (timeout -> exit 124).

––– comment –––
Start 3-node cluster
––– input –––
export INSTANCE=1
––– output –––
––– block: ../../base/replication/start-searchd-precach –––
––– input –––
export INSTANCE=2
––– output –––
––– block: ../../base/replication/start-searchd-precach –––
––– input –––
export INSTANCE=3
––– output –––
––– block: ../../base/replication/start-searchd-precach –––
––– input –––
export CLUSTER_NAME=c TABLE_NAME=t
––– output –––
––– block: ../../base/replication/create-cluster –––
––– block: ../../base/replication/join-cluster-on-all-nodes –––
––– comment –––
Create sharded table with RF=2 (shards spread across nodes; node1 holds both shards)
––– input –––
mysql -h0 -P1306 -e "CREATE TABLE ${CLUSTER_NAME}:${TABLE_NAME} (id bigint, account string, amount float, ts int) shards='2' rf='2'"; echo $?
––– output –––
0
––– input –––
mysql -h0 -P1306 -e "INSERT INTO ${TABLE_NAME} (id, account, amount, ts) VALUES (1, 'ACC001', 100.50, 1000), (2, 'ACC002', 200.75, 2000), (3, 'ACC003', 150.25, 3000), (4, 'ACC001', 300.00, 4000), (5, 'ACC002', 250.50, 5000)"; echo $?
––– output –––
0
––– comment –––
First failure: kill node 1 gracefully (this stop works), insert during outage, restart node 1
––– input –––
export INSTANCE=1; stdbuf -oL searchd --stopwait -c test/clt-tests/base/searchd-with-flexible-ports.conf > /dev/null; echo "Node 1 killed"
––– output –––
Node 1 killed
––– input –––
timeout 30 bash -c 'while lsof -i :${INSTANCE}306 &>/dev/null; do sleep 1; done'
––– output –––
––– input –––
timeout 10 grep -qm1 'becoming master' <(tail -n 1000 -f /var/log/manticore-{2,3}/searchd.log 2>/dev/null); echo $?
––– output –––
0
––– input –––
mysql -h0 -P2306 -e "INSERT INTO ${TABLE_NAME} (id, account, amount, ts) VALUES (6, 'ACC003', 175.00, 6000), (7, 'ACC001', 225.75, 7000)"; echo $?
––– output –––
0
––– input –––
export INSTANCE=1
––– output –––
––– block: ../../base/replication/start-searchd-precach –––
––– comment –––
Wait for node 1 to rejoin and cluster to reach primary on all nodes
––– input –––
timeout 60 bash -c 'until ( for i in 1 2 3; do mysql -h0 -P${i}306 -e "SHOW STATUS LIKE '"'"'cluster_c_status'"'"'\G" 2>/dev/null | grep -q "Value: primary" || exit 1; done ); do sleep 1; done'; echo "rejoined=$?"
––– output –––
rejoined=0
––– comment –––
Second failure: stop node 2 GRACEFULLY during the sharded rebalance. This is where the daemon deadlocks: searchd --stopwait enters Shutdown() -> ThreadPool_c::StopAll() and blocks forever joining a worker thread that was never signalled to exit (a galera ServiceThd outlives provider destruction). Guarded with timeout 40 so the test fails fast instead of hanging. A healthy daemon returns exit 0; the buggy daemon yields exit 124 (timed out = deadlocked).
––– input –––
export INSTANCE=2; timeout 40 searchd --stopwait -c test/clt-tests/base/searchd-with-flexible-ports.conf >/dev/null 2>&1; echo "node2_stopwait_exit=$?"
––– output –––
node2_stopwait_exit=0
––– comment –––
Cleanup (force-kill any deadlocked node so the container exits cleanly)
––– input –––
pkill -9 searchd 2>/dev/null; sleep 1; echo done
––– output –––
done