Skip to content

Commit eeb7dde

Browse files
committed
[#31029] YSQL: Avoid ExternalMiniCluster master/tserver startup race in initdb
ExternalMiniCluster exhibits the same master/tserver startup race as MiniYBCluster (see dcd8d1d for the underlying bug). The C++ test harness needs two pieces of wiring: * In StartMaster(), set --TEST_master_min_live_tservers_before_initdb=num_tablet_servers next to --master_auto_run_initdb so the master waits for tservers before launching initdb. Mirrors the Java-side wiring in MiniYBCluster. * In Start(), move the WaitForInitDb() call out of StartMasters() and to the point right after tservers have been launched. The previous ordering -- WaitForInitDb() synchronously inside StartMasters() -- would deadlock with the new flag, because tservers are only launched after StartMasters() returns. ExternalMiniCluster::StartMasters() has no other callers in the codebase, so this deferral is safe. This fix unmasks the ASAN LSan leak fix (45086aa) for the PgObjectLocksTest family in src/yb/yql/pgwrapper/pg_object_locks-test.cc. Without it, the LibPqTestBase-based tests in that family FAIL during cluster bootstrap under ASAN, hiding the LSan fix's effect. Test plan: * asan-clang21 local, repeated, with --skip-cxx-build --skip-java-build -n N --tp 1: -> 5/5 PASSED for PgObjectLocksTest.VerifyLockTimeout (ticket #30090). -> 3/3 PASSED each for the four sibling tests Sergei flagged in #30090: PgObjectLocksTestMixModeDuringPromotion.TestMixModeDuringPromotion PgObjectLocksTestMixModeDuringPromotion.TestLocksTakenAfterPromotionWithExistingConnections PgObjectLocksTestRF1.TestShutdownWithWaiters PgObjectLocksTestRF1.TestDisableReuseAbortedPlainTxn Without this change, the three LibPqTestBase-based tests fail during cluster bootstrap. The two PgMiniTestBase-based RF1 tests pass with or without this change (in-process MiniCluster does not have the race). * Production code path is gated behind the TEST flag defaulting to 0, so no observable change for any non-ExternalMiniCluster caller (and StartMasters() has no callers outside ExternalMiniCluster::Start()).
1 parent 9461ec3 commit eeb7dde

1 file changed

Lines changed: 21 additions & 3 deletions

File tree

src/yb/integration-tests/external_mini_cluster.cc

Lines changed: 21 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -406,6 +406,13 @@ Status ExternalMiniCluster::Start(rpc::Messenger* messenger) {
406406
} else {
407407
LOG(INFO) << "No need to start tablet servers";
408408
}
409+
// WaitForInitDb is called here, after tservers have launched, rather than inside
410+
// StartMasters(). The master gates initdb on the TEST_master_min_live_tservers_before_initdb
411+
// flag set in StartMaster(), so initdb only completes after the tservers above register.
412+
// StartMasters() is only called from this method, so the deferral is safe.
413+
if (opts_.enable_ysql) {
414+
RETURN_NOT_OK(WaitForInitDb());
415+
}
409416
if (opts_.enable_ysql && opts_.wait_for_tservers_to_accept_ysql_connections) {
410417
RETURN_NOT_OK(WaitForTabletServersToAcceptYSQLConnection(
411418
MonoTime::Now() + kTabletServerRegistrationTimeout));
@@ -563,6 +570,16 @@ Result<ExternalMasterPtr> ExternalMiniCluster::StartMaster(
563570
if (opts_.enable_ysql) {
564571
flags.push_back("--enable_ysql=true");
565572
flags.push_back("--master_auto_run_initdb");
573+
// Avoid GitHub #31029 startup race: initdb's postgres bootstrap calls CreateTable on the
574+
// master (to create the global transaction status table) and fails with "Not enough live
575+
// tablet servers ..." if no tservers have registered yet. Have the master wait for the
576+
// tservers we are about to start before launching initdb. Mirrors the Java-side fix in
577+
// MiniYBCluster.java. num_tablet_servers is the right ceiling because (a) we are about
578+
// to start exactly that many tservers immediately after, and (b) the master's RF
579+
// requirement is min(num_tablet_servers, FLAGS_replication_factor), so any test that
580+
// survives the bootstrap eventually needs all of them anyway.
581+
flags.push_back(Format(
582+
"--TEST_master_min_live_tservers_before_initdb=$0", opts_.num_tablet_servers));
566583
if (opts_.enable_ysql_auth) {
567584
flags.push_back("--ysql_enable_auth=true");
568585
}
@@ -1281,9 +1298,10 @@ Status ExternalMiniCluster::StartMasters() {
12811298
masters_.push_back(master);
12821299
}
12831300

1284-
if (opts_.enable_ysql) {
1285-
RETURN_NOT_OK(WaitForInitDb());
1286-
}
1301+
// Note: WaitForInitDb() is intentionally not called here. It is called from Start() after
1302+
// tservers have launched, so that the master's wait on
1303+
// TEST_master_min_live_tservers_before_initdb does not deadlock against tservers that
1304+
// haven't been started yet. See the comment block in Start() for details.
12871305

12881306
// Trigger an election to avoid an unnecessary 3s wait on every cluster startup.
12891307
if (!masters_.empty()) {

0 commit comments

Comments
 (0)