Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete #1796
Conversation
Pull request overview
This PR fixes a race in multi-database background checkpointing where a general `BGSAVE` could release an individual DB's checkpoint lock as soon as that DB finished, allowing a subsequent `BGSAVE <dbId>` to sometimes succeed mid-flight and cause `MultiDatabaseSaveInProgressTest` to flake (notably in fast CI Release runs). The change centralizes ownership of per-DB lock resumption so locks remain held until the full general save completes.
Changes:
- Moved per-DB checkpoint lock resumption responsibility out of `RunPausedCheckpointAsync` and into its callers.
- Updated the general/per-DB `BGSAVE` helper to resume all pre-paused DB checkpoint locks only after `Task.WhenAll` completes.
- Adjusted the AOF-size-driven checkpoint path to resume the paused DB lock in its outer `finally`.
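The centralized resume pattern these changes describe can be sketched as follows. This is a hypothetical Python analogue (Garnet's actual code is C# in `MultiDatabaseManager.cs`); `bgsave_all`, `checkpoint_db`, and the boolean lock dictionary are illustrative stand-ins, not the real API:

```python
import asyncio

# Hypothetical Python analogue of the fix (the real code is C#): the per-DB
# checkpoint task no longer resumes its own lock; the caller resumes every
# pre-paused lock only after ALL checkpoints have completed.

async def checkpoint_db(db_id: int, log: list) -> None:
    """Checkpoint one DB. Deliberately does NOT resume its lock."""
    await asyncio.sleep(0.01 * db_id)          # simulate varying durations
    log.append(f"db{db_id} checkpoint done")

async def bgsave_all(locks: dict, log: list) -> None:
    """General BGSAVE: pause every lock up front, resume only after the
    gather (Task.WhenAll analogue) completes, never per-DB."""
    paused = list(locks)
    for db_id in paused:
        locks[db_id] = True                    # TryPauseCheckpoints analogue
    try:
        await asyncio.gather(*(checkpoint_db(d, log) for d in paused))
    finally:
        for db_id in paused:                   # ResumeCheckpoints analogue
            locks[db_id] = False
            log.append(f"db{db_id} lock resumed")

locks = {0: False, 1: False}
log: list = []
asyncio.run(bgsave_all(locks, log))
```

A mid-flight `BGSAVE <dbId>` corresponds to reading `locks[dbId]` between the pause and the final resume: it always observes `True`, even after that DB's own checkpoint has finished.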
Force-pushed 06fcc46 to 7a46983.
Addressed in 7a46983: reworded the doc comment so the parameter range reads as plain prose instead of mixing …
Force-pushed 7a46983 to da825f2.
Pushed e2d43a3 to address the residual flake on https://github.com/microsoft/garnet/actions/runs/25773997865. The previous server-side fix correctly held per-DB checkpoint locks until ALL per-DB checkpoints in a general save completed. But for ~1MB of small in-memory data, the actual checkpoint takes microseconds to milliseconds, which is comparable to (and sometimes shorter than) the client→server roundtrip between issuing the general `BGSAVE` and the follow-up `BGSAVE 0`. This commit replaces those two sequential calls with a single pipelined send, so both commands reach the server in the same network packet.
15/15 runs pass locally in Release. The full MultiDatabaseTests suite (31/31) still passes.
…nts complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake in CI Release builds. The general BGSAVE path synchronously paused all per-DB checkpoint locks before returning 'Background saving started', but the per-DB checkpoint helper released each per-DB lock as soon as that single DB's checkpoint completed, not when the entire general save finished. If DB 0's checkpoint completed before the test's 'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with 'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed. See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE) resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. Per-DB locks are therefore held until ALL per-DB checkpoints complete, not just each individual one, and a per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync that performs only TakeCheckpointAsync + UpdateLastSaveData, without resuming.
- checkpointTasks[] is pre-filled with Task.CompletedTask so the catch path can safely double-await even if the synchronous task-creation loop throws partway through. The double-await ensures we never resume a per-DB lock while its checkpoint is still running.
- The handedOffCount partial-resume bookkeeping is removed; it is no longer needed.
- The previously shared RunPausedCheckpointAsync helper is removed: its only other caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/update/finally/resume sequence so its single-DB pause-resume lifecycle is visible in one place.
Net effect:
- General BGSAVE: per-DB locks are held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync with pausedCount=1): unchanged; that single per-DB lock is still released exactly when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on a different DB, per-DB then general, general blocks general) are preserved.

Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest + MultiDatabaseGeneralSaveBlocksGeneralSaveTest; the full MultiDatabaseTests suite (31/31) passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
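The "pre-fill with Task.CompletedTask, then double-await in the catch path" defense described above can be sketched in Python. This is a hypothetical analogue, not Garnet's C# code; `run_all` and `checkpoint` are illustrative names:

```python
import asyncio

# Hypothetical analogue of the defensive pattern: pre-fill every slot with an
# already-completed future (Task.CompletedTask analogue) so the catch path can
# await all slots even if task creation threw partway through.

async def run_all(n: int, make_task) -> list:
    loop = asyncio.get_running_loop()
    tasks = []
    for _ in range(n):
        done = loop.create_future()
        done.set_result(None)              # already-completed placeholder
        tasks.append(done)
    try:
        for i in range(n):                 # creation loop may raise partway
            tasks[i] = asyncio.ensure_future(make_task(i))
        return await asyncio.gather(*tasks)
    except Exception:
        # "Double-await": wait on every slot so no checkpoint is still running
        # when the caller resumes its per-DB locks; the placeholders make the
        # never-started slots safe to await.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise

async def checkpoint(i: int) -> int:
    await asyncio.sleep(0)                 # stand-in for the checkpoint work
    return i

print(asyncio.run(run_all(3, checkpoint)))  # prints [0, 1, 2]
```

If `make_task(2)` raises synchronously, slots 0..1 hold real tasks and slot 2 still holds a completed placeholder, so the catch path can await the whole array before re-raising.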
…inate roundtrip race
Even with the previous server-side fix that holds all per-DB checkpoint locks until
the entire general save completes, the test still flaked in CI Release builds because
the actual checkpoint of ~1MB of in-memory data completes in microseconds-to-milliseconds.
That window is comparable to (and sometimes shorter than) the client→server roundtrip
between issuing the general BGSAVE and the follow-up BGSAVE 0:
Failed Garnet.test.MultiDatabaseTests.MultiDatabaseSaveInProgressTest [1 s]
Error Message:
ERR checkpoint already in progress
Assert.That(caughtException, expression)
Expected: <StackExchange.Redis.RedisServerException>
But was: null
(see https://github.com/microsoft/garnet/actions/runs/25773997865)
Replace the two sequential SE.Redis Execute calls with a single LightClient pipelined
send of 'BGSAVE\r\nBGSAVE 0\r\n'. Both commands arrive at the server in the same
network packet, so the server processes BGSAVE 0 immediately after the general BGSAVE's
synchronous setup completes - while DB 0's per-DB checkpoint lock is still held. This
makes the assertion deterministic regardless of how fast the actual checkpoint runs.
CountResponseType.Bytes is used because RESP token-counting in LightClient only treats
'-' as an error marker at position 0, so a pipelined response containing two RESP
tokens '+...\r\n-...\r\n' would never trigger CompletePendingRequests under the
default Tokens mode.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
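The counting-mode distinction above can be illustrated with a hypothetical re-implementation. This is illustrative Python, not LightClient's actual parser; `count_tokens` and `count_bytes` only model the behavior the commit message describes:

```python
# Hypothetical model of the two completion-counting strategies described
# above (not LightClient's actual code), applied to a pipelined reply whose
# second token is an error line.

def count_tokens(buf: bytes) -> int:
    """Tokens mode as described: '-' is only recognized as an error marker at
    position 0, so the parser stalls on the second token of a pipelined
    '+.../-...' reply and the expected count of 2 is never reached."""
    count, pos = 0, 0
    while pos < len(buf):
        first = buf[pos:pos + 1]
        if first == b"+" or (first == b"-" and pos == 0):
            end = buf.find(b"\r\n", pos)
            if end < 0:
                break                 # partial token: wait for more bytes
            count += 1
            pos = end + 2
        else:
            break                     # unrecognized token type: stall
    return count

def count_bytes(buf: bytes) -> int:
    """Bytes mode: completion only asks whether the expected number of bytes
    has arrived, regardless of where a '-' marker sits in the stream."""
    return len(buf)

reply = b"+Background saving started\r\n-ERR checkpoint already in progress\r\n"
print(count_tokens(reply), count_bytes(reply))
```

Under the token model the count sticks at 1, so a pending request expecting 2 tokens never completes; the byte count always reaches the expected total.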
This test has been timing out in CI. Set an explicit 180s cancellation timeout so the shared ClusterTestContext.cts is configured accordingly and polling loops (BackOff(cts.Token)) can exit cleanly instead of hanging until the test runner kills the process. Matches the convention applied in PR #1767 for MultipleReplicasWithVectorSetsAndDeletesAsync. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
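A minimal sketch of that timeout convention, assuming a hand-rolled cancellation source (illustrative Python; `CancellationSource` and `back_off` are not the actual ClusterTestContext API):

```python
import time
from threading import Event

# Hypothetical sketch of the convention described above: give the shared
# cancellation source an explicit deadline so polling back-off loops exit
# cleanly instead of spinning until the test runner kills the process.

class CancellationSource:
    def __init__(self, timeout_s: float):
        self._cancelled = Event()
        self._deadline = time.monotonic() + timeout_s  # e.g. 180s in the test

    def cancel(self) -> None:
        self._cancelled.set()

    def is_cancelled(self) -> bool:
        return self._cancelled.is_set() or time.monotonic() >= self._deadline

def back_off(token: CancellationSource, condition, poll_s: float = 0.01) -> bool:
    """Poll until condition() holds, or return False once the token's
    deadline passes (the clean-exit path the commit adds)."""
    while not condition():
        if token.is_cancelled():
            return False
        time.sleep(poll_s)
    return True
```

Without the deadline, a `condition` that never becomes true would leave the loop spinning forever; with it, the loop observes cancellation and returns.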
Force-pushed caf0ab0 to 68db976.
Summary
Fixes a residual race in #1767 that caused `MultiDatabaseTests.MultiDatabaseSaveInProgressTest` to flake in CI Release builds (e.g. https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662).
Root cause
#1767 made the general `BGSAVE` synchronously pause all per-DB checkpoint locks (`TryPauseCheckpoints(id)`) before returning `Background saving started`, so a subsequent `BGSAVE <dbId>` would observe the in-progress checkpoint and fail. But `RunPausedCheckpointAsync` released each per-DB lock in `finally` as soon as that single DB's checkpoint completed, not when the entire general save finished.
In the failing test: if DB 0's checkpoint completed before `BGSAVE 0` arrived over the wire, `BGSAVE 0` succeeded and the assertion failed. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed.
Fix
In libs/server/Databases/MultiDatabaseManager.cs:
- `RunPausedCheckpointAsync`: removed the `ResumeCheckpoints(dbId)` call from its `finally` block; the caller now owns the resume.
- `RunPausedCheckpointsAndReleaseLocksAsync` (used by both general and per-DB BGSAVE): resumes all pre-paused DBs in its outer `finally`, after `Task.WhenAll`. Pre-fills `checkpointTasks[]` with `Task.CompletedTask` and double-awaits in the catch block, so a synchronous task-creation throw cannot leave a per-DB checkpoint running while its lock is being resumed. The `handedOffCount` partial-resume logic is removed; it is no longer needed since the helper no longer self-resumes.
- `TaskCheckpointBasedOnAofSizeLimitAsync` (the only other caller of `RunPausedCheckpointAsync`, used by AOF-size-driven checkpoints): hoists `pausedDbId` to outer scope and calls `ResumeCheckpoints(pausedDbId)` in its outer `finally`.
Net effect
- General BGSAVE: any per-DB BGSAVE issued mid-flight reliably fails with `checkpoint already in progress`. ✓
- Per-DB BGSAVE alone (`pausedCount=1`): unchanged; that single lock is still released exactly when that single checkpoint completes.
- General blocks general (via `multiDbCheckpointingLock`): preserved.
Verification
- 10/10 runs of `MultiDatabaseSaveInProgressTest` pass in Release config locally.
- Full `MultiDatabaseTests` suite (31/31) passes locally.
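The inlined single-DB pause/resume lifecycle in `TaskCheckpointBasedOnAofSizeLimitAsync` can be sketched as follows (a hypothetical Python analogue; names mirror the description above, not the actual C# source):

```python
import asyncio

# Hypothetical sketch of the single-DB lifecycle now inlined in the
# AOF-size-driven path: pause one DB's lock, checkpoint, update save
# metadata, and resume that same lock in the outer finally.

async def aof_size_checkpoint(db: dict, locks: dict, fail: bool = False) -> None:
    paused_db_id = db["id"]              # hoisted to outer scope
    locks[paused_db_id] = True           # TryPauseCheckpoints analogue
    try:
        if fail:
            raise IOError("checkpoint failed")
        await asyncio.sleep(0)           # stand-in for TakeCheckpointAsync
        db["last_save"] = "updated"      # stand-in for UpdateLastSaveData
    finally:
        locks[paused_db_id] = False      # ResumeCheckpoints in the outer finally

db, locks = {"id": 0}, {}
asyncio.run(aof_size_checkpoint(db, locks))
```

The `finally` guarantees the lock is resumed even when the checkpoint itself throws, which is the single-DB behavior the refactor preserves.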