Hold per-DB checkpoint locks until all general-BGSAVE per-DB checkpoints complete #1796
Conversation
Pull request overview
This PR fixes a race in multi-database background checkpointing where a general `BGSAVE` could release an individual DB's checkpoint lock as soon as that DB finished, allowing a subsequent `BGSAVE <dbId>` to sometimes succeed mid-flight and cause `MultiDatabaseSaveInProgressTest` to flake (notably in fast CI Release runs). The change centralizes ownership of per-DB lock resumption so locks remain held until the full general save completes.
Changes:
- Moved per-DB checkpoint lock resumption responsibility out of `RunPausedCheckpointAsync` and into its callers.
- Updated the general/per-DB `BGSAVE` helper to resume all pre-paused DB checkpoint locks only after `Task.WhenAll` completes.
- Adjusted the AOF-size-driven checkpoint path to resume the paused DB lock in its outer `finally`.
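The centralized resume pattern these changes describe can be sketched as follows. This is a hypothetical Python analogue (Garnet's actual code is C# in `MultiDatabaseManager.cs`); `bgsave_all`, `checkpoint_db`, and the boolean lock dictionary are illustrative stand-ins, not the real API:

```python
import asyncio

# Hypothetical Python analogue of the fix (the real code is C#): the per-DB
# checkpoint task no longer resumes its own lock; the caller resumes every
# pre-paused lock only after ALL checkpoints have completed.

async def checkpoint_db(db_id: int, log: list) -> None:
    """Checkpoint one DB. Deliberately does NOT resume its lock."""
    await asyncio.sleep(0.01 * db_id)          # simulate varying durations
    log.append(f"db{db_id} checkpoint done")

async def bgsave_all(locks: dict, log: list) -> None:
    """General BGSAVE: pause every lock up front, resume only after the
    gather (Task.WhenAll analogue) completes, never per-DB."""
    paused = list(locks)
    for db_id in paused:
        locks[db_id] = True                    # TryPauseCheckpoints analogue
    try:
        await asyncio.gather(*(checkpoint_db(d, log) for d in paused))
    finally:
        for db_id in paused:                   # ResumeCheckpoints analogue
            locks[db_id] = False
            log.append(f"db{db_id} lock resumed")

locks = {0: False, 1: False}
log: list = []
asyncio.run(bgsave_all(locks, log))
```

A mid-flight `BGSAVE <dbId>` corresponds to reading `locks[dbId]` between the pause and the final resume: it always observes `True`, even after that DB's own checkpoint has finished.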
Force-pushed 06fcc46 to 7a46983.
Addressed in 7a46983: reworded the doc comment so the parameter range reads as plain prose instead of mixing …
Force-pushed 7a46983 to da825f2.
Pushed e2d43a3 to address the residual flake on https://github.com/microsoft/garnet/actions/runs/25773997865. The previous server-side fix correctly held per-DB checkpoint locks until ALL per-DB checkpoints in a general save completed. But for ~1MB of small in-memory data, the actual checkpoint takes microseconds to milliseconds, which is comparable to (and sometimes shorter than) the client→server roundtrip between issuing the general `BGSAVE` and the follow-up `BGSAVE 0`. This commit replaces those two sequential calls with a single pipelined send, so both commands reach the server in the same network packet.
15/15 runs pass locally in Release. The full MultiDatabaseTests suite (31/31) still passes.
…nts complete

Fixes a residual race in PR #1767 that caused MultiDatabaseSaveInProgressTest to flake in CI Release builds. The general BGSAVE path synchronously paused all per-DB checkpoint locks before returning 'Background saving started', but the per-DB checkpoint helper released each per-DB lock as soon as that single DB's checkpoint completed, not when the entire general save finished. If DB 0's checkpoint completed before the test's 'BGSAVE 0' arrived over the wire, BGSAVE 0 would succeed instead of failing with 'ERR checkpoint already in progress'. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed. See https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662.

Fix:
- RunPausedCheckpointsAndReleaseLocksAsync (used by both general and per-DB BGSAVE) resumes ALL pre-paused DBs in its outer finally, after Task.WhenAll. Per-DB locks are therefore held until ALL per-DB checkpoints complete, not just each individual one, and a per-DB BGSAVE issued mid-flight reliably observes the in-progress checkpoint.
- The per-DB checkpoint inner work is now a local async function TakeOneCheckpointAsync that performs only TakeCheckpointAsync + UpdateLastSaveData, without resuming.
- checkpointTasks[] is pre-filled with Task.CompletedTask so the catch path can safely double-await even if the synchronous task-creation loop throws partway through. The double-await ensures we never resume a per-DB lock while its checkpoint is still running.
- The handedOffCount partial-resume bookkeeping is removed; it is no longer needed.
- The previously shared RunPausedCheckpointAsync helper is removed: its only other caller (TaskCheckpointBasedOnAofSizeLimitAsync) now inlines the same try/checkpoint/update/finally/resume sequence so its single-DB pause-resume lifecycle is visible in one place.
Net effect:
- General BGSAVE: per-DB locks are held until ALL per-DB checkpoints complete, so any per-DB BGSAVE issued mid-flight reliably fails with 'checkpoint already in progress'.
- Per-DB BGSAVE alone (single-DB path through RunPausedCheckpointsAndReleaseLocksAsync with pausedCount=1): unchanged; that single per-DB lock is still released exactly when that single checkpoint completes.
- AOF-size-driven checkpoint: behaviorally unchanged (lock cleanup inlined).
- Other legal scenarios (per-DB then per-DB on a different DB, per-DB then general, general blocks general) are preserved.

Verification: 10/10 runs in Release config of MultiDatabaseSaveInProgressTest + MultiDatabaseGeneralSaveBlocksGeneralSaveTest; the full MultiDatabaseTests suite (31/31) passes locally.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
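The "pre-fill with Task.CompletedTask, then double-await in the catch path" defense described above can be sketched in Python. This is a hypothetical analogue, not Garnet's C# code; `run_all` and `checkpoint` are illustrative names:

```python
import asyncio

# Hypothetical analogue of the defensive pattern: pre-fill every slot with an
# already-completed future (Task.CompletedTask analogue) so the catch path can
# await all slots even if task creation threw partway through.

async def run_all(n: int, make_task) -> list:
    loop = asyncio.get_running_loop()
    tasks = []
    for _ in range(n):
        done = loop.create_future()
        done.set_result(None)              # already-completed placeholder
        tasks.append(done)
    try:
        for i in range(n):                 # creation loop may raise partway
            tasks[i] = asyncio.ensure_future(make_task(i))
        return await asyncio.gather(*tasks)
    except Exception:
        # "Double-await": wait on every slot so no checkpoint is still running
        # when the caller resumes its per-DB locks; the placeholders make the
        # never-started slots safe to await.
        await asyncio.gather(*tasks, return_exceptions=True)
        raise

async def checkpoint(i: int) -> int:
    await asyncio.sleep(0)                 # stand-in for the checkpoint work
    return i

print(asyncio.run(run_all(3, checkpoint)))  # prints [0, 1, 2]
```

If `make_task(2)` raises synchronously, slots 0..1 hold real tasks and slot 2 still holds a completed placeholder, so the catch path can await the whole array before re-raising.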
…inate roundtrip race
Even with the previous server-side fix that holds all per-DB checkpoint locks until
the entire general save completes, the test still flaked in CI Release builds because
the actual checkpoint of ~1MB of in-memory data completes in microseconds-to-milliseconds.
That window is comparable to (and sometimes shorter than) the client→server roundtrip
between issuing the general BGSAVE and the follow-up BGSAVE 0:
Failed Garnet.test.MultiDatabaseTests.MultiDatabaseSaveInProgressTest [1 s]
Error Message:
ERR checkpoint already in progress
Assert.That(caughtException, expression)
Expected: <StackExchange.Redis.RedisServerException>
But was: null
(see https://github.com/microsoft/garnet/actions/runs/25773997865)
Replace the two sequential SE.Redis Execute calls with a single LightClient pipelined
send of 'BGSAVE\r\nBGSAVE 0\r\n'. Both commands arrive at the server in the same
network packet, so the server processes BGSAVE 0 immediately after the general BGSAVE's
synchronous setup completes - while DB 0's per-DB checkpoint lock is still held. This
makes the assertion deterministic regardless of how fast the actual checkpoint runs.
CountResponseType.Bytes is used because RESP token-counting in LightClient only treats
'-' as an error marker at position 0, so a pipelined response containing two RESP
tokens '+...\r\n-...\r\n' would never trigger CompletePendingRequests under the
default Tokens mode.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
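The counting-mode distinction above can be illustrated with a hypothetical re-implementation. This is illustrative Python, not LightClient's actual parser; `count_tokens` and `count_bytes` only model the behavior the commit message describes:

```python
# Hypothetical model of the two completion-counting strategies described
# above (not LightClient's actual code), applied to a pipelined reply whose
# second token is an error line.

def count_tokens(buf: bytes) -> int:
    """Tokens mode as described: '-' is only recognized as an error marker at
    position 0, so the parser stalls on the second token of a pipelined
    '+.../-...' reply and the expected count of 2 is never reached."""
    count, pos = 0, 0
    while pos < len(buf):
        first = buf[pos:pos + 1]
        if first == b"+" or (first == b"-" and pos == 0):
            end = buf.find(b"\r\n", pos)
            if end < 0:
                break                 # partial token: wait for more bytes
            count += 1
            pos = end + 2
        else:
            break                     # unrecognized token type: stall
    return count

def count_bytes(buf: bytes) -> int:
    """Bytes mode: completion only asks whether the expected number of bytes
    has arrived, regardless of where a '-' marker sits in the stream."""
    return len(buf)

reply = b"+Background saving started\r\n-ERR checkpoint already in progress\r\n"
print(count_tokens(reply), count_bytes(reply))
```

Under the token model the count sticks at 1, so a pending request expecting 2 tokens never completes; the byte count always reaches the expected total.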
This test has been timing out in CI. Set an explicit 180s cancellation timeout so the shared ClusterTestContext.cts is configured accordingly and polling loops (BackOff(cts.Token)) can exit cleanly instead of hanging until the test runner kills the process. Matches the convention applied in PR #1767 for MultipleReplicasWithVectorSetsAndDeletesAsync. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
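A minimal sketch of that timeout convention, assuming a hand-rolled cancellation source (illustrative Python; `CancellationSource` and `back_off` are not the actual ClusterTestContext API):

```python
import time
from threading import Event

# Hypothetical sketch of the convention described above: give the shared
# cancellation source an explicit deadline so polling back-off loops exit
# cleanly instead of spinning until the test runner kills the process.

class CancellationSource:
    def __init__(self, timeout_s: float):
        self._cancelled = Event()
        self._deadline = time.monotonic() + timeout_s  # e.g. 180s in the test

    def cancel(self) -> None:
        self._cancelled.set()

    def is_cancelled(self) -> bool:
        return self._cancelled.is_set() or time.monotonic() >= self._deadline

def back_off(token: CancellationSource, condition, poll_s: float = 0.01) -> bool:
    """Poll until condition() holds, or return False once the token's
    deadline passes (the clean-exit path the commit adds)."""
    while not condition():
        if token.is_cancelled():
            return False
        time.sleep(poll_s)
    return True
```

Without the deadline, a `condition` that never becomes true would leave the loop spinning forever; with it, the loop observes cancellation and returns.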
Force-pushed caf0ab0 to 68db976.
Summary
Fixes a residual race in #1767 that caused `MultiDatabaseTests.MultiDatabaseSaveInProgressTest` to flake in CI Release builds (e.g. https://github.com/microsoft/garnet/actions/runs/25757540604/job/75650328662).
Root cause
#1767 made the general `BGSAVE` synchronously pause all per-DB checkpoint locks (`TryPauseCheckpoints(id)`) before returning `Background saving started`, so a subsequent `BGSAVE <dbId>` would observe the in-progress checkpoint and fail. But `RunPausedCheckpointAsync` released each per-DB lock in `finally` as soon as that single DB's checkpoint completed, not when the entire general save finished.
In the failing test: if DB 0's checkpoint completed before `BGSAVE 0` arrived over the wire, `BGSAVE 0` succeeded and the assertion failed. Locally the test takes 6-7s and the race never loses; in CI Release it ran in 1s and reliably failed.
Fix
In libs/server/Databases/MultiDatabaseManager.cs:
- `RunPausedCheckpointAsync`: removed the `ResumeCheckpoints(dbId)` call from its `finally` block; the caller now owns the resume.
- `RunPausedCheckpointsAndReleaseLocksAsync` (used by both general and per-DB BGSAVE): resumes all pre-paused DBs in its outer `finally`, after `Task.WhenAll`. Pre-fills `checkpointTasks[]` with `Task.CompletedTask` and double-awaits in the catch block, so a synchronous task-creation throw cannot leave a per-DB checkpoint running while its lock is being resumed. The `handedOffCount` partial-resume logic is removed; it is no longer needed since the helper no longer self-resumes.
- `TaskCheckpointBasedOnAofSizeLimitAsync` (the only other caller of `RunPausedCheckpointAsync`, used by AOF-size-driven checkpoints): hoists `pausedDbId` to outer scope and calls `ResumeCheckpoints(pausedDbId)` in its outer `finally`.
Net effect
- General BGSAVE: any per-DB BGSAVE issued mid-flight reliably fails with `checkpoint already in progress`. ✓
- Per-DB BGSAVE alone (`pausedCount=1`): unchanged; that single lock is still released exactly when that single checkpoint completes.
- General blocks general (via `multiDbCheckpointingLock`): preserved.
Verification
- 10/10 runs of `MultiDatabaseSaveInProgressTest` pass in Release config locally.
- Full `MultiDatabaseTests` suite (31/31) passes locally.
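The inlined single-DB pause/resume lifecycle in `TaskCheckpointBasedOnAofSizeLimitAsync` can be sketched as follows (a hypothetical Python analogue; names mirror the description above, not the actual C# source):

```python
import asyncio

# Hypothetical sketch of the single-DB lifecycle now inlined in the
# AOF-size-driven path: pause one DB's lock, checkpoint, update save
# metadata, and resume that same lock in the outer finally.

async def aof_size_checkpoint(db: dict, locks: dict, fail: bool = False) -> None:
    paused_db_id = db["id"]              # hoisted to outer scope
    locks[paused_db_id] = True           # TryPauseCheckpoints analogue
    try:
        if fail:
            raise IOError("checkpoint failed")
        await asyncio.sleep(0)           # stand-in for TakeCheckpointAsync
        db["last_save"] = "updated"      # stand-in for UpdateLastSaveData
    finally:
        locks[paused_db_id] = False      # ResumeCheckpoints in the outer finally

db, locks = {"id": 0}, {}
asyncio.run(aof_size_checkpoint(db, locks))
```

The `finally` guarantees the lock is resumed even when the checkpoint itself throws, which is the single-DB behavior the refactor preserves.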