Implement graceful shutdown for Garnet server ( #1382 )#1551
Implement graceful shutdown for Garnet server ( #1382 )#1551yuseok-kim-edushare wants to merge 83 commits intomicrosoft:mainfrom
Conversation
Adds a graceful shutdown mechanism to the Garnet server, ensuring new connections are stopped, active connections are awaited, and data is safely persisted (AOF commit and checkpoint) before exit. Updates include new ShutdownAsync logic in GarnetServer, StopListening support in server classes, and integration of shutdown handling in both Windows service and console entry points.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
default 30 seconds value will be more torelant in production operations (our company's small size is enough 5 seconds, but large scale production has risks)
Change GetActiveConnectionCount return type and local accumulator from int to long and remove the redundant cast when adding garnetServerBase.get_conn_active(). This prevents potential integer overflow when summing active connections across multiple server instances; callers may need to handle the updated long return value.
replace OfType<T> to is <T>
Replace manual Stopwatch-based timeout logic with a linked CancellationTokenSource (linked to the external token) and CancelAfter(timeout). The loop now observes cts.Token for both external cancellation and timeout, and delay calls use the linked token. Improved exception handling: rethrow when the external token is canceled, log a warning when the timeout triggers, and centralize other error logging/retry behavior. This ensures correct timeout semantics and clearer error handling while waiting for active connections to close.
|
I’ve updated my code to align with the recent socket handling changes from @hamdaankhalid’s PRs ( #1646, #1648 ). I believe this PR is now in a mergeable state, as I have addressed the concerns you raised previously. However, I am still a bit worried that my approach might not fully align with your broader roadmap. I would appreciate your feedback on these fixes and the recent alignment. @badrishc, if you have some time, could you share some insights on what we might have missed? As a company, we are eager to learn from your experience with large-scale services. |
|
#1714 teaches me, about how to use async more correctly, |
- CI failure about VSTHRD002 with related PR microsoft#1714
- To reflect microsoft#1714 PR's Code Changes : TakeCheckpoint replaced with TakeCheckpointAsync
- by dotnet format, re allocate using lines
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Introduce a quiesce mechanism to coordinate shutdown across servers and sessions. Adds BeginQuiesce and IsQuiescing to IGarnetServer and implements them in GarnetServerBase (atomic flag). GarnetServer now calls BeginQuiesce on servers and the subscribe broker before stopping listeners so sessions and pub/sub fan-out stop accepting new work. SubscribeBroker gains an isQuiescing flag and drops Publish/PublishNow calls during quiesce. RespServerSession rejects new incoming commands when its server is quiescing by returning a LOADING error and closing the connection. These changes ensure no new fan-out or concurrent writers occur during shutdown, enabling a more deterministic and safe shutdown sequence.
Introduce a shutdown-timeout CLI option and wire it through the host and worker to enable configurable graceful shutdowns. Program.cs now pre-parses --shutdown-timeout, sets HostOptions.ShutdownTimeout to shutdownTimeout + a 20s data-finalization buffer (so AOF commit/checkpoint can complete), and passes the parsed timeout to Worker. Worker accepts a TimeSpan shutdownTimeout and uses it for server.ShutdownAsync instead of the previous hardcoded 5 seconds. Options.cs exposes ShutdownTimeoutSeconds with validation and default 5, and GarnetServerOptions adds a ShutdownTimeoutSeconds field and documentation. This makes connection-drain time configurable while ensuring the host shutdown budget covers data finalization.
|
umm I add a option that control shudtown wait times, So I need to fix option related test codes;; |
📌 Summary
This PR introduces a robust, graceful shutdown mechanism for Garnet, specifically addressing the data loss issues identified in #1382.
The implementation ensures a deterministic, four-phase shutdown sequence: Quiesce → Ingress Throttling → Connection Draining → Final Data Persistence (AOF/Checkpoint). This ensures the server shuts down safely without losing in-flight data, especially when running as a Windows Service or under container orchestrators.
🛠 Key Changes
1️⃣ Graceful Shutdown Implementation (Lifecycle Management)
GarnetServer.ShutdownAsync: Centralized shutdown interface that orchestrates: quiescing all sessions, stopping listeners, waiting for active connections within a configurable timeout, and committing final state (AOF or Checkpoint — one call only, per reviewer guidance).noSaveflag:ShutdownAsync(noSave: bool)allows callers to skip data finalization entirely — used during forced/OS-initiated shutdowns where the cancellation token is already signalled.FinalizeDataAsyncruns under an independent 15-secondCancellationTokenSourceso the final AOF commit/checkpoint always completes within the host's shutdown budget, regardless of the external token state.Garnet.worker) and the CLI entry point (Program.cs) to trigger the graceful shutdown sequence upon receiving termination signals (SIGINT,SIGTERM).2️⃣ Quiesce Mechanism (Pre-drain Ingress Gate)
BeginQuiesce/IsQuiescing: Added toIGarnetServerand implemented inGarnetServerBaseusing an atomic flag.GarnetServer.ShutdownAsynccallsBeginQuiesceon all servers and theSubscribeBrokerbefore stopping listeners.RespServerSessionchecksIsQuiescingon its server; if quiescing, it completes the in-flight command, replies with aLOADINGerror on the next message, and closes the connection.SubscribeBrokerdropsPublish/PublishNowcalls while quiescing, preventing new fan-out during the drain window.3️⃣ Configurable Shutdown Timeout
--shutdown-timeout <seconds>CLI option: New command-line argument parsed by bothProgram.csandGarnet.worker/Program.cs.GarnetServerOptions.ShutdownTimeoutSeconds: Exposes the timeout (default: 5 s, minimum recommended: 5 s to match Windows SCM pre-kill wait) so it is available to the host.Garnet.worker/Program.cspre-parses--shutdown-timeoutand setsHostOptions.ShutdownTimeout = connectionDrainTimeout + 20 sso the .NET host (and Windows SCM viaWindowsServiceLifetime) waits long enough for both connection draining and data finalization to complete.Worker(string[] args, TimeSpan shutdownTimeout)uses the parsed value forShutdownAsyncinstead of a hardcoded 5 s.4️⃣ Enhanced Connection Handling (Infrastructure)
StopListeningonIGarnetServerstops accepting new connections at the socket level while maintaining existing ones.isListeningflag: Thevolatile bool isListeningguard has been removed fromGarnetServerTcp— tests confirmed that catchingObjectDisposedException/SocketError.OperationAbortedon the closed listen socket is sufficient to terminate the accept loop cleanly, with no additional flag needed.GarnetServerTcpwas updated to align with concurrent upstream refactoring of socket lifecycle management.5️⃣ Comprehensive Test Coverage
ShutdownDataConsistencyTests— new test class covering data-persistence scenarios across the full AOF/Checkpoint matrix:CheckpointThenAofCommit_DataConsistencyTestAofCommitThenCheckpoint_DataConsistencyTestAofCommitOnly_DataConsistencyTestCheckpointOnly_DataConsistencyTestNoFinalization_DataConsistencyTestCheckpointThenMoreWritesThenAofCommit_DataConsistencyTestAofCommitThenMoreWritesThenCheckpoint_DataConsistencyTestGarnetServerTcpTests— updated with graceful-shutdown behavioral tests:StopListeningPreventsNewConnectionsStopListeningIdempotentStopListeningDuringActiveConnectionAttemptsShutdownAsyncCompletesGracefullyShutdownAsyncRespectsTimeoutShutdownAsyncRespectsCancellationShutdownAsyncWithAofCommit💡 Why This Approach?
noSave: truewhen the OS cancellation token is already triggered avoids a lengthy persistence step during hard kills, letting the process exit promptly.Program.csreplaceThread.Sleepwith aCancellationTokenSource-based wait, maintaining codebase consistency while enabling clean signal handling. TheisListeningflag removal simplifiesGarnetServerTcpwith no correctness regression, as proven by the new socket tests.IGarnetServer,GarnetServerBase,SubscribeBroker) for a consistent shutdown experience across standalone and Windows Service hosting.✅ Related Issues
Closes: #1382
Resolves: #1390
Reflects Discussion: r2535724513
Refactored: #1448
Performance & Reliability Optimizations
is) over LINQ (OfType<T>), reducing GC overhead on the critical shutdown path.ManualResetEventSlim/CancellationTokenSource-based waiting in the main entry point replacesThread.Sleepfor lightweight, efficient shutdown signal handling.InformationtoDebugto reduce I/O overhead during the shutdown hot path.The key additions since the original description are:
noSaveflag onShutdownAsyncfor forced-shutdown fast-exit. Furthermore Requesting support for SHUTDOWN and CLIENT PAUSE, CLIENT UNPAUSE #1004 's Shutdonw Command request can be use this logic--shutdown-timeoutisListeningflag fromGarnetServerTcp(proven unnecessary)ShutdownDataConsistencyTestsreplacing the olderGracefulShutdownTestsclass with a full AOF/Checkpoint matrixFinalizeDataAsync