Implement graceful shutdown for Garnet server (#1382) #1551

Open
yuseok-kim-edushare wants to merge 83 commits into microsoft:main from yuseok-kim-edushare:yuseok-kim/graceful_shutdown
@yuseok-kim-edushare yuseok-kim-edushare commented Feb 8, 2026

📌 Summary

This PR introduces a robust, graceful shutdown mechanism for Garnet, specifically addressing the data loss issues identified in #1382.

The implementation ensures a deterministic, four-phase shutdown sequence: Quiesce → Ingress Throttling → Connection Draining → Final Data Persistence (AOF/Checkpoint). This ensures the server shuts down safely without losing in-flight data, especially when running as a Windows Service or under container orchestrators.


🛠 Key Changes

1️⃣ Graceful Shutdown Implementation (Lifecycle Management)

  • GarnetServer.ShutdownAsync: Centralized shutdown interface that orchestrates: quiescing all sessions, stopping listeners, waiting for active connections within a configurable timeout, and committing final state (AOF or Checkpoint — one call only, per reviewer guidance).
  • noSave flag: ShutdownAsync(noSave: bool) allows callers to skip data finalization entirely — used during forced/OS-initiated shutdowns where the cancellation token is already signalled.
  • Bounded data-finalization timeout: FinalizeDataAsync runs under an independent 15-second CancellationTokenSource so the final AOF commit/checkpoint always completes within the host's shutdown budget, regardless of the external token state.
  • Worker & Entry Point Integration: Updated the Windows Service worker (Garnet.worker) and the CLI entry point (Program.cs) to trigger the graceful shutdown sequence upon receiving termination signals (SIGINT, SIGTERM).
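
The ordering described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the PR's actual code: the class name `ShutdownSketch`, the delegate parameters, and the 50 ms polling interval are hypothetical stand-ins for the real server members. Only the step order, the `noSave` short-circuit, and the independent 15-second finalization bound come from the description.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class ShutdownSketch
{
    // Mirrors the 15-second finalization window from the description.
    private static readonly TimeSpan FinalizeTimeout = TimeSpan.FromSeconds(15);

    // All delegate parameters are hypothetical stand-ins for server members.
    public static async Task ShutdownAsync(
        Action beginQuiesce,
        Action stopListening,
        Func<long> activeConnections,
        Func<CancellationToken, Task> finalizeData,
        TimeSpan drainTimeout,
        bool noSave = false,
        CancellationToken token = default)
    {
        beginQuiesce();   // 1. reject new work on existing sessions
        stopListening();  // 2. stop accepting new connections

        // 3. Wait for connections to drain, bounded by drainTimeout and by
        //    the external token.
        using (var drainCts = CancellationTokenSource.CreateLinkedTokenSource(token))
        {
            drainCts.CancelAfter(drainTimeout);
            try
            {
                while (activeConnections() > 0)
                    await Task.Delay(TimeSpan.FromMilliseconds(50), drainCts.Token);
            }
            catch (OperationCanceledException) when (!token.IsCancellationRequested)
            {
                // Drain timeout elapsed: fall through to finalization.
                // External cancellation is not caught here and propagates.
            }
        }

        if (noSave) return; // forced shutdown: skip persistence entirely

        // 4. Finalize data under a token that is independent of the external one.
        using (var finalizeCts = new CancellationTokenSource(FinalizeTimeout))
        {
            await finalizeData(finalizeCts.Token);
        }
    }
}
```

Because the finalization token is created fresh rather than linked to the caller's token, a drain that times out still gets the full finalization window.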

2️⃣ Quiesce Mechanism (Pre-drain Ingress Gate)

  • BeginQuiesce / IsQuiescing: Added to IGarnetServer and implemented in GarnetServerBase using an atomic flag. GarnetServer.ShutdownAsync calls BeginQuiesce on all servers and the SubscribeBroker before stopping listeners.
  • Session-level rejection: RespServerSession checks IsQuiescing on its server; if quiescing, it completes the in-flight command, replies with a LOADING error on the next message, and closes the connection.
  • Pub/Sub gate: SubscribeBroker drops Publish/PublishNow calls while quiescing, preventing new fan-out during the drain window.
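
As an illustration of the atomic flag, a one-way gate could look like the sketch below. The type `QuiesceGate` and the `TryPublish` helper are hypothetical names chosen for this example; the PR implements `BeginQuiesce` / `IsQuiescing` on `GarnetServerBase` and a separate flag on `SubscribeBroker`, and its exact code may differ.

```csharp
using System;
using System.Threading;

// Hypothetical one-way gate; not the PR's actual type.
public sealed class QuiesceGate
{
    private int quiescing; // 0 = accepting work, 1 = quiescing

    public bool IsQuiescing => Volatile.Read(ref quiescing) == 1;

    // Idempotent: returns true only for the call that flips the flag.
    public bool BeginQuiesce() => Interlocked.Exchange(ref quiescing, 1) == 0;

    // Runs the publish action unless the gate is closed; returns whether it ran.
    public bool TryPublish(Action publish)
    {
        if (IsQuiescing) return false; // dropped during the drain window
        publish();
        return true;
    }
}
```

The flag never resets, so a check that observes `IsQuiescing == true` stays valid for the rest of the process lifetime.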

3️⃣ Configurable Shutdown Timeout

  • --shutdown-timeout <seconds> CLI option: New command-line argument parsed by both Program.cs and Garnet.worker/Program.cs.
  • GarnetServerOptions.ShutdownTimeoutSeconds: Exposes the timeout (default: 5 s, minimum recommended: 5 s to match Windows SCM pre-kill wait) so it is available to the host.
  • Host budget auto-calculation: Garnet.worker/Program.cs pre-parses --shutdown-timeout and sets HostOptions.ShutdownTimeout = connectionDrainTimeout + 20 s so the .NET host (and Windows SCM via WindowsServiceLifetime) waits long enough for both connection draining and data finalization to complete.
  • Worker receives typed timeout: Worker(string[] args, TimeSpan shutdownTimeout) uses the parsed value for ShutdownAsync instead of a hardcoded 5 s.
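
The budget arithmetic can be sketched as below. The helper names (`ShutdownBudget`, `ParseDrainTimeout`, `HostShutdownTimeout`) and the "positive integer" validation rule are assumptions made for this example; only the `--shutdown-timeout` flag, the 5 s default, and the 20 s finalization buffer come from the description.

```csharp
using System;

// Hypothetical helper; the PR does this pre-parsing inside Garnet.worker/Program.cs.
public static class ShutdownBudget
{
    private static readonly TimeSpan DefaultDrain = TimeSpan.FromSeconds(5);
    private static readonly TimeSpan FinalizationBuffer = TimeSpan.FromSeconds(20);

    // Scans args for "--shutdown-timeout <seconds>"; falls back to the default
    // when the flag is absent or the value is not a positive integer (assumed rule).
    public static TimeSpan ParseDrainTimeout(string[] args)
    {
        for (int i = 0; i < args.Length - 1; i++)
        {
            if (args[i] == "--shutdown-timeout"
                && int.TryParse(args[i + 1], out int seconds)
                && seconds > 0)
            {
                return TimeSpan.FromSeconds(seconds);
            }
        }
        return DefaultDrain;
    }

    // Host budget = connection-drain timeout + data-finalization buffer.
    public static TimeSpan HostShutdownTimeout(TimeSpan drainTimeout)
        => drainTimeout + FinalizationBuffer;
}
```

With the default, the host budget works out to 5 s + 20 s = 25 s; with `--shutdown-timeout 30` it is 50 s. The result is what would be assigned to `HostOptions.ShutdownTimeout`.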

4️⃣ Enhanced Connection Handling (Infrastructure)

  • Listener Control: StopListening on IGarnetServer stops accepting new connections at the socket level while maintaining existing ones.
  • Removed isListening flag: The volatile bool isListening guard has been removed from GarnetServerTcp — tests confirmed that catching ObjectDisposedException / SocketError.OperationAborted on the closed listen socket is sufficient to terminate the accept loop cleanly, with no additional flag needed.
  • Socket management alignment: GarnetServerTcp was updated to align with concurrent upstream refactoring of socket lifecycle management.

5️⃣ Comprehensive Test Coverage

  • ShutdownDataConsistencyTests — new test class covering data-persistence scenarios across the full AOF/Checkpoint matrix:

    • CheckpointThenAofCommit_DataConsistencyTest
    • AofCommitThenCheckpoint_DataConsistencyTest
    • AofCommitOnly_DataConsistencyTest
    • CheckpointOnly_DataConsistencyTest
    • NoFinalization_DataConsistencyTest
    • CheckpointThenMoreWritesThenAofCommit_DataConsistencyTest
    • AofCommitThenMoreWritesThenCheckpoint_DataConsistencyTest
  • GarnetServerTcpTests — updated with graceful-shutdown behavioral tests:

    • StopListeningPreventsNewConnections
    • StopListeningIdempotent
    • StopListeningDuringActiveConnectionAttempts
    • ShutdownAsyncCompletesGracefully
    • ShutdownAsyncRespectsTimeout
    • ShutdownAsyncRespectsCancellation
    • ShutdownAsyncWithAofCommit

💡 Why This Approach?

  • Data Integrity: A single final AOF commit or checkpoint (not both) after all connections are drained guarantees the store state is fully persisted. The bounded 15 s finalization window ensures this always completes within the host's shutdown budget.
  • Deterministic Ingress Shutdown: Quiescing sessions before closing the listen socket eliminates the race where in-flight commands arrive after the drain window starts, enabling a stricter "no new writes after shutdown signal" invariant.
  • Force-shutdown Safety: Passing noSave: true when the OS cancellation token is already triggered avoids a lengthy persistence step during hard kills, letting the process exit promptly.
  • Configurable Timeout: Making the connection-drain timeout a CLI option means operators can tune it to their workload without recompiling, while the automatic host-budget calculation prevents the Windows SCM from force-killing before finalization finishes.
  • Minimally Invasive Change: Modifications to Program.cs replace Thread.Sleep with a CancellationTokenSource-based wait, maintaining codebase consistency while enabling clean signal handling. Removing the isListening flag simplifies GarnetServerTcp, and the new socket tests show no correctness regression.
  • Architectural Alignment: The quiesce and shutdown designs extend existing Garnet abstractions (IGarnetServer, GarnetServerBase, SubscribeBroker) for a consistent shutdown experience across standalone and Windows Service hosting.

✅ Related Issues

  • Closes: #1382

  • Resolves: #1390

  • Reflects Discussion: r2535724513

  • Refactored: #1448

    Performance & Reliability Optimizations

    • Allocation Reduction: Active connection counting uses C# 7.0+ pattern matching (is) over LINQ (OfType<T>), reducing GC overhead on the critical shutdown path.
    • Signal Synchronization: ManualResetEventSlim / CancellationTokenSource-based waiting in the main entry point replaces Thread.Sleep for lightweight, efficient shutdown signal handling.
    • Log Level Tuning: Verbose shutdown-path logging downgraded from Information to Debug to reduce I/O overhead during the shutdown hot path.
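
A sketch of the counting change, assuming a hypothetical `CountingServer` type in place of `GarnetServerBase` and a plain array in place of the real server collection. It shows the `is` pattern replacing `OfType<T>()` and the `long` accumulator mentioned in the commit history.

```csharp
// Hypothetical stand-in for a server type that tracks its live connections.
public sealed class CountingServer
{
    public long ActiveConnections { get; set; }
}

public static class ConnectionCounter
{
    // Sums active connections without LINQ. The 'is' pattern tests and casts
    // in one step, avoiding the extra iterator that OfType<T>() allocates.
    public static long GetActiveConnectionCount(object[] servers)
    {
        long total = 0; // long, so summing across many servers cannot overflow int
        foreach (var server in servers)
        {
            if (server is CountingServer counting)
                total += counting.ActiveConnections;
        }
        return total;
    }
}
```

Entries that are not a `CountingServer` are skipped, which matches what `OfType<T>()` would have filtered out.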

The key additions since the original description are:

  • Quiesce mechanism (new pre-drain ingress gate on sessions and pub/sub)
  • noSave flag on ShutdownAsync for forced-shutdown fast exit. The SHUTDOWN command requested in #1004 (Requesting support for SHUTDOWN and CLIENT PAUSE, CLIENT UNPAUSE) could also reuse this logic
  • Configurable --shutdown-timeout
  • Removal of isListening flag from GarnetServerTcp (proven unnecessary)
  • ShutdownDataConsistencyTests replacing the older GracefulShutdownTests class with a full AOF/Checkpoint matrix
  • Bounded 15 s finalization window in FinalizeDataAsync

yuseok-kim-edushare and others added 30 commits November 27, 2025 00:03
Adds a graceful shutdown mechanism to the Garnet server, ensuring new connections are stopped, active connections are awaited, and data is safely persisted (AOF commit and checkpoint) before exit. Updates include new ShutdownAsync logic in GarnetServer, StopListening support in server classes, and integration of shutdown handling in both Windows service and console entry points.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
A default of 30 seconds will be more tolerant in production operations
(5 seconds is enough at our company's small scale, but large-scale production carries risks)
Change GetActiveConnectionCount return type and local accumulator from int to long and remove the redundant cast when adding garnetServerBase.get_conn_active(). This prevents potential integer overflow when summing active connections across multiple server instances; callers may need to handle the updated long return value.
Replace OfType<T> with is T pattern matching
Replace manual Stopwatch-based timeout logic with a linked CancellationTokenSource (linked to the external token) and CancelAfter(timeout). The loop now observes cts.Token for both external cancellation and timeout, and delay calls use the linked token. Improved exception handling: rethrow when the external token is canceled, log a warning when the timeout triggers, and centralize other error logging/retry behavior. This ensures correct timeout semantics and clearer error handling while waiting for active connections to close.
@yuseok-kim-edushare (Contributor, Author)

Looks like my PR has some conflicts after the recently merged PRs #1648, #1646, and #1645.
Since I'll be on PTO tomorrow, I'll take a look and fix it then.

@yuseok-kim-edushare (Contributor, Author)

I've updated my code to align with the recent socket handling changes from @hamdaankhalid's PRs (#1646, #1648).

I believe this PR is now in a mergeable state, as I have addressed the concerns you raised previously.

However, I am still a bit worried that my approach might not fully align with your broader roadmap. I would appreciate your feedback on these fixes and the recent alignment.

@badrishc, if you have some time, could you share some insights on what we might have missed? As a company, we are eager to learn from your experience with large-scale services.

@yuseok-kim-edushare (Contributor, Author)

#1714 taught me how to use async more correctly.
I will incorporate that PR's improvements into this PR soon.

- Reorder using directives via dotnet format

Copilot AI left a comment

Pull request overview

Copilot reviewed 10 out of 10 changed files in this pull request and generated 4 comments.



Comment thread libs/host/GarnetServer.cs Outdated
Comment thread libs/host/GarnetServer.cs
Comment thread main/GarnetServer/Program.cs Outdated
Comment thread hosting/Windows/Garnet.worker/Worker.cs Outdated
yuseok-kim-edushare and others added 3 commits May 3, 2026 01:44
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Introduce a quiesce mechanism to coordinate shutdown across servers and sessions. Adds BeginQuiesce and IsQuiescing to IGarnetServer and implements them in GarnetServerBase (atomic flag). GarnetServer now calls BeginQuiesce on servers and the subscribe broker before stopping listeners so sessions and pub/sub fan-out stop accepting new work. SubscribeBroker gains an isQuiescing flag and drops Publish/PublishNow calls during quiesce. RespServerSession rejects new incoming commands when its server is quiescing by returning a LOADING error and closing the connection. These changes ensure no new fan-out or concurrent writers occur during shutdown, enabling a more deterministic and safe shutdown sequence.
Introduce a shutdown-timeout CLI option and wire it through the host and worker to enable configurable graceful shutdowns. Program.cs now pre-parses --shutdown-timeout, sets HostOptions.ShutdownTimeout to shutdownTimeout + a 20s data-finalization buffer (so AOF commit/checkpoint can complete), and passes the parsed timeout to Worker. Worker accepts a TimeSpan shutdownTimeout and uses it for server.ShutdownAsync instead of the previous hardcoded 5 seconds. Options.cs exposes ShutdownTimeoutSeconds with validation and default 5, and GarnetServerOptions adds a ShutdownTimeoutSeconds field and documentation. This makes connection-drain time configurable while ensuring the host shutdown budget covers data finalization.
@yuseok-kim-edushare (Contributor, Author)

I added an option that controls the shutdown wait time, so I need to fix the option-related tests.


3 participants