ADR: HA failover #2598
Summary of Changes
Hello @tac0turtle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces an Architectural Decision Record (ADR 023) detailing a strategy for High Availability (HA) failover for sequencers. The core objective is to maintain live L2 block production and recover leadership deterministically when a primary sequencer becomes unhealthy or unreachable. The ADR proposes a '1 active + 1 failover' model and evaluates two distinct control plane designs: a Raft-based conductor system and a lease/lock-based hot standby promotion mechanism. It also specifies new administrative RPCs for `ev-node` to facilitate sequencer control by these external systems, ensuring robust and observable failover capabilities.
Highlights
- Sequencer HA Strategy: Proposes operating with a "1 active + 1 failover" sequencer setup to ensure continuous L2 block production and leadership recovery without split-brain issues, regardless of the chosen control plane.
- Control Plane Designs: Evaluates two primary designs for the control plane: Design A (Rafted Conductor), which uses a Raft cluster for leader election, and Design B (Lease/Lock), where a hot standby promotes itself by acquiring a lease (e.g., Kubernetes Lease or external KV).
- New Admin RPCs: Introduces new additive Admin RPCs (`StartSequencer`, `StopSequencer`, `SequencerHealthy`, `Status`) within `ev-node` to enable external control planes to manage sequencer processes, ensuring no breaking changes to existing RPCs.
- Key Requirements and SLOs: Outlines critical user requirements for the failover system, including no split-brain, deterministic recovery, fast failover (p50 ≤ 15s, p95 ≤ 45s), operational clarity, and support for zero-downtime upgrades.
Code Review
This pull request introduces a well-written and comprehensive Architecture Decision Record (ADR) for a high-availability sequencer failover system. The ADR thoroughly compares two viable approaches: a Raft-based conductor and a lease/lock mechanism, covering all necessary aspects from requirements and design to security and testing. My feedback consists of a couple of minor suggestions aimed at enhancing the document's clarity.
@@ -0,0 +1,133 @@
# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover
The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock.
## Changelog

- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics.
The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition for what an "escape-hatch" entails in this context.
Claude finished @tac0turtle's task.
Pull Request Review
I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need.
Overall Assessment
This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable. The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.
Just to clarify: there will be n nodes from the same "operator". They will all run continuously, but only one will run as the aggregator; the rest will not (though they still run with the same private node key). If someone detects that the aggregator has a problem, another of those n nodes that is not an aggregator will be notified via the Raft network and will take over the aggregator role.
### Negative
- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
Depending on the chosen implementation, the sequencer stack may still have a single point of failure (e.g., the KV store).
The failure being that the other node is not up to date with the latest state?
I was thinking about the external kv store availability.
For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode.
Assuming the external KV store is exposed via TCP, high availability must cover:
- DNS resolution
- the KV store service itself
- a load balancer in front of the KV store
If the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if the ev-node is running in HA mode.
The KV store will always be local to the node; we don't support adding remote KV stores (DBs).
- None beyond existing node telemetry; no user data added.

### Testing plan
- Kill active sequencer → verify failover within SLO; assert **no double leadership**.
For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader.
Yes, it is. There will be a new API that enables block production.
The new API is missing from the ADR. I'll look at adding it to the ADR to provide more information.
Good start! I added some questions
  bool from_unsafe_head = 1; // if false, uses safe head per policy
  bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
  string reason = 3; // audit string
  string idempotency_key = 4; // optional, de-duplicate retries
Can you elaborate on the usage of the field, please?
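One plausible reading of the `idempotency_key` comment ("optional, de-duplicate retries") is that a retried request with the same key replays the earlier result instead of starting the sequencer twice. The Go sketch below illustrates only that idea; `Deduper`, `StartResult`, and `Do` are hypothetical names, and the ADR excerpt does not confirm that ev-node handles the field this way.

```go
package main

import (
	"fmt"
	"sync"
)

// StartResult is a hypothetical stand-in for a StartSequencer response.
type StartResult struct {
	Started bool
}

// Deduper remembers the result of each idempotency key it has seen.
// It is an illustration, not part of ev-node.
type Deduper struct {
	mu   sync.Mutex
	seen map[string]StartResult
}

func NewDeduper() *Deduper {
	return &Deduper{seen: make(map[string]StartResult)}
}

// Do runs start at most once per non-empty key and replays the stored
// result on retries. An empty key disables de-duplication.
func (d *Deduper) Do(key string, start func() StartResult) StartResult {
	if key == "" {
		return start()
	}
	d.mu.Lock()
	defer d.mu.Unlock()
	if res, ok := d.seen[key]; ok {
		return res
	}
	res := start()
	d.seen[key] = res
	return res
}

func main() {
	calls := 0
	start := func() StartResult { calls++; return StartResult{Started: true} }
	d := NewDeduper()
	d.Do("req-1", start)
	d.Do("req-1", start) // retry: result is replayed, start is not called again
	fmt.Println(calls)   // 1
}
```

A real implementation would also need to bound or expire the stored keys, which this sketch leaves out.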
}

message StopSequencerRequest {
  bytes lease_token = 1;
🤔 Is the lease token really required in any of the processes? The conductor is managing the leadership lock
  UnsafeHead unsafe = 3;
}

message CompleteHandoffRequest {
Can there be multiple hand-off processes in flight? What happens if the hand-off does not complete? Is there a timeout to consider?
The handoff paths are manually triggered, so it is a "controlled environment". Handoffs will be treated as FCFS (first come, first served), meaning that if one is in flight, others are rejected. If the handoff is not completed within the timeout, it will fall back to another if present. I'll expand on that in the ADR.
- StopSequencer: Hard stop with optional “force” semantics.
- PrepareHandoff / CompleteHandoff: Explicit, auditable, two-phase, blue/green leadership transfer.
- Health / Status: Health probes and machine-readable node + leader state.
We would need some "dead man switch" for the node to refresh the "running" permission, in case the conductor dies before stopping the node. The "start" command can come with a timeout that defines the range for the max refresh interval.
message PrepareHandoffRequest {
  bytes lease_token = 1;
  string target_id = 2; // logical target node ID
Can you elaborate on this field please? It is not clear to me what it should contain or how it would be used
This will be the address of the node key.
}

message LeadershipTerm {
  uint64 term = 1; // monotonic term/epoch for fencing
Can you elaborate on this field? Is this a block number or how can this be used for the starting node?
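For context, in Raft-style systems a term is usually an election counter rather than a block number: it increases each time leadership changes, and a receiver rejects commands stamped with an older term. The diff comment ("monotonic term/epoch for fencing") is consistent with that reading, but the ADR excerpt does not spell it out, so the Go sketch below is an assumption. `TermFence` and `Admit` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// TermFence rejects commands that carry a term older than the newest one seen.
// It is a hypothetical illustration of fencing, not ev-node code.
type TermFence struct {
	mu      sync.Mutex
	highest uint64 // newest term observed so far
}

// Admit reports whether a command stamped with term may proceed.
// Equal terms are admitted so the current leader can repeat commands.
func (f *TermFence) Admit(term uint64) bool {
	f.mu.Lock()
	defer f.mu.Unlock()
	if term < f.highest {
		return false // stale leader: fenced off
	}
	f.highest = term
	return true
}

func main() {
	var fence TermFence
	fmt.Println(fence.Admit(1)) // true
	fmt.Println(fence.Admit(2)) // true
	fmt.Println(fence.Admit(1)) // false
	fmt.Println(fence.Admit(2)) // true
}
```

Under this reading, a starting node would not derive the term from chain state; it would receive it from the control plane along with the start command.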
  bool stopped = 1;
}

message PrepareHandoffRequest {
The workflow is not fully clear to me. Is this sent to the current leader, the new leader or both sequencers?
- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
- Additional components (sidecars, proxies) increase deployment surface.
It would be helpful for me to have some sequence diagrams that show how the handover flow works: the happy path first, then the unhappy paths for the edge cases.
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@            Coverage Diff             @@
##             main    #2598      +/-   ##
==========================================
- Coverage   72.41%   72.30%   -0.11%
==========================================
  Files          72       72
  Lines        7394     7406      +12
==========================================
+ Hits         5354     5355       +1
- Misses       1600     1611      +11
  Partials      440      440
Overview
This PR proposes an approach to an HA failover system for the off chance that a sequencer goes down.