
Conversation

tac0turtle
Contributor

Overview

This PR proposes an approach to an HA failover system for the off chance that a sequencer goes down.

Contributor

@gemini-code-assist bot left a comment

Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an Architectural Decision Record (ADR 023) detailing a strategy for High Availability (HA) failover for sequencers. The core objective is to maintain live L2 block production and recover leadership deterministically when a primary sequencer becomes unhealthy or unreachable. The ADR proposes a '1 active + 1 failover' model and evaluates two distinct control plane designs: a Raft-based conductor system and a lease/lock-based hot standby promotion mechanism. It also specifies new administrative RPCs for ev-node to facilitate sequencer control by these external systems, ensuring robust and observable failover capabilities.

Highlights

  • Sequencer HA Strategy: Proposes operating with a "1 active + 1 failover" sequencer setup to ensure continuous L2 block production and leadership recovery without split-brain issues, regardless of the chosen control plane.
  • Control Plane Designs: Evaluates two primary designs for the control plane: Design A (Rafted Conductor), which uses a Raft cluster for leader election, and Design B (Lease/Lock), where a hot standby promotes itself by acquiring a lease (e.g., Kubernetes Lease or external KV).
  • New Admin RPCs: Introduces new additive Admin RPCs (StartSequencer, StopSequencer, SequencerHealthy, Status) within ev-node to enable external control planes to manage sequencer processes, ensuring no breaking changes to existing RPCs.
  • Key Requirements and SLOs: Outlines critical user requirements for the failover system, including no split-brain, deterministic recovery, fast failover (p50 ≤ 15s, p95 ≤ 45s), operational clarity, and support for zero-downtime upgrades.
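
The no-split-brain requirement above rests on fencing with a monotonic leadership term (the `term` field in the proposed `LeadershipTerm` message). A toy sketch of the idea, purely illustrative; the class and method names are invented for this example and are not ev-node's API:

```python
class Sequencer:
    """Toy model of term-based fencing (illustrative only, not ev-node code)."""

    def __init__(self):
        self.highest_term_seen = 0
        self.blocks = []

    def propose_block(self, term: int, block: str) -> bool:
        # Reject any proposal carrying a term older than the highest term
        # this node has observed: a demoted leader is "fenced out" even if
        # it still believes it is active.
        if term < self.highest_term_seen:
            return False
        self.highest_term_seen = term
        self.blocks.append((term, block))
        return True

seq = Sequencer()
assert seq.propose_block(1, "b1")      # leader with term 1 produces normally
assert seq.propose_block(2, "b2")      # failover promotes a new leader with term 2
assert not seq.propose_block(1, "b3")  # stale leader's write is rejected: no split-brain
```

The same comparison works regardless of whether the term comes from a Raft election (Design A) or a lease acquisition epoch (Design B).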

Contributor

github-actions bot commented Aug 21, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ❌ failed (14) | ✅ passed | Aug 29, 2025, 8:44 AM |

Contributor

github-actions bot commented Aug 21, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2598/

Built to branch main at 2025-08-29 08:44 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a well-written and comprehensive Architecture Decision Record (ADR) for a high-availability sequencer failover system. The ADR thoroughly compares two viable approaches: a Raft-based conductor and a lease/lock mechanism, covering all necessary aspects from requirements and design to security and testing. My feedback consists of a couple of minor suggestions aimed at enhancing the document's clarity.

@@ -0,0 +1,133 @@

# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover
Contributor

medium

The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock.


## Changelog

- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics.
Contributor

medium

The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition for what an "escape-hatch" entails in this context.

Contributor

claude bot commented Aug 25, 2025

Claude finished @tac0turtle's task


Pull Request Review

I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need. Here's my detailed feedback:

Code Quality & Best Practices

Strengths:

  • Excellent documentation structure: The ADR follows proper format with clear context, alternatives, decision rationale, and consequences
  • Well-defined API: The protobuf schema is clean, well-commented, and follows good naming conventions
  • Comprehensive coverage: Addresses operational concerns, security, testing, and monitoring
  • Industry alignment: References OP Stack's op-conductor showing awareness of proven patterns

Areas for improvement:

  1. Missing API versioning strategy: While the proto package includes v1, there's no discussion of how breaking changes to the admin API will be handled in future versions

  2. Error handling specificity: The error semantics section could benefit from more specific error codes for common failure scenarios (e.g., different types of fencing violations)

Security Review

Strong points:

  • mTLS + RBAC authentication properly specified
  • Fencing mechanisms to prevent split-brain scenarios
  • Audit logging for all critical operations
  • Admin API properly isolated from public endpoints

Concerns:

  1. lease_token security (control.proto:45, 57, 68): The lease token is defined as bytes but lacks specification of:

    • Token generation/validation algorithms
    • Expiration mechanisms
    • Cryptographic properties (should it be signed? encrypted?)
  2. Break-glass procedures (ADR-023-sequencer-recovery.md:204): Mentioned but not detailed. Critical security operations need explicit procedures and access controls

  3. Handoff ticket security (control.proto:75): Similar to lease tokens, handoff tickets need cryptographic specification to prevent forgery
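
One plausible way to address the lease token and handoff ticket concerns above is an HMAC-signed, expiring token. The following is a hedged sketch of what such a specification could look like; the secret handling, claim names, and encoding here are assumptions for illustration, not the ADR's design:

```python
import base64
import hashlib
import hmac
import json

# Illustrative only: a real control plane needs proper key management and rotation.
SECRET = b"control-plane-secret"

def issue_lease_token(holder: str, term: int, ttl_s: float, now: float) -> bytes:
    """Issue a token binding a holder to a leadership term with an expiry."""
    payload = json.dumps({"holder": holder, "term": term, "exp": now + ttl_s}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # Base64 never emits ".", so it is a safe separator between payload and signature.
    return base64.b64encode(payload) + b"." + base64.b64encode(sig)

def verify_lease_token(token: bytes, now: float):
    """Return the claims if the signature is valid and unexpired, else None."""
    p_b64, s_b64 = token.split(b".")
    payload = base64.b64decode(p_b64)
    sig = base64.b64decode(s_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered token
    claims = json.loads(payload)
    if now > claims["exp"]:
        return None  # expired token
    return claims

tok = issue_lease_token("seq-b", 7, 30, now=1000.0)
assert verify_lease_token(tok, now=1010.0)["term"] == 7
assert verify_lease_token(tok, now=1031.0) is None           # expired
assert verify_lease_token(b"f" + tok[1:], now=1010.0) is None  # tampered payload
```

Tying the token to the fencing term means a stale token fails both the expiry check and the term comparison at the node.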

Performance Considerations

Positive aspects:

  • Acknowledges minimal overhead for both designs
  • Realistic SLO targets (p50 ≤ 15s, p95 ≤ 45s failover)
  • Separate admin API prevents performance impact on public endpoints

Potential issues:

  1. Raft snapshot frequency: No specification of snapshot intervals which could impact memory usage and recovery time
  2. Lease renewal frequency: Design B lacks specification of renewal intervals vs. detection time trade-offs
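
The renewal-interval trade-off in item 2 can be made concrete with a little arithmetic: a leader that dies right after a successful renewal holds the lease until it expires, so worst-case detection is bounded by the lease TTL, not the renewal interval. A small sketch (the numbers are illustrative, not proposed values):

```python
def worst_case_detection_s(lease_ttl_s: float, renew_interval_s: float) -> float:
    # A leader that crashes immediately after renewing keeps the lease until
    # the full TTL elapses, so standbys cannot detect the failure any sooner.
    assert renew_interval_s < lease_ttl_s, "renewals must outpace expiry"
    return lease_ttl_s

def worst_case_failover_s(lease_ttl_s: float, renew_interval_s: float,
                          promote_s: float) -> float:
    # Total failover time = detection + time to promote the standby.
    return worst_case_detection_s(lease_ttl_s, renew_interval_s) + promote_s

# Example: a 10s TTL renewed every 3s, plus roughly 4s to promote the standby,
# stays within the ADR's p50 <= 15s failover target.
assert worst_case_failover_s(10, 3, 4) == 14
```

Shrinking the TTL speeds up detection but raises the false-positive rate under transient KV-store or network latency, which is exactly the trade-off the ADR should pin down.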

Architectural Review

Design strengths:

  • Clear separation between control plane (Raft/Lease) and data plane (ev-node)
  • Flexible approach supporting both designs based on operational needs
  • UnsafeHead tracking ensures proper state continuity
  • Two-phase handoff (PrepareHandoff/CompleteHandoff) enables zero-downtime upgrades

Missing considerations:

  1. Observability gaps:

    • No specification for health check intervals
    • Missing metrics for handoff duration/success rates
    • No alerting thresholds defined
  2. Edge case handling:

    • What happens if both nodes fail simultaneously?
    • Recovery procedures when control plane data is corrupted
    • Network partition scenarios beyond what's briefly mentioned
  3. State synchronization: How does a standby node ensure it's caught up before promotion? The ADR mentions this briefly but lacks implementation details.

Test Coverage Assessment

Good coverage planned:

  • Kill tests for failover validation
  • Partition tests for split-brain prevention
  • Blue/green handoff testing
  • Long-duration outage drills

Missing test scenarios:

  1. Concurrent API calls: What happens if multiple StartSequencer calls happen simultaneously?
  2. Token/ticket replay attacks: Security testing for cryptographic components
  3. Performance degradation testing: How does the system behave under resource pressure?
  4. Configuration drift: Testing mismatched configurations between nodes

Technical Issues

  1. Proto field consistency: The HealthResponse message uses uint64 for l2_number while UnsafeHead uses the same type - good consistency.

  2. Missing validation: No specification of input validation (e.g., target_id format in PrepareHandoffRequest)

  3. Idempotency implementation: While idempotency keys are included, there's no specification of how they're validated or stored
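
Item 3's question about how idempotency keys are validated and stored could be answered with a replay cache. Below is a sketch of one assumed design (the retention window and return-cached-result semantics are not specified by the ADR):

```python
class IdempotencyCache:
    """Sketch of server-side idempotency-key handling (assumed semantics):
    a replayed key returns the stored result instead of re-executing the
    operation, and entries expire after a retention window."""

    def __init__(self, retention_s: float = 600.0):
        self.retention_s = retention_s
        self._seen = {}  # key -> (result, stored_at)

    def execute(self, key: str, op, now: float):
        entry = self._seen.get(key)
        if entry is not None and now - entry[1] < self.retention_s:
            return entry[0]  # duplicate retry: replay the cached result
        result = op()
        self._seen[key] = (result, now)
        return result

calls = []
cache = IdempotencyCache()
start = lambda: calls.append("started") or "ok"
assert cache.execute("k1", start, now=0.0) == "ok"
assert cache.execute("k1", start, now=1.0) == "ok"  # retried call is deduplicated
assert calls == ["started"]                          # side effect ran exactly once
```

A real implementation would also need to bound the cache size and decide whether keys are scoped per RPC method or globally.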

Recommendations

High Priority:

  1. Add cryptographic specifications for lease_token and handoff_ticket
  2. Define break-glass procedures and access controls
  3. Specify health check and lease renewal intervals
  4. Add input validation requirements

Medium Priority:

  1. Define API versioning and backward compatibility strategy
  2. Add more detailed error scenarios and recovery procedures
  3. Specify observability metrics thresholds and alerting rules
  4. Add edge case handling procedures

Low Priority:

  1. Consider adding request/response size limits
  2. Add performance benchmarking requirements to testing plan

Overall Assessment

This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable.

The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.

@randygrok
Contributor

Just to clarify: there will be n nodes from the same "operator".

They will all be running at all times; only one runs as the aggregator, the rest do not (but they still run with the same priv node key).

If someone detects that the aggregator has a problem, another of those n nodes that is not an aggregator will be notified via the Raft network and will take over the aggregator role.

Questions

  • Is it possible today to run two nodes with the same priv key, even if one is not a validator?
  • How does a node become the aggregator? By restarting it with the aggregator flag?


### Negative
- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
Contributor

@auricom commented Aug 25, 2025

Depending on the chosen implementation, the sequencer stack may still have a single point of failure (e.g. the KV store).

Contributor Author

the failure being that the other node is not up to date with the latest state?

Contributor

@auricom commented Aug 25, 2025

I was thinking about the external kv store availability.

For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode.

Assuming the external KV store is exposed via TCP, high availability must cover:

  • DNS resolution
  • the KV store service itself
  • a load balancer in front of the KV store

If the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if the ev-node is running in HA mode.

Contributor Author

the kv store will always be local to the node; we don't support adding remote kv stores (dbs)

- None beyond existing node telemetry; no user data added.

### Testing plan
- Kill active sequencer → verify failover within SLO; assert **no double leadership**.
Contributor

For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader.

@tac0turtle
Contributor Author

  • Is it possible now to run two nodes with same priv key even if one is not validator?

yes it is; there will be a new API that enables block production

  • How a node becomes aggregator, by restarting it with the aggregator flag?

via the new API. It's missing from the ADR; I'll look at adding it to the ADR to provide more information.

Contributor

@alpe left a comment

Good start! I added some questions

bool from_unsafe_head = 1; // if false, uses safe head per policy
bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
string reason = 3; // audit string
string idempotency_key = 4; // optional, de-duplicate retries
Contributor

Can you elaborate on the usage of the field, please?

}

message StopSequencerRequest {
bytes lease_token = 1;
Contributor

🤔 Is the lease token really required in any of the processes? The conductor is managing the leadership lock

UnsafeHead unsafe = 3;
}

message CompleteHandoffRequest {
Contributor

Can there be multiple hand-off processes in flight? What happens if a hand-off does not complete? Is there a timeout to consider?

Contributor Author

the handoff paths are manually triggered, so it's a "controlled environment". Handoffs are treated as FCFS, meaning that if one is in flight, others are rejected; if a handoff is not completed within the timeout, it will fall back to another target if present. I'll expand on that in the ADR.
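
The FCFS-with-timeout semantics described in this reply can be sketched as a small state machine (illustrative only; the names and exact timeout handling are assumptions, not ev-node code):

```python
class HandoffCoordinator:
    """Sketch of FCFS handoff semantics: one handoff in flight at a time,
    later requests rejected, and an expired handoff abandoned so a fallback
    target can start a new one."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.inflight = None  # (target_id, started_at) or None

    def prepare(self, target_id: str, now: float) -> bool:
        if self.inflight and now - self.inflight[1] < self.timeout_s:
            return False                  # FCFS: reject while one is in flight
        self.inflight = (target_id, now)  # fresh handoff, or expired one replaced
        return True

    def complete(self, target_id: str, now: float) -> bool:
        if not self.inflight or self.inflight[0] != target_id:
            return False
        if now - self.inflight[1] >= self.timeout_s:
            self.inflight = None          # timed out: target must re-prepare
            return False
        self.inflight = None
        return True

h = HandoffCoordinator(timeout_s=30)
assert h.prepare("node-b", now=0)
assert not h.prepare("node-c", now=5)    # rejected while node-b's handoff is in flight
assert not h.complete("node-b", now=40)  # node-b missed the timeout
assert h.prepare("node-c", now=41)       # fallback target can now start
assert h.complete("node-c", now=45)
```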

- StopSequencer: Hard stop with optional “force” semantics.
- PrepareHandoff / CompleteHandoff: Explicit, auditable, two-phase, blue/green leadership transfer.
- Health / Status: Health probes and machine-readable node + leader state.

Contributor

We would need some "dead man's switch" for the node to refresh its "running" permission, in case the conductor dies before stopping the node. The "start" command can come with a timeout that defines the max refresh interval.
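
The dead man's switch proposed here can be modeled as a refresh deadline on the node side. A sketch under the assumed design that the start command carries a max refresh interval (names are invented for illustration):

```python
class RunningPermission:
    """Sketch of a node-side dead man's switch (assumed design): the node
    stops producing blocks on its own if the conductor stops refreshing
    the permission within the interval set by the start command."""

    def __init__(self, max_refresh_interval_s: float, now: float):
        self.max_refresh_interval_s = max_refresh_interval_s
        self.last_refresh = now

    def refresh(self, now: float):
        # Called on each heartbeat from the conductor.
        self.last_refresh = now

    def may_produce(self, now: float) -> bool:
        # Checked before producing each block.
        return now - self.last_refresh < self.max_refresh_interval_s

perm = RunningPermission(max_refresh_interval_s=10, now=0)
assert perm.may_produce(now=5)
perm.refresh(now=8)
assert perm.may_produce(now=15)      # refreshed in time
assert not perm.may_produce(now=20)  # conductor died: node self-stops
```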


message PrepareHandoffRequest {
bytes lease_token = 1;
string target_id = 2; // logical target node ID
Contributor

Can you elaborate on this field please? It is not clear to me what it should contain or how it would be used

Contributor Author

this will be the address of the node key.

}

message LeadershipTerm {
uint64 term = 1; // monotonic term/epoch for fencing
Contributor

Can you elaborate on this field? Is this a block number, or how can it be used by the starting node?

bool stopped = 1;
}

message PrepareHandoffRequest {
Contributor

The workflow is not fully clear to me. Is this sent to the current leader, the new leader, or both sequencers?

- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
- Additional components (sidecars, proxies) increase deployment surface.

Contributor

It would be helpful for me to have some sequence diagrams showing how the handover works: happy path first, then unhappy paths for the edge cases.


codecov bot commented Aug 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.30%. Comparing base (f483445) to head (5e0e181).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2598      +/-   ##
==========================================
- Coverage   72.41%   72.30%   -0.11%     
==========================================
  Files          72       72              
  Lines        7394     7406      +12     
==========================================
+ Hits         5354     5355       +1     
- Misses       1600     1611      +11     
  Partials      440      440              
| Flag | Coverage Δ |
| --- | --- |
| combined | 72.30% <ø> (-0.11%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
