
Conversation

tac0turtle
Contributor

Overview

This PR proposes an approach to an HA failover system for the off chance that a sequencer goes down.

Contributor

@gemini-code-assist bot left a comment

Summary of Changes

Hello @tac0turtle, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces an Architectural Decision Record (ADR 023) detailing a strategy for High Availability (HA) failover for sequencers. The core objective is to maintain live L2 block production and recover leadership deterministically when a primary sequencer becomes unhealthy or unreachable. The ADR proposes a '1 active + 1 failover' model and evaluates two distinct control plane designs: a Raft-based conductor system and a lease/lock-based hot standby promotion mechanism. It also specifies new administrative RPCs for ev-node to facilitate sequencer control by these external systems, ensuring robust and observable failover capabilities.

Highlights

  • Sequencer HA Strategy: Proposes operating with a "1 active + 1 failover" sequencer setup to ensure continuous L2 block production and leadership recovery without split-brain issues, regardless of the chosen control plane.
  • Control Plane Designs: Evaluates two primary designs for the control plane: Design A (Rafted Conductor), which uses a Raft cluster for leader election, and Design B (Lease/Lock), where a hot standby promotes itself by acquiring a lease (e.g., Kubernetes Lease or external KV).
  • New Admin RPCs: Introduces new additive Admin RPCs (StartSequencer, StopSequencer, SequencerHealthy, Status) within ev-node to enable external control planes to manage sequencer processes, ensuring no breaking changes to existing RPCs.
  • Key Requirements and SLOs: Outlines critical user requirements for the failover system, including no split-brain, deterministic recovery, fast failover (p50 ≤ 15s, p95 ≤ 45s), operational clarity, and support for zero-downtime upgrades.
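
The no-split-brain requirement above rests on fencing with a monotonic leadership term (the `term` field in the proposed `LeadershipTerm` message). A toy sketch of the idea, purely illustrative; the class and method names are invented for this example and are not ev-node's API:

```python
class Sequencer:
    """Toy model of term-based fencing (illustrative only, not ev-node code)."""

    def __init__(self):
        self.highest_term_seen = 0
        self.blocks = []

    def propose_block(self, term: int, block: str) -> bool:
        # Reject any proposal carrying a term older than the highest term
        # this node has observed: a demoted leader is "fenced out" even if
        # it still believes it is active.
        if term < self.highest_term_seen:
            return False
        self.highest_term_seen = term
        self.blocks.append((term, block))
        return True

seq = Sequencer()
assert seq.propose_block(1, "b1")      # leader with term 1 produces normally
assert seq.propose_block(2, "b2")      # failover promotes a new leader with term 2
assert not seq.propose_block(1, "b3")  # stale leader's write is rejected: no split-brain
```

The same comparison works regardless of whether the term comes from a Raft election (Design A) or a lease acquisition epoch (Design B).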

Contributor

github-actions bot commented Aug 21, 2025

The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).

| Build | Format | Lint | Breaking | Updated (UTC) |
| --- | --- | --- | --- | --- |
| ✅ passed | ⏩ skipped | ❌ failed (14) | ✅ passed | Aug 29, 2025, 8:44 AM |

Contributor

github-actions bot commented Aug 21, 2025

PR Preview Action v1.6.2

🚀 View preview at
https://evstack.github.io/docs-preview/pr-2598/

Built to branch main at 2025-08-29 08:44 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Contributor

@gemini-code-assist bot left a comment

Code Review

This pull request introduces a well-written and comprehensive Architecture Decision Record (ADR) for a high-availability sequencer failover system. The ADR thoroughly compares two viable approaches: a Raft-based conductor and a lease/lock mechanism, covering all necessary aspects from requirements and design to security and testing. My feedback consists of a couple of minor suggestions aimed at enhancing the document's clarity.

@@ -0,0 +1,133 @@

# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover
Contributor

medium

The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock.


## Changelog

- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics.
Contributor

medium

The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition for what an "escape-hatch" entails in this context.

Contributor

claude bot commented Aug 25, 2025

Claude finished @tac0turtle's task


Pull Request Review

I've thoroughly reviewed this ADR for HA sequencer failover. Overall, this is a well-structured and comprehensive architectural decision record that addresses a critical production need. Here's my detailed feedback:

Code Quality & Best Practices

Strengths:

  • Excellent documentation structure: The ADR follows proper format with clear context, alternatives, decision rationale, and consequences
  • Well-defined API: The protobuf schema is clean, well-commented, and follows good naming conventions
  • Comprehensive coverage: Addresses operational concerns, security, testing, and monitoring
  • Industry alignment: References OP Stack's op-conductor showing awareness of proven patterns

Areas for improvement:

  1. Missing API versioning strategy: While the proto package includes v1, there's no discussion of how breaking changes to the admin API will be handled in future versions

  2. Error handling specificity: The error semantics section could benefit from more specific error codes for common failure scenarios (e.g., different types of fencing violations)

Security Review

Strong points:

  • mTLS + RBAC authentication properly specified
  • Fencing mechanisms to prevent split-brain scenarios
  • Audit logging for all critical operations
  • Admin API properly isolated from public endpoints

Concerns:

  1. lease_token security (control.proto:45, 57, 68): The lease token is defined as bytes but lacks specification of:

    • Token generation/validation algorithms
    • Expiration mechanisms
    • Cryptographic properties (should it be signed? encrypted?)
  2. Break-glass procedures (ADR-023-sequencer-recovery.md:204): Mentioned but not detailed. Critical security operations need explicit procedures and access controls

  3. Handoff ticket security (control.proto:75): Similar to lease tokens, handoff tickets need cryptographic specification to prevent forgery
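
One plausible way to address the lease token and handoff ticket concerns above is an HMAC-signed, expiring token. The following is a hedged sketch of what such a specification could look like; the secret handling, claim names, and encoding here are assumptions for illustration, not the ADR's design:

```python
import base64
import hashlib
import hmac
import json

# Illustrative only: a real control plane needs proper key management and rotation.
SECRET = b"control-plane-secret"

def issue_lease_token(holder: str, term: int, ttl_s: float, now: float) -> bytes:
    """Issue a token binding a holder to a leadership term with an expiry."""
    payload = json.dumps({"holder": holder, "term": term, "exp": now + ttl_s}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).digest()
    # Base64 never emits ".", so it is a safe separator between payload and signature.
    return base64.b64encode(payload) + b"." + base64.b64encode(sig)

def verify_lease_token(token: bytes, now: float):
    """Return the claims if the signature is valid and unexpired, else None."""
    p_b64, s_b64 = token.split(b".")
    payload = base64.b64decode(p_b64)
    sig = base64.b64decode(s_b64)
    expected = hmac.new(SECRET, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):
        return None  # forged or tampered token
    claims = json.loads(payload)
    if now > claims["exp"]:
        return None  # expired token
    return claims

tok = issue_lease_token("seq-b", 7, 30, now=1000.0)
assert verify_lease_token(tok, now=1010.0)["term"] == 7
assert verify_lease_token(tok, now=1031.0) is None           # expired
assert verify_lease_token(b"f" + tok[1:], now=1010.0) is None  # tampered payload
```

Tying the token to the fencing term means a stale token fails both the expiry check and the term comparison at the node.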

Performance Considerations

Positive aspects:

  • Acknowledges minimal overhead for both designs
  • Realistic SLO targets (p50 ≤ 15s, p95 ≤ 45s failover)
  • Separate admin API prevents performance impact on public endpoints

Potential issues:

  1. Raft snapshot frequency: No specification of snapshot intervals which could impact memory usage and recovery time
  2. Lease renewal frequency: Design B lacks specification of renewal intervals vs. detection time trade-offs
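
The renewal-interval trade-off in item 2 can be made concrete with a little arithmetic: a leader that dies right after a successful renewal holds the lease until it expires, so worst-case detection is bounded by the lease TTL, not the renewal interval. A small sketch (the numbers are illustrative, not proposed values):

```python
def worst_case_detection_s(lease_ttl_s: float, renew_interval_s: float) -> float:
    # A leader that crashes immediately after renewing keeps the lease until
    # the full TTL elapses, so standbys cannot detect the failure any sooner.
    assert renew_interval_s < lease_ttl_s, "renewals must outpace expiry"
    return lease_ttl_s

def worst_case_failover_s(lease_ttl_s: float, renew_interval_s: float,
                          promote_s: float) -> float:
    # Total failover time = detection + time to promote the standby.
    return worst_case_detection_s(lease_ttl_s, renew_interval_s) + promote_s

# Example: a 10s TTL renewed every 3s, plus roughly 4s to promote the standby,
# stays within the ADR's p50 <= 15s failover target.
assert worst_case_failover_s(10, 3, 4) == 14
```

Shrinking the TTL speeds up detection but raises the false-positive rate under transient KV-store or network latency, which is exactly the trade-off the ADR should pin down.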

Architectural Review

Design strengths:

  • Clear separation between control plane (Raft/Lease) and data plane (ev-node)
  • Flexible approach supporting both designs based on operational needs
  • UnsafeHead tracking ensures proper state continuity
  • Two-phase handoff (PrepareHandoff/CompleteHandoff) enables zero-downtime upgrades

Missing considerations:

  1. Observability gaps:

    • No specification for health check intervals
    • Missing metrics for handoff duration/success rates
    • No alerting thresholds defined
  2. Edge case handling:

    • What happens if both nodes fail simultaneously?
    • Recovery procedures when control plane data is corrupted
    • Network partition scenarios beyond what's briefly mentioned
  3. State synchronization: How does a standby node ensure it's caught up before promotion? The ADR mentions this briefly but lacks implementation details.

Test Coverage Assessment

Good coverage planned:

  • Kill tests for failover validation
  • Partition tests for split-brain prevention
  • Blue/green handoff testing
  • Long-duration outage drills

Missing test scenarios:

  1. Concurrent API calls: What happens if multiple StartSequencer calls happen simultaneously?
  2. Token/ticket replay attacks: Security testing for cryptographic components
  3. Performance degradation testing: How does the system behave under resource pressure?
  4. Configuration drift: Testing mismatched configurations between nodes

Technical Issues

  1. Proto field consistency: The HealthResponse message uses uint64 for l2_number while UnsafeHead uses the same type - good consistency.

  2. Missing validation: No specification of input validation (e.g., target_id format in PrepareHandoffRequest)

  3. Idempotency implementation: While idempotency keys are included, there's no specification of how they're validated or stored
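
Item 3's question about how idempotency keys are validated and stored could be answered with a replay cache. Below is a sketch of one assumed design (the retention window and return-cached-result semantics are not specified by the ADR):

```python
class IdempotencyCache:
    """Sketch of server-side idempotency-key handling (assumed semantics):
    a replayed key returns the stored result instead of re-executing the
    operation, and entries expire after a retention window."""

    def __init__(self, retention_s: float = 600.0):
        self.retention_s = retention_s
        self._seen = {}  # key -> (result, stored_at)

    def execute(self, key: str, op, now: float):
        entry = self._seen.get(key)
        if entry is not None and now - entry[1] < self.retention_s:
            return entry[0]  # duplicate retry: replay the cached result
        result = op()
        self._seen[key] = (result, now)
        return result

calls = []
cache = IdempotencyCache()
start = lambda: calls.append("started") or "ok"
assert cache.execute("k1", start, now=0.0) == "ok"
assert cache.execute("k1", start, now=1.0) == "ok"  # retried call is deduplicated
assert calls == ["started"]                          # side effect ran exactly once
```

A real implementation would also need to bound the cache size and decide whether keys are scoped per RPC method or globally.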

Recommendations

High Priority:

  1. Add cryptographic specifications for lease_token and handoff_ticket
  2. Define break-glass procedures and access controls
  3. Specify health check and lease renewal intervals
  4. Add input validation requirements

Medium Priority:

  1. Define API versioning and backward compatibility strategy
  2. Add more detailed error scenarios and recovery procedures
  3. Specify observability metrics thresholds and alerting rules
  4. Add edge case handling procedures

Low Priority:

  1. Consider adding request/response size limits
  2. Add performance benchmarking requirements to testing plan

Overall Assessment

This is a solid architectural foundation for HA sequencer failover. The two-design approach is pragmatic, and the separation of concerns is well thought out. The main gaps are in security specifications and operational details, but the core design is sound and implementable.

The ADR demonstrates good understanding of distributed systems challenges and provides a clear path forward for implementation.

@randygrok
Contributor

Just to clarify: there will be n nodes from the same "operator".

They will all be running at all times; only one runs as the aggregator, the rest do not (but they still run with the same priv node key).

If someone detects that the aggregator has a problem, another of those n nodes that is not an aggregator will be notified via the Raft network and will take over the aggregator role.

Questions

  • Is it possible today to run two nodes with the same priv key, even if one is not a validator?
  • How does a node become the aggregator? By restarting it with the aggregator flag?


### Negative
- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
Contributor

@auricom commented Aug 25, 2025

Depending on the chosen implementation, the sequencer stack may still have a single point of failure (e.g. the KV store).

Contributor Author

the failure being that the other node is not up to date with the latest state?

Contributor

@auricom commented Aug 25, 2025

I was thinking about the external kv store availability.

For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode.

Assuming the external KV store is exposed via TCP, high availability must cover:

  • DNS resolution
  • the KV store service itself
  • a load balancer in front of the KV store

If the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if the ev-node is running in HA mode.

Contributor Author

the kv store will always be local to the node; we don't support adding remote kv stores (dbs)

- None beyond existing node telemetry; no user data added.

### Testing plan
- Kill active sequencer → verify failover within SLO; assert **no double leadership**.
Contributor

For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader.

@tac0turtle
Contributor Author

  • Is it possible now to run two nodes with same priv key even if one is not validator?

yes it is; there will be a new API that enables block production

  • How a node becomes aggregator, by restarting it with the aggregator flag?

via the new API. It's missing from the ADR; I'll look at adding it to the ADR to provide more information.

Contributor

@alpe left a comment

Good start! I added some questions

bool from_unsafe_head = 1; // if false, uses safe head per policy
bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
string reason = 3; // audit string
string idempotency_key = 4; // optional, de-duplicate retries
Contributor

Can you elaborate on the usage of the field, please?

}

message StopSequencerRequest {
bytes lease_token = 1;
Contributor

🤔 Is the lease token really required in any of the processes? The conductor is managing the leadership lock

UnsafeHead unsafe = 3;
}

message CompleteHandoffRequest {
Contributor

Can there be multiple hand-off processes in flight? What happens if a hand-off does not complete? Is there a timeout to consider?

Contributor Author

the handoff paths are manually triggered, so it's a "controlled environment". Handoffs are treated as FCFS, meaning that if one is in flight, others are rejected; if a handoff is not completed within the timeout, it will fall back to another target if present. I'll expand on that in the ADR.
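
The FCFS-with-timeout semantics described in this reply can be sketched as a small state machine (illustrative only; the names and exact timeout handling are assumptions, not ev-node code):

```python
class HandoffCoordinator:
    """Sketch of FCFS handoff semantics: one handoff in flight at a time,
    later requests rejected, and an expired handoff abandoned so a fallback
    target can start a new one."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.inflight = None  # (target_id, started_at) or None

    def prepare(self, target_id: str, now: float) -> bool:
        if self.inflight and now - self.inflight[1] < self.timeout_s:
            return False                  # FCFS: reject while one is in flight
        self.inflight = (target_id, now)  # fresh handoff, or expired one replaced
        return True

    def complete(self, target_id: str, now: float) -> bool:
        if not self.inflight or self.inflight[0] != target_id:
            return False
        if now - self.inflight[1] >= self.timeout_s:
            self.inflight = None          # timed out: target must re-prepare
            return False
        self.inflight = None
        return True

h = HandoffCoordinator(timeout_s=30)
assert h.prepare("node-b", now=0)
assert not h.prepare("node-c", now=5)    # rejected while node-b's handoff is in flight
assert not h.complete("node-b", now=40)  # node-b missed the timeout
assert h.prepare("node-c", now=41)       # fallback target can now start
assert h.complete("node-c", now=45)
```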

- StopSequencer: Hard stop with optional “force” semantics.
- PrepareHandoff / CompleteHandoff: Explicit, auditable, two-phase, blue/green leadership transfer.
- Health / Status: Health probes and machine-readable node + leader state.

Contributor

We would need some "dead man's switch" for the node to refresh its "running" permission, in case the conductor dies before stopping the node. The "start" command can come with a timeout that defines the max refresh interval.
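
The dead man's switch proposed here can be modeled as a refresh deadline on the node side. A sketch under the assumed design that the start command carries a max refresh interval (names are invented for illustration):

```python
class RunningPermission:
    """Sketch of a node-side dead man's switch (assumed design): the node
    stops producing blocks on its own if the conductor stops refreshing
    the permission within the interval set by the start command."""

    def __init__(self, max_refresh_interval_s: float, now: float):
        self.max_refresh_interval_s = max_refresh_interval_s
        self.last_refresh = now

    def refresh(self, now: float):
        # Called on each heartbeat from the conductor.
        self.last_refresh = now

    def may_produce(self, now: float) -> bool:
        # Checked before producing each block.
        return now - self.last_refresh < self.max_refresh_interval_s

perm = RunningPermission(max_refresh_interval_s=10, now=0)
assert perm.may_produce(now=5)
perm.refresh(now=8)
assert perm.may_produce(now=15)      # refreshed in time
assert not perm.may_produce(now=20)  # conductor died: node self-stops
```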


message PrepareHandoffRequest {
bytes lease_token = 1;
string target_id = 2; // logical target node ID
Contributor

Can you elaborate on this field please? It is not clear to me what it should contain or how it would be used

Contributor Author

this will be the address of the node key.

}

message LeadershipTerm {
uint64 term = 1; // monotonic term/epoch for fencing
Contributor

Can you elaborate on this field? Is this a block number, or how can it be used by the starting node?

bool stopped = 1;
}

message PrepareHandoffRequest {
Contributor

The workflow is not fully clear to me. Is this sent to the current leader, the new leader, or both sequencers?

- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
- Additional components (sidecars, proxies) increase deployment surface.

Contributor

It would be helpful for me to have some sequence diagrams showing how the handover works: happy path first, then unhappy paths for the edge cases.


codecov bot commented Aug 29, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 72.30%. Comparing base (f483445) to head (5e0e181).
⚠️ Report is 13 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #2598      +/-   ##
==========================================
- Coverage   72.41%   72.30%   -0.11%     
==========================================
  Files          72       72              
  Lines        7394     7406      +12     
==========================================
+ Hits         5354     5355       +1     
- Misses       1600     1611      +11     
  Partials      440      440              
| Flag | Coverage Δ |
| --- | --- |
| combined | 72.30% <ø> (-0.11%) ⬇️ |

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
