226 changes: 226 additions & 0 deletions docs/adr/ADR-023-sequencer-recovery.md
@@ -0,0 +1,226 @@

# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover
Contributor

medium

The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock.


## Changelog

- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics.
Contributor

medium

The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition for what an "escape-hatch" entails in this context.


## Context

We need a robust, deterministic way to keep L2 block production live when the primary sequencer becomes unhealthy or unreachable, and to **recover leadership** without split‑brain or unsafe reorgs. The solution must integrate cleanly with `ev-node`, be observable, and support zero‑downtime upgrades. This ADR evaluates two designs for the **control plane** that governs which node is allowed to run the sequencer process.

## Alternative Approaches

Considered but not chosen for this iteration:

- **Many replicas, no coordination**: high risk of **simultaneous leaders** (split‑brain) and soft‑confirmation reversals.
- **Full BFT consensus among sequencers**: heavier operational/engineering cost than needed; our fault model is crash‑fault tolerance with honest operators.
- **Outsource ordering to a shared sequencer network**: viable but introduces an external dependency and different SLOs; out of scope for the immediate milestone.
- **Manual failover only**: too slow and error‑prone for production SLOs.

## Decision

> We will operate **1 active + 1 failover** sequencer at all times, regardless of control plane. Two implementation options are approved:

- **Design A — Rafted Conductor (CFT)**: A sidecar *conductor* runs next to each `ev-node`. Conductors form a **Raft** cluster to elect a single leader and **gate** sequencing so only the Raft leader may produce blocks via the Admin Control API. Applicability: use Raft only when there are **≥ 3 sequencers** (prefer odd N: 3, 5, …). Do not use Raft for two-node 1‑active/1‑failover clusters; use Design B in that case.
*Note:* OP Stack uses a very similar pattern for its sequencer; see `op-conductor` in References.

- **Design B — 1‑Active / 1‑Failover (Lease/Lock)**: One hot standby promotes itself when the active fails by acquiring a **lease/lock** (e.g., Kubernetes Lease or external KV). Strong **fencing** ensures the old leader cannot keep producing after lease loss.
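
The applicability rule for Design A (at least three sequencers, odd counts preferred) follows from Raft's majority requirement. A minimal sketch of that arithmetic in Go; the helper names `quorum` and `tolerated` are made up for illustration and are not part of `ev-node` or any Raft library:

```go
package main

import "fmt"

// quorum returns the number of voters a Raft cluster of n voters needs
// to elect a leader or commit an entry: a strict majority.
func quorum(n int) int { return n/2 + 1 }

// tolerated returns how many voters can fail while a majority remains.
func tolerated(n int) int { return n - quorum(n) }

func main() {
	for _, n := range []int{2, 3, 4, 5} {
		fmt.Printf("voters=%d quorum=%d tolerated_failures=%d\n", n, quorum(n), tolerated(n))
	}
}
```

With two voters the quorum is two, so a single failure leaves no majority; that is why the two-node case is directed to Design B. An even voter count tolerates no more failures than the next lower odd count.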

**Why both assume 1A/1F:** Even with Raft, we intentionally keep **n** nodes on hot standby capable of immediate promotion; additional nodes may exist as **read‑only** or **witness** roles to strengthen quorum without enabling extra leaders.

Status of this decision: **Proposed** for implementation and test hardening.

## Detailed Design

### User requirements
- **No split‑brain**: at most one sequencer is active.
- **Deterministic recovery**: new leader starts from a known **unsafe head**.
- **Fast failover**: p50 ≤ 15s, p95 ≤ 45s.
- **Operational clarity**: health metrics, leader identity, and explicit admin controls.
- **Zero‑downtime upgrades**: blue/green leadership transfer.

### Systems affected
- `ev-node` (sequencer control hooks, health surface).
- New sidecar(s): **conductor** (Design A) or **lease‑manager** (Design B).
- RPC ingress (optional **leader‑aware proxy** to route sequencing endpoints only to the leader).
- CI/CD & SRE runbooks, dashboards, alerts.

### New/changed data structures
- **UnsafeHead** record persisted by control plane: `(block_height, block_hash, timestamp)`.
- **Design A (Raft)**: replicated **Raft log** entries for `UnsafeHead`, `LeadershipTerm`, and optional `CommitMeta` (batch/DA pointers); periodic snapshots.
- **Design B (Lease)**: a single **Lease** record (Kubernetes Lease or external KV entry) plus a monotonic **lease token** for fencing.
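
To illustrate how a monotonic lease token can fence a deposed leader, here is a minimal sketch. It assumes the token can be compared as an unsigned integer (the protobuf field is opaque `bytes`, so the real encoding may differ), and the `Fence` type is hypothetical:

```go
package main

import (
	"errors"
	"fmt"
)

// ErrStaleToken is returned when a caller presents a token older than
// the newest one this node has already accepted.
var ErrStaleToken = errors.New("stale lease token")

// Fence remembers the highest lease token seen so far.
// Hypothetical type for illustration; not an ev-node API.
type Fence struct {
	highest uint64
}

// Admit accepts a token that is at least as new as the highest seen
// and rejects anything older, so a deposed leader cannot act again.
func (f *Fence) Admit(token uint64) error {
	if token < f.highest {
		return fmt.Errorf("%w: got %d, highest seen %d", ErrStaleToken, token, f.highest)
	}
	f.highest = token
	return nil
}

func main() {
	var f Fence
	fmt.Println(f.Admit(7)) // first holder
	fmt.Println(f.Admit(8)) // lease changed hands
	fmt.Println(f.Admit(7)) // old holder retries
}
```

The check only works if every change of holder yields a strictly larger token, which is the property the "monotonic" requirement above is meant to guarantee.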

### Admin Control API (Protobuf)

We introduce a separate, authenticated Admin Control API dedicated to sequencing control. This API is not exposed on the public RPC endpoint and binds to a distinct listener (port/interface, e.g., `:8443` on an internal network or loopback-only in single-host deployments). It is used exclusively by the conductor/lease-manager and by privileged operator automation for break-glass procedures.

Service overview:
- StartSequencer: Arms/starts sequencing subject to fencing (valid lease/term) and optionally pins to last persisted UnsafeHead.
- StopSequencer: Hard stop with optional “force” semantics.
- PrepareHandoff / CompleteHandoff: Explicit, auditable, two-phase, blue/green leadership transfer.
- Health / Status: Health probes and machine-readable node + leader state.

Contributor

We would need some "dead man switch" for the node to refresh the "running" permission, in case the conductor dies before stopping the node. The "start" command can come with a timeout that defines the range for the max refresh interval.

Endpoint separation:
- Public JSON-RPC and P2P endpoints remain unchanged.
- Admin Control API is out-of-band and must not be routed through public ingress. It sits behind mTLS and strict network policy.

The protobuf file is located in `proto/evnode/admin/v1/control.proto`.


Error semantics:
- PERMISSION_DENIED: AuthN/AuthZ failure, missing or invalid mTLS identity.
- FAILED_PRECONDITION: Missing/expired lease or fencing violation; handoff ticket invalid.
- ABORTED: Lost leadership mid-flight; TOCTOU fencing triggered self-stop.
- ALREADY_EXISTS: Start requested but sequencer already active with same term.
- UNAVAILABLE: Local dependencies not ready (DA client, exec engine).
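
As a rough illustration of how a `StartSequencer` handler might choose among these codes, consider the sketch below. The `StartCheck` struct, the `classifyStart` function, and the order of the checks are all assumptions; a real server would return values from the `google.golang.org/grpc/codes` package rather than the local `Code` strings used here to keep the example dependency-free. `ABORTED` is left out because it applies to leadership lost after the call has started.

```go
package main

import "fmt"

// Code mirrors the status code names listed above.
type Code string

const (
	OK                 Code = "OK"
	PermissionDenied   Code = "PERMISSION_DENIED"
	FailedPrecondition Code = "FAILED_PRECONDITION"
	AlreadyExists      Code = "ALREADY_EXISTS"
	Unavailable        Code = "UNAVAILABLE"
)

// StartCheck is a hypothetical snapshot of what the node knows when a
// StartSequencer call arrives.
type StartCheck struct {
	Authorized    bool   // mTLS identity verified and allowed by RBAC
	LeaseValid    bool   // lease token present, unexpired, not fenced
	DepsReady     bool   // DA client and execution engine ready
	Active        bool   // sequencer already running
	ActiveTerm    uint64 // term the running sequencer was started with
	RequestedTerm uint64 // term carried by this request
}

// classifyStart picks the status code for a StartSequencer call.
// The check order is an assumption of this sketch.
func classifyStart(c StartCheck) Code {
	switch {
	case !c.Authorized:
		return PermissionDenied
	case !c.LeaseValid:
		return FailedPrecondition
	case c.Active && c.ActiveTerm == c.RequestedTerm:
		return AlreadyExists
	case !c.DepsReady:
		return Unavailable
	default:
		return OK
	}
}

func main() {
	fmt.Println(classifyStart(StartCheck{Authorized: true, LeaseValid: false}))
}
```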

### Efficiency considerations
- **Design A:** Raft heartbeats and snapshotting add small steady‑state overhead; no impact on throughput when healthy.
- **Design B:** Lease renewals are lightweight; performance dominated by `ev-node` itself.

### Expected access patterns
- Reads (RPC, state) should work on all nodes; **writes/sequence endpoints** only on the active leader. If a leader‑aware proxy is deployed, it enforces this automatically.
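
The routing rule a leader-aware proxy would enforce can be sketched in a few lines. The function name, its parameters, and the round-robin choice for reads are assumptions made for illustration only:

```go
package main

import (
	"errors"
	"fmt"
)

// pickBackend chooses a node for a request. Sequencing (write) calls go
// only to the leader; reads may be served by any node.
// seq is a request counter used for simple round-robin on reads.
func pickBackend(isWrite bool, leader string, nodes []string, seq uint64) (string, error) {
	if isWrite {
		if leader == "" {
			return "", errors.New("no leader known; refusing to route sequencing call")
		}
		return leader, nil
	}
	if len(nodes) == 0 {
		return "", errors.New("no nodes available")
	}
	return nodes[seq%uint64(len(nodes))], nil
}

func main() {
	nodes := []string{"node-a", "node-b", "node-c"}
	w, _ := pickBackend(true, "node-b", nodes, 0)
	r, _ := pickBackend(false, "node-b", nodes, 5)
	fmt.Println("write ->", w, "read ->", r)
}
```

Refusing writes when no leader is known, rather than guessing, keeps the proxy from routing sequencing calls to a standby during an election.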

### Logging/Monitoring/Observability
- Metrics: `leader_id`, `raft_term` (A), `lease_owner` (B), `unsafe_head_advance`, `peer_count`, `rpc_error_rate`, `da_publish_latency`, `backlog`, `leader_election_epoch`, `leader_election_leader_last_seen_ts`, `leader_election_heartbeat_timeout_total`, `leader_election_leader_uptime_ms`.
- Alerts: no unsafe advance > 3× block time; unexpected leader churn; lease lost but sequencer still active (fencing breach).
- Logs: audit all **Start/Stop** decisions and override operations.
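
Two of the alert conditions above reduce to simple predicates. The sketch below expresses them in Go using unix seconds, matching the `timestamp` unit of `UnsafeHead`; the function names are hypothetical, and in practice these rules would more likely live in the alerting system than in node code:

```go
package main

import "fmt"

// unsafeHeadStalled reports whether the unsafe head has gone more than
// 3x the block time without advancing. All values are in seconds;
// lastAdvance and now are unix timestamps.
func unsafeHeadStalled(lastAdvance, now, blockTime int64) bool {
	return now-lastAdvance > 3*blockTime
}

// fencingBreach reports the "lease lost but sequencer still active" case.
func fencingBreach(holdsLease, sequencerActive bool) bool {
	return sequencerActive && !holdsLease
}

func main() {
	fmt.Println(unsafeHeadStalled(100, 103, 1)) // exactly 3x block time
	fmt.Println(unsafeHeadStalled(100, 104, 1)) // beyond 3x block time
	fmt.Println(fencingBreach(false, true))
}
```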

## Diagrams

This section illustrates the nominal handoff, crash handover, and node join flows. Diagrams use Mermaid for clarity.

### Planned Leadership Handoff (Prepare → Complete)

```mermaid
sequenceDiagram
autonumber
participant Op as Operator/Automation
participant L as Leader Node (A)
participant CA as Conductor A
participant F as Target Node (B)
participant CB as Conductor B

Op->>CA: PrepareHandoff(lease_token, target_id=B)
CA->>L: Quiesce sequencing, persist UnsafeHead
L-->>CA: Ack ready, return UnsafeHead, term
CA-->>Op: handoff_ticket(term, UnsafeHead, target=B)

note over L,F: Ticket binds term + UnsafeHead + target_id

Op->>CB: Deliver handoff_ticket to target (B)
CB->>F: CompleteHandoff(handoff_ticket)
CB->>F: StartSequencer(from_unsafe_head=true, lease_token')
F-->>CB: activated=true, term, unsafe
CA->>L: StopSequencer(force=false)
```

Key properties:
- Ticket is audience-bound (target_id) and term-bound; replay-safe.
- New leader must resume from the provided `UnsafeHead` to ensure continuity.
- Old leader performs orderly stop after the new leader activates.
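
The three ticket properties (audience-bound, term-bound, replay-safe) can be sketched as checks on the receiving node. The `Ticket` and `Redeemer` types are hypothetical: the real `handoff_ticket` is opaque bytes and would presumably be signed, and this sketch does not model `idempotency_key` retries.

```go
package main

import (
	"errors"
	"fmt"
)

// Ticket is a hypothetical decoded form of the opaque handoff_ticket.
type Ticket struct {
	Term         uint64 // term the ticket was issued in
	TargetID     string // only this node may redeem it
	UnsafeHeight uint64 // height the new leader must resume from
}

// Redeemer tracks redeemed tickets so each is accepted at most once.
type Redeemer struct {
	NodeID      string
	CurrentTerm uint64
	used        map[Ticket]bool
}

// Redeem enforces the audience, term, and replay checks. On success it
// returns the height to resume from.
func (r *Redeemer) Redeem(t Ticket) (uint64, error) {
	switch {
	case t.TargetID != r.NodeID:
		return 0, errors.New("ticket issued for a different target")
	case t.Term != r.CurrentTerm:
		return 0, errors.New("ticket term does not match current term")
	case r.used[t]:
		return 0, errors.New("ticket already redeemed")
	}
	if r.used == nil {
		r.used = make(map[Ticket]bool)
	}
	r.used[t] = true
	return t.UnsafeHeight, nil
}

func main() {
	r := &Redeemer{NodeID: "B", CurrentTerm: 5}
	t := Ticket{Term: 5, TargetID: "B", UnsafeHeight: 1200}
	h, err := r.Redeem(t)
	fmt.Println(h, err)
	h, err = r.Redeem(t) // replay of the same ticket
	fmt.Println(h, err)
}
```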

### Crash Handover (Leader loss)

```mermaid
sequenceDiagram
autonumber
participant A as Old Leader (A)
participant CP as Control Plane (Raft/Lease)
participant B as Candidate Node (B)

A-x CP: Heartbeats/lease renewals stop
CP->>CP: Term++ (Raft) or Lease expires
B->>CP: Campaign / Acquire Lease
CP-->>B: Leadership granted (term/epoch), mint token
B->>B: Eligibility gate checks (sync, DA/exec ready)
alt Behind or cannot advance
B-->>CP: Decline leadership, remain follower
else Eligible
B->>B: StartSequencer(from_unsafe_head=true, lease_token)
B-->>CP: Becomes active leader for new term
end
```

Notes:
- If no candidate passes eligibility, control plane keeps searching or alerts; no split-brain occurs.
- `UnsafeHead` continuity is enforced by token/ticket claims or persisted state.
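
For the lease path (Design B), the "lease expires, candidate acquires, token is minted" steps might look like the following in-memory sketch. `LeaseRecord` and `TryAcquire` are hypothetical names; a real lease store would have to perform this update as an atomic compare-and-swap, which the sketch does not attempt.

```go
package main

import "fmt"

// LeaseRecord is a hypothetical single-record lease store.
// Times are unix seconds; Token increases on every new grant.
type LeaseRecord struct {
	Holder    string
	ExpiresAt int64
	Token     uint64
}

// TryAcquire grants the lease to candidate if it is free or expired,
// and renews it if candidate already holds a live lease. A new grant
// mints a fresh token; a renewal keeps the current one.
func (l *LeaseRecord) TryAcquire(candidate string, now, ttl int64) (uint64, bool) {
	held := l.Holder != "" && now < l.ExpiresAt
	switch {
	case held && l.Holder != candidate:
		return 0, false // someone else still holds a live lease
	case !held:
		l.Token++ // new grant: mint a fresh fencing token
		l.Holder = candidate
	}
	l.ExpiresAt = now + ttl
	return l.Token, true
}

func main() {
	var l LeaseRecord
	tok, ok := l.TryAcquire("A", 100, 10)
	fmt.Println("A acquires:", tok, ok)
	tok, ok = l.TryAcquire("B", 105, 10)
	fmt.Println("B while A is live:", tok, ok)
	tok, ok = l.TryAcquire("B", 120, 10)
	fmt.Println("B after expiry:", tok, ok)
}
```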

### Joining Node Flow (Follower by default)

```mermaid
flowchart LR
J[Node joins cluster] --> D[Discover term via Raft/Lease; fetch UnsafeHead]
D --> G{Within lag threshold and\nDA/exec readiness met?}
G -- No --> F[Remain follower; replicate state; no sequencing]
F --> O[Observe term; health; catch up]
G -- Yes --> E[Eligible for promotion]
E --> H[Receive handoff_ticket or acquire lease]
H --> S["StartSequencer(from_unsafe_head=true)"]
```

Eligibility gate (No-Advance = No-Leader):
- Must be within configurable lag threshold (height/time) relative to `UnsafeHead` or cluster head.
- DA client reachable and healthy; execution engine synced and ready.
- Local error budget acceptable (no recent critical faults).
- If any check fails, node remains a follower and is not allowed to assume leadership.
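
The gate can be read as a short chain of checks in which the first failure keeps the node a follower. A minimal sketch, assuming a height-based lag threshold only (the time-based variant is omitted) and hypothetical names throughout:

```go
package main

import "fmt"

// Candidate is a hypothetical view of a node's readiness.
type Candidate struct {
	Height         uint64 // local head height
	DAHealthy      bool   // DA client reachable and healthy
	ExecReady      bool   // execution engine synced and ready
	RecentCritical int    // critical faults inside the error-budget window
}

// eligible applies the "No-Advance = No-Leader" gate and returns the
// first failed check as a reason, or "" when the node may lead.
func eligible(c Candidate, unsafeHeight, maxLag uint64, faultBudget int) (bool, string) {
	switch {
	case c.Height+maxLag < unsafeHeight:
		return false, "behind UnsafeHead by more than the lag threshold"
	case !c.DAHealthy:
		return false, "DA client unhealthy"
	case !c.ExecReady:
		return false, "execution engine not ready"
	case c.RecentCritical > faultBudget:
		return false, "error budget exceeded"
	}
	return true, ""
}

func main() {
	c := Candidate{Height: 990, DAHealthy: true, ExecReady: true}
	ok, reason := eligible(c, 1000, 5, 0)
	fmt.Println(ok, reason)
}
```

Returning the reason alongside the verdict makes a declined promotion easy to log and alert on.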


### Security considerations
- Lock down **Admin RPC** with mTLS + RBAC; only the sidecar/process account may call Start/Stop.
- Implement **fencing**: leader periodically validates it still holds leadership/lease; otherwise self‑stops.
- Break‑glass overrides must be gated behind separate credentials and produce auditable events.
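
The self-stop half of fencing might be implemented as a periodic check like the one below. The `Lease` and `Sequencer` types and the `FenceTick` name are hypothetical, and comparing the expiry against the local clock assumes clock skew between nodes is small relative to the lease duration.

```go
package main

import "fmt"

// Lease is a hypothetical local view of the leadership lease.
// Times are unix seconds.
type Lease struct {
	Holder    string
	ExpiresAt int64
}

// Sequencer models only the running flag that fencing controls.
type Sequencer struct {
	NodeID string
	Active bool
}

// FenceTick is meant to run on a short interval. It stops the sequencer
// as soon as the node no longer holds an unexpired lease and reports
// whether it had to stop it.
func (s *Sequencer) FenceTick(l Lease, now int64) (stopped bool) {
	holds := l.Holder == s.NodeID && now < l.ExpiresAt
	if s.Active && !holds {
		s.Active = false
		return true
	}
	return false
}

func main() {
	s := &Sequencer{NodeID: "A", Active: true}
	lease := Lease{Holder: "A", ExpiresAt: 110}

	stopped := s.FenceTick(lease, 105) // lease still valid
	fmt.Println(stopped, s.Active)

	stopped = s.FenceTick(lease, 110) // lease expired without renewal
	fmt.Println(stopped, s.Active)
}
```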

### Privacy considerations
- None beyond existing node telemetry; no user data added.

### Testing plan
- Kill active sequencer → verify failover within SLO; assert **no double leadership**.
Contributor

For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader.

- Partition tests: only Raft majority (A) or lease holder (B) may produce.
- Blue/green: explicit leadership handoff; confirm unsafe head continuity.
- Misconfigured standby → failover should **refuse**; alarms fire.
- Long‑duration outage drills; confirm user‑facing status and catch‑up behavior.
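
A failover test harness could check measured failover times against the SLO from the user requirements (p50 ≤ 15s, p95 ≤ 45s) along these lines. The nearest-rank percentile method and the function names are choices made for this sketch, not something the ADR prescribes:

```go
package main

import (
	"fmt"
	"sort"
)

// percentile returns the nearest-rank p-th percentile (1..100) of
// samples, or 0 for an empty slice. The input is not modified.
func percentile(samples []int, p int) int {
	if len(samples) == 0 {
		return 0
	}
	s := append([]int(nil), samples...)
	sort.Ints(s)
	rank := (p*len(s) + 99) / 100 // integer ceil(p/100 * n)
	if rank < 1 {
		rank = 1
	}
	if rank > len(s) {
		rank = len(s)
	}
	return s[rank-1]
}

// meetsFailoverSLO checks p50 <= 15s and p95 <= 45s over the measured
// failover durations, given in whole seconds.
func meetsFailoverSLO(failoverSeconds []int) bool {
	return len(failoverSeconds) > 0 &&
		percentile(failoverSeconds, 50) <= 15 &&
		percentile(failoverSeconds, 95) <= 45
}

func main() {
	runs := []int{8, 10, 12, 14, 9, 11, 13, 40, 7, 15}
	fmt.Println(percentile(runs, 50), percentile(runs, 95), meetsFailoverSLO(runs))
}
```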

### Change breakdown
- Phase 1: Implement Admin RPC + health surface in `ev-node`; add sidecar skeletons.
- Phase 2: Integrate Design A (Raft) in a 1 sequencer + 2 failover configuration; build dashboards/runbooks.
- Phase 3: Add Design B (Lease) profile for small/test clusters; share common health logic.
- Phase 4: Game days and SLO validation; finalize SRE playbooks.

### Release/compatibility
- **Breaking release?** No — Admin RPCs are additive.

## Status

Proposed

## Consequences

### Positive
- Clear, deterministic leadership with fencing; supports zero‑downtime upgrades.
- Works with `ev-node` via a small, well‑defined Admin RPC.
- Choice of control plane allows right‑sizing ops: Raft for prod; Lease for small/test.

### Negative
- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
Contributor

@auricom Aug 25, 2025

Depending on the chosen implementation, the sequencer stack may still have a single point of failure (e.g., the KV store).

Contributor Author

Is the failure that the other node is not up to date with the latest state?

Contributor

@auricom Aug 25, 2025

I was thinking about the external kv store availability.

For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode.

Assuming the external KV store is exposed via TCP, high availability must cover:

  • DNS resolution
  • the KV store service itself
  • a load balancer in front of the KV store

If the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if the ev-node is running in HA mode.

Contributor Author

The KV store will always be local to the node; we don't support adding remote KV stores (DBs).

- Additional components (sidecars, proxies) increase deployment surface.

Contributor

It would be helpful to have some sequence diagrams that show how the handover works: the happy path first, then the unhappy paths for the edge cases.

### Neutral
- Small steady‑state CPU/network overhead for heartbeats/leases; negligible compared to sequencing and DA posting.

## References

- **OP conductor** (industry prior art; similar to Design A):
  - Docs: https://docs.optimism.io/operators/chain-operators/tools/op-conductor
  - README: https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/README.md

- **`ev-node`** (architecture, sequencing):
  - Repo: https://github.com/evstack/ev-node
  - Quick start: https://ev.xyz/guides/quick-start
  - Discussions/issues on sequencing API & multi-sequencer behavior.

- **Lease-based leader election**:
  - Kubernetes Lease API: https://kubernetes.io/docs/concepts/architecture/leases/
  - client-go leader election helpers: https://pkg.go.dev/k8s.io/client-go/tools/leaderelection
104 changes: 104 additions & 0 deletions proto/evnode/admin/v1/control.proto
@@ -0,0 +1,104 @@
syntax = "proto3";

package evnode.admin.v1;

option go_package = "github.com/evstack/ev-node/types/pb/evnode/admin/v1;adminv1";

// ControlService governs sequencer lifecycle and health surfaces.
// All operations must be authenticated via mTLS and authorized via RBAC.
service ControlService {
  // StartSequencer starts sequencing if and only if the caller holds leadership/fencing.
  rpc StartSequencer(StartSequencerRequest) returns (StartSequencerResponse);

  // StopSequencer stops sequencing. If force=true, cancels in-flight loops ASAP.
  rpc StopSequencer(StopSequencerRequest) returns (StopSequencerResponse);

  // PrepareHandoff transitions current leader to a safe ready-to-yield state
  // and issues a handoff ticket bound to the current term/unsafe head.
  rpc PrepareHandoff(PrepareHandoffRequest) returns (PrepareHandoffResponse);

  // CompleteHandoff is called by the target node to atomically assume leadership
  // using the handoff ticket. Enforces fencing and continuity from UnsafeHead.
  rpc CompleteHandoff(CompleteHandoffRequest) returns (CompleteHandoffResponse);

  // Health returns node-local liveness and recent errors.
  rpc Health(HealthRequest) returns (HealthResponse);

  // Status returns leader/term, active/standby, and build info.
  rpc Status(StatusRequest) returns (StatusResponse);
}

// UnsafeHead identifies the unsafe head persisted by the control plane.
// A new leader resumes sequencing from it.
message UnsafeHead {
  uint64 block_height = 1;
  bytes block_hash = 2; // 32 bytes
  int64 timestamp = 3; // unix seconds
}

// LeadershipTerm carries the monotonic term/epoch used for fencing and the
// ID of the leader for that term.
message LeadershipTerm {
  uint64 term = 1; // monotonic term/epoch for fencing, indicates the current term
  string leader_id = 2; // conductor/node ID
}

// StartSequencerRequest asks the node to start sequencing. The caller must
// present a lease token issued by the control plane.
message StartSequencerRequest {
  bool from_unsafe_head = 1; // if false, uses safe head per policy
  bytes lease_token = 2; // opaque, issued by control plane (Raft/Lease)
  string reason = 3; // audit string
  string requester = 4; // principal for audit
}

// StartSequencerResponse reports whether sequencing was activated, together
// with the term and unsafe head in effect.
message StartSequencerResponse {
  bool activated = 1;
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

// StopSequencerRequest asks the node to stop sequencing. If force is true,
// in-flight loops are cancelled as soon as possible.
message StopSequencerRequest {
  bytes lease_token = 1;
  bool force = 2;
  string reason = 3;
  string requester = 4;
}

// StopSequencerResponse reports whether the sequencer stopped.
message StopSequencerResponse {
  bool stopped = 1;
}

// PrepareHandoffRequest asks the current leader to move to a ready-to-yield
// state and issue a handoff ticket for target_id.
message PrepareHandoffRequest {
  bytes lease_token = 1;
  string target_id = 2; // logical target node ID
  string reason = 3;
  string requester = 4;
}

// PrepareHandoffResponse returns the handoff ticket and the term and unsafe
// head it is bound to.
message PrepareHandoffResponse {
  bytes handoff_ticket = 1; // opaque, bound to term+unsafe head
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

// CompleteHandoffRequest is sent by the target node to assume leadership
// using a handoff ticket.
message CompleteHandoffRequest {
  bytes handoff_ticket = 1;
  string requester = 2;
  string idempotency_key = 3;
}

Contributor

Can there be multiple hand-off processes in flight? What happens if the hand-off does not complete? Is there a timeout to consider?

Contributor Author

The handoff paths are manually triggered, so it is a "controlled environment". Handoffs are treated as FCFS: while one is in flight, others are rejected, and if the handoff is not completed within the timeout, it falls back to another if present. I'll expand on that in the ADR.

// CompleteHandoffResponse reports whether the target node activated, with
// the resulting term and unsafe head.
message CompleteHandoffResponse {
  bool activated = 1;
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

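// HealthRequest is the empty request for Health.
message HealthRequest {}

// HealthResponse reports node-local liveness and the most recent error.
message HealthResponse {
  bool healthy = 1;
  uint64 block_height = 2;
  bytes block_hash = 3;
  uint64 peer_count = 4;
  uint64 da_height = 5;
  string last_err = 6;
}

// StatusRequest is the empty request for Status.
message StatusRequest {}

// StatusResponse reports whether the sequencer is active, the current
// leader/term, and build info.
message StatusResponse {
  bool sequencer_active = 1;
  string build_version = 2;
  string leader_hint = 3; // optional, human-readable
  string last_err = 4;
  LeadershipTerm term = 5;
}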