ADR: HA failover #2598

# ADR 023: Sequencer Recovery & Liveness — Rafted Conductor vs 1‑Active/1‑Failover

## Changelog

- 2025-08-21: Initial ADR authored; compared approaches and captured failover and escape‑hatch semantics.

> **Review comment:** The term "escape-hatch" is mentioned here in the changelog but isn't defined or used elsewhere in the ADR. To improve clarity, consider replacing it with a term that is described in the document, such as "break-glass overrides" (mentioned in the Security Considerations section), or adding a definition of what an "escape-hatch" entails in this context.

## Context

We need a robust, deterministic way to keep L2 block production live when the primary sequencer becomes unhealthy or unreachable, and to **recover leadership** without split‑brain or unsafe reorgs. The solution must integrate cleanly with `ev-node`, be observable, and support zero‑downtime upgrades. This ADR evaluates two designs for the **control plane** that governs which node is allowed to run the sequencer process.

## Alternative Approaches

Considered but not chosen for this iteration:

- **Many replicas, no coordination**: high risk of **simultaneous leaders** (split‑brain) and soft‑confirmation reversals.
- **Full BFT consensus among sequencers**: heavier operational/engineering cost than needed; our fault model is crash‑fault tolerance with honest operators.
- **Outsource ordering to a shared sequencer network**: viable, but introduces an external dependency and different SLOs; out of scope for the immediate milestone.
- **Manual failover only**: too slow and error‑prone for production SLOs.

## Decision

> We will operate **1 active + 1 failover** sequencer at all times, regardless of control plane. Two implementation options are approved:

- **Design A — Rafted Conductor (CFT)**: A sidecar *conductor* runs next to each `ev-node`. Conductors form a **Raft** cluster to elect a single leader and **gate** sequencing so that only the Raft leader may produce blocks via the Admin Control API. Applicability: use Raft only when there are **≥ 3 sequencers** (prefer odd N: 3, 5, …). Do not use Raft for two-node 1‑active/1‑failover clusters; use Design B in that case.
  *Note:* OP Stack uses a very similar pattern for its sequencer; see `op-conductor` in References.

- **Design B — 1‑Active / 1‑Failover (Lease/Lock)**: One hot standby promotes itself when the active fails by acquiring a **lease/lock** (e.g., a Kubernetes Lease or an external KV entry). Strong **fencing** ensures the old leader cannot keep producing after losing the lease.

**Why both assume 1A/1F:** Even with Raft, we intentionally keep at least one node on hot standby capable of immediate promotion; additional nodes may exist in **read‑only** or **witness** roles to strengthen quorum without enabling extra leaders.

Status of this decision: **Proposed** for implementation and test hardening.

## Detailed Design

### User requirements

- **No split‑brain**: at most one sequencer is active.
- **Deterministic recovery**: the new leader starts from a known **unsafe head**.
- **Fast failover**: p50 ≤ 15 s, p95 ≤ 45 s.
- **Operational clarity**: health metrics, leader identity, and explicit admin controls.
- **Zero‑downtime upgrades**: blue/green leadership transfer.

### Systems affected

- `ev-node` (sequencer control hooks, health surface).
- New sidecar(s): **conductor** (Design A) or **lease‑manager** (Design B).
- RPC ingress (optional **leader‑aware proxy** that routes sequencing endpoints only to the leader).
- CI/CD & SRE runbooks, dashboards, alerts.

### New/changed data structures

- **UnsafeHead** record persisted by the control plane: `(block_height, block_hash, timestamp)`.
- **Design A (Raft)**: replicated **Raft log** entries for `UnsafeHead`, `LeadershipTerm`, and optional `CommitMeta` (batch/DA pointers); periodic snapshots.
- **Design B (Lease)**: a single **Lease** record (Kubernetes Lease or external KV entry) plus a monotonic **lease token** for fencing.

### Admin Control API (Protobuf)

We introduce a separate, authenticated Admin Control API dedicated to sequencing control. This API is not exposed on the public RPC endpoint and binds to a distinct listener (port/interface, e.g., `:8443` on an internal network, or loopback-only in single-host deployments). It is used exclusively by the conductor/lease-manager and by privileged operator automation for break-glass procedures.

Service overview:

- StartSequencer: arms/starts sequencing subject to fencing (a valid lease/term), optionally pinned to the last persisted UnsafeHead.
- StopSequencer: hard stop, with optional "force" semantics.
- PrepareHandoff / CompleteHandoff: explicit, auditable, two-phase, blue/green leadership transfer.
- Health / Status: health probes and machine-readable node and leader state.

> **Review comment:** We would need some "dead man's switch" for the node to refresh its "running" permission, in case the conductor dies before stopping the node. The "start" command can carry a timeout that defines the maximum refresh interval.

Endpoint separation:

- Public JSON-RPC and P2P endpoints remain unchanged.
- The Admin Control API is out-of-band and must not be routed through public ingress. It sits behind mTLS and strict network policy.

The protobuf file is located at `proto/evnode/admin/v1/control.proto`.

Error semantics:

- PERMISSION_DENIED: authN/authZ failure; missing or invalid mTLS identity.
- FAILED_PRECONDITION: missing or expired lease, fencing violation, or invalid handoff ticket.
- ABORTED: leadership lost mid-flight; TOCTOU fencing triggered a self-stop.
- ALREADY_EXISTS: start requested but the sequencer is already active with the same term.
- UNAVAILABLE: local dependencies not ready (DA client, execution engine).

### Efficiency considerations

- **Design A:** Raft heartbeats and snapshotting add a small steady‑state overhead; no impact on throughput when healthy.
- **Design B:** lease renewals are lightweight; performance is dominated by `ev-node` itself.

### Expected access patterns

- Reads (RPC, state) should work on all nodes; **writes/sequencing endpoints** only on the active leader. If a leader‑aware proxy is deployed, it enforces this automatically.

### Logging/Monitoring/Observability

- Metrics: `leader_id`, `raft_term` (A), `lease_owner` (B), `unsafe_head_advance`, `peer_count`, `rpc_error_rate`, `da_publish_latency`, `backlog`, `leader_election_epoch`, `leader_election_leader_last_seen_ts`, `leader_election_heartbeat_timeout_total`, `leader_election_leader_uptime_ms`.
- Alerts: no unsafe-head advance for more than 3× block time; unexpected leader churn; lease lost but sequencer still active (fencing breach).
- Logs: audit all **Start/Stop** decisions and override operations.

## Diagrams

This section illustrates the nominal handoff, crash handover, and node-join flows. Diagrams use Mermaid for clarity.

### Planned Leadership Handoff (Prepare → Complete)

```mermaid
sequenceDiagram
    autonumber
    participant Op as Operator/Automation
    participant L as Leader Node (A)
    participant CA as Conductor A
    participant F as Target Node (B)
    participant CB as Conductor B

    Op->>CA: PrepareHandoff(lease_token, target_id=B)
    CA->>L: Quiesce sequencing, persist UnsafeHead
    L-->>CA: Ack ready, return UnsafeHead, term
    CA-->>Op: handoff_ticket(term, UnsafeHead, target=B)

    note over L,F: Ticket binds term + UnsafeHead + target_id

    Op->>CB: Deliver handoff_ticket to target (B)
    CB->>F: CompleteHandoff(handoff_ticket)
    CB->>F: StartSequencer(from_unsafe_head=true, lease_token')
    F-->>CB: activated=true, term, unsafe
    CA->>L: StopSequencer(force=false)
```

Key properties:

- The ticket is audience-bound (target_id) and term-bound; replay-safe.
- The new leader must resume from the provided `UnsafeHead` to ensure continuity.
- The old leader performs an orderly stop after the new leader activates.

### Crash Handover (Leader loss)

```mermaid
sequenceDiagram
    autonumber
    participant A as Old Leader (A)
    participant CP as Control Plane (Raft/Lease)
    participant B as Candidate Node (B)

    A-x CP: Heartbeats/lease renewals stop
    CP->>CP: Term++ (Raft) or lease expires
    B->>CP: Campaign / acquire lease
    CP-->>B: Leadership granted (term/epoch), mint token
    B->>B: Eligibility gate checks (sync, DA/exec ready)
    alt Behind or cannot advance
        B-->>CP: Decline leadership, remain follower
    else Eligible
        B->>B: StartSequencer(from_unsafe_head=true, lease_token)
        B-->>CP: Becomes active leader for new term
    end
```

Notes:

- If no candidate passes eligibility, the control plane keeps searching or alerts; no split-brain occurs.
- `UnsafeHead` continuity is enforced by token/ticket claims or persisted state.

### Joining Node Flow (Follower by default)

```mermaid
flowchart LR
    J[Node joins cluster] --> D[Discover term via Raft/Lease; fetch UnsafeHead]
    D --> G{Within lag threshold and\nDA/exec readiness met?}
    G -- No --> F[Remain follower; replicate state; no sequencing]
    F --> O[Observe term; health; catch up]
    G -- Yes --> E[Eligible for promotion]
    E --> H[Receive handoff_ticket or acquire lease]
    H --> S["StartSequencer(from_unsafe_head=true)"]
```

Eligibility gate (No-Advance = No-Leader):

- Must be within a configurable lag threshold (height/time) relative to `UnsafeHead` or the cluster head.
- DA client reachable and healthy; execution engine synced and ready.
- Local error budget acceptable (no recent critical faults).
- If any check fails, the node remains a follower and is not allowed to assume leadership.
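
The gate above can be sketched as a single predicate. The struct fields and thresholds are assumptions for illustration, not actual `ev-node` configuration:

```go
// Minimal sketch of the eligibility gate ("No-Advance = No-Leader"):
// a candidate may assume leadership only if every check passes.
package main

import "fmt"

type CandidateState struct {
	Height         uint64 // candidate's local block height
	ClusterHead    uint64 // UnsafeHead / cluster head height
	MaxLag         uint64 // configurable lag threshold (blocks)
	DAHealthy      bool   // DA client reachable and healthy
	ExecReady      bool   // execution engine synced and ready
	RecentCritical bool   // recent critical faults (error budget blown)
}

// Eligible returns true only if every gate check passes; otherwise the
// node must remain a follower.
func Eligible(s CandidateState) bool {
	if s.ClusterHead > s.Height && s.ClusterHead-s.Height > s.MaxLag {
		return false // too far behind the cluster head
	}
	return s.DAHealthy && s.ExecReady && !s.RecentCritical
}

func main() {
	ok := Eligible(CandidateState{Height: 1000, ClusterHead: 1002, MaxLag: 5, DAHealthy: true, ExecReady: true})
	fmt.Println("near-head candidate eligible:", ok) // true
	lag := Eligible(CandidateState{Height: 900, ClusterHead: 1002, MaxLag: 5, DAHealthy: true, ExecReady: true})
	fmt.Println("lagging candidate eligible:", lag) // false
}
```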

### Security considerations

- Lock down the **Admin RPC** with mTLS + RBAC; only the sidecar/process account may call Start/Stop.
- Implement **fencing**: the leader periodically validates that it still holds leadership/the lease; otherwise it self‑stops.
- Break‑glass overrides must be gated behind separate credentials and produce auditable events.

### Privacy considerations

- None beyond existing node telemetry; no user data added.

### Testing plan

- Kill the active sequencer → verify failover within SLO; assert **no double leadership**.

> **Review comment:** For Design A, we should also kill the conductor on the active sequencer so that the other conductors experience a timeout from the conductor leader.

- Partition tests: only the Raft majority (A) or the lease holder (B) may produce.
- Blue/green: explicit leadership handoff; confirm unsafe-head continuity.
- Misconfigured standby → failover should **refuse**; alarms fire.
- Long‑duration outage drills; confirm user‑facing status and catch‑up behavior.

### Change breakdown

- Phase 1: implement the Admin RPC + health surface in `ev-node`; add sidecar skeletons.
- Phase 2: integrate Design A (Raft) in a 1‑sequencer + 2‑failover topology; build dashboards/runbooks.
- Phase 3: add a Design B (Lease) profile for small/test clusters; share common health logic.
- Phase 4: game days and SLO validation; finalize SRE playbooks.

### Release/compatibility

- **Breaking release?** No — Admin RPCs are additive.

## Status

Proposed

## Consequences

### Positive

- Clear, deterministic leadership with fencing; supports zero‑downtime upgrades.
- Works with `ev-node` via a small, well‑defined Admin RPC.
- Choice of control plane allows right‑sizing operations: Raft for production; Lease for small/test clusters.

### Negative

- Design A adds Raft operational overhead (quorum management, snapshots).
- Design B has a smaller blast radius but does not generalize to N replicas; stricter reliance on correct fencing.
- Additional components (sidecars, proxies) increase the deployment surface.

> **Review comment:** Depending on the chosen implementation, the sequencer stack may still possess a single point of failure (e.g., the KV store).
>
> **Reply:** The failure being that the other node is not up to date with the latest state?
>
> **Reply:** I was thinking about the external KV store's availability. For testing, a local file is fine, but for production (devnet, testnet, mainnet) a chain in HA mode with an external KV store can be just as fault-vulnerable as one running in standard mode. Assuming the external KV store is exposed via TCP, high availability must cover: if the operator fails to provide proper HA for any of these components, the sequencer stack still has a single point of failure and is not truly HA, even if `ev-node` is running in HA mode.
>
> **Reply:** The KV store will always be local to the node; we don't support adding remote KV stores (DBs).

> **Review comment:** It would be helpful to have sequence diagrams that show how the handover flow works: happy path first, then unhappy paths for the edge cases.

### Neutral

- Small steady‑state CPU/network overhead for heartbeats/leases; negligible compared to sequencing and DA posting.

## References

- **OP conductor** (industry prior art; similar to Design A):
  - Docs: https://docs.optimism.io/operators/chain-operators/tools/op-conductor
  - README: https://github.com/ethereum-optimism/optimism/blob/develop/op-conductor/README.md

- **`ev-node`** (architecture, sequencing):
  - Repo: https://github.com/evstack/ev-node
  - Quick start: https://ev.xyz/guides/quick-start
  - Discussions/issues on the sequencing API and multi-sequencer behavior.

- **Lease-based leader election**:
  - Kubernetes Lease API: https://kubernetes.io/docs/concepts/architecture/leases/
  - client-go leader election helpers: https://pkg.go.dev/k8s.io/client-go/tools/leaderelection

---

`proto/evnode/admin/v1/control.proto`:

```protobuf
syntax = "proto3";

package evnode.admin.v1;

option go_package = "github.com/evstack/ev-node/types/pb/evnode/admin/v1;adminv1";

// ControlService governs the sequencer lifecycle and health surfaces.
// All operations must be authenticated via mTLS and authorized via RBAC.
service ControlService {
  // StartSequencer starts sequencing if and only if the caller holds leadership/fencing.
  rpc StartSequencer(StartSequencerRequest) returns (StartSequencerResponse);

  // StopSequencer stops sequencing. If force=true, cancels in-flight loops ASAP.
  rpc StopSequencer(StopSequencerRequest) returns (StopSequencerResponse);

  // PrepareHandoff transitions the current leader to a safe, ready-to-yield state
  // and issues a handoff ticket bound to the current term/unsafe head.
  rpc PrepareHandoff(PrepareHandoffRequest) returns (PrepareHandoffResponse);

  // CompleteHandoff is called by the target node to atomically assume leadership
  // using the handoff ticket. Enforces fencing and continuity from UnsafeHead.
  rpc CompleteHandoff(CompleteHandoffRequest) returns (CompleteHandoffResponse);

  // Health returns node-local liveness and recent errors.
  rpc Health(HealthRequest) returns (HealthResponse);

  // Status returns leader/term, active/standby, and build info.
  rpc Status(StatusRequest) returns (StatusResponse);
}

message UnsafeHead {
  uint64 block_height = 1;
  bytes block_hash = 2;  // 32 bytes
  int64 timestamp = 3;   // unix seconds
}

message LeadershipTerm {
  uint64 term = 1;       // monotonic term/epoch for fencing
  string leader_id = 2;  // conductor/node ID
}

message StartSequencerRequest {
  bool from_unsafe_head = 1;  // if false, uses the safe head per policy
  bytes lease_token = 2;      // opaque, issued by the control plane (Raft/Lease)
  string reason = 3;          // audit string
  string requester = 4;       // principal for audit
}
message StartSequencerResponse {
  bool activated = 1;
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

message StopSequencerRequest {
  bytes lease_token = 1;
  bool force = 2;
  string reason = 3;
  string requester = 4;
}
message StopSequencerResponse {
  bool stopped = 1;
}

message PrepareHandoffRequest {
  bytes lease_token = 1;
  string target_id = 2;  // logical target node ID
  string reason = 3;
  string requester = 4;
}
message PrepareHandoffResponse {
  bytes handoff_ticket = 1;  // opaque, bound to term + unsafe head
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

message CompleteHandoffRequest {
  bytes handoff_ticket = 1;
  string requester = 2;
  string idempotency_key = 3;
}
message CompleteHandoffResponse {
  bool activated = 1;
  LeadershipTerm term = 2;
  UnsafeHead unsafe = 3;
}

message HealthRequest {}
message HealthResponse {
  bool healthy = 1;
  uint64 block_height = 2;
  bytes block_hash = 3;
  uint64 peer_count = 4;
  uint64 da_height = 5;
  string last_err = 6;
}

message StatusRequest {}
message StatusResponse {
  bool sequencer_active = 1;
  string build_version = 2;
  string leader_hint = 3;  // optional, human-readable
  string last_err = 4;
  LeadershipTerm term = 5;
}
```

> **Review comment (on `CompleteHandoffRequest`):** Can there be multiple handoff processes in flight? What happens if the handoff does not complete? Is there a timeout to consider?
>
> **Reply:** The handoff paths are manually triggered, so it is a "controlled environment". It will be treated as FCFS, meaning that if one is in flight, others are rejected; if the handoff is not completed within the timeout, it will fall back to another if present. I'll expand on that in the ADR.

> **Review comment (on the title):** The current title "Rafted Conductor vs 1‑Active/1‑Failover" could be slightly confusing, because the document explains that both proposed designs (Rafted Conductor and Lease/Lock) implement a "1-Active/1-Failover" strategy. To improve clarity, consider retitling to focus on the two mechanisms being compared, for example: "Sequencer Recovery & Liveness: Rafted Conductor vs. Lease/Lock".