Describe the feature
Motivation
I run reth as an RPC node behind K8s, and I need a readiness signal that's separate from liveness: not "is the process up" but "is this node caught up enough to return correct results".
The case that bites: a node that just restarted (or is still catching up) has its HTTP server up and answers requests fine, but it's behind the head. If it stays in the service rotation, clients get stale reads, e.g., eth_getTransactionByHash returns null for a tx the rest of the network already has. The usual solution is a readiness probe that pulls the node out of rotation until it catches up, without restarting it.
Right now Reth exposes the JSON-RPC API and the Prometheus metrics endpoint, but there's nothing a probe can hit for a plain 200/503 readiness answer. So everyone ends up running a small sidecar that calls eth_getBlockByNumber("latest"), compares the block timestamp to wall clock, and maps that to a status code. It works, but it's the same glue rewritten by every operator, and metrics aren't really the right thing to gate traffic on.
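For reference, the sidecar glue usually boils down to something like the sketch below. It is written in Rust with the standard library only. The function names are made up for illustration, and it assumes the sidecar has already fetched the block and pulled out the `timestamp` field, which JSON-RPC returns as a hex quantity string such as `"0x3e8"`.

```rust
/// Parses a JSON-RPC hex quantity such as "0x3e8" into a u64.
/// Returns None if the "0x" prefix is missing or the digits are invalid.
fn parse_quantity(hex: &str) -> Option<u64> {
    let digits = hex.strip_prefix("0x")?;
    u64::from_str_radix(digits, 16).ok()
}

/// Maps the age of the latest block to an HTTP status code.
/// `timestamp_hex` is the block's `timestamp` field, `now_unix_secs` is the
/// wall clock in Unix seconds, `max_age_secs` is the operator's threshold.
fn sidecar_status(timestamp_hex: &str, now_unix_secs: u64, max_age_secs: u64) -> u16 {
    match parse_quantity(timestamp_hex) {
        // saturating_sub: a block stamped slightly ahead of the local clock
        // counts as age 0 instead of underflowing.
        Some(ts) if now_unix_secs.saturating_sub(ts) <= max_age_secs => 200,
        // Too old, or the timestamp could not be parsed at all.
        _ => 503,
    }
}
```

A real sidecar wraps this in an HTTP client and a tiny HTTP server, and that wrapper is the part every operator ends up rewriting.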
Additional context
Proposal
An optional, unauthenticated HTTP endpoint (/health or /livez + /readyz) that returns 200 when ready and 503 otherwise, with a short body saying which check failed.
The part I'd want to get right is keeping the freshness threshold operator-configurable rather than hardcoded, since "how stale is too stale" depends on the chain's block time (~12s on Ethereum L1, ~2s on OP Stack chains, and so on). Either approach works:
- header-driven, Erigon-style: the caller passes thresholds per request (max_seconds_behind, min_peer_count, ...)
- flag-driven: thresholds set once at startup
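To illustrate the header-driven variant, here is a rough sketch of turning per-request check strings into thresholds. The `Thresholds` struct and the function names are hypothetical. The value format (check name immediately followed by an integer, e.g. `max_seconds_behind60`) is an assumption for this sketch, and the actual wire format would be part of the design discussion.

```rust
/// Thresholds for the readiness checks. None means the check is disabled.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct Thresholds {
    max_seconds_behind: Option<u64>,
    min_peer_count: Option<u64>,
}

/// Splits a value like "max_seconds_behind60" into (check name, number).
/// Returns None for unknown check names or a missing/invalid number.
fn parse_check(value: &str) -> Option<(&'static str, u64)> {
    for name in ["max_seconds_behind", "min_peer_count"] {
        if let Some(rest) = value.trim().strip_prefix(name) {
            return rest.parse::<u64>().ok().map(|n| (name, n));
        }
    }
    None
}

/// Builds thresholds from the check strings sent with one request.
fn thresholds_from_headers(header_values: &[&str]) -> Thresholds {
    let mut out = Thresholds::default();
    for value in header_values {
        match parse_check(value) {
            Some(("max_seconds_behind", n)) => out.max_seconds_behind = Some(n),
            Some(("min_peer_count", n)) => out.min_peer_count = Some(n),
            // Unknown or malformed checks are silently ignored in this sketch;
            // a real endpoint might prefer to reject them.
            _ => {}
        }
    }
    out
}
```

The flag-driven variant would build the same struct once from CLI arguments at startup and reuse it for every probe.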
Checks I'd find useful (all optional):
- max_seconds_behind: age of the latest block timestamp vs now. This is the main freshness signal. One thing to get right: read the timestamp via the normal eth_getBlockByNumber("latest") path, not an internal/snapshot source, to avoid the 0-timestamp bug Erigon hit (see below).
- min_peer_count
- a trivial process-liveness check for /livez
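A minimal sketch of how these checks could combine into a status code plus a short body naming the failed checks. All type and function names here are hypothetical and not existing reth APIs. The node state is passed in as plain values so the logic stays independent of how reth would actually source it.

```rust
/// Snapshot of what the checks need. In reth the timestamp would come from
/// the same path that serves eth_getBlockByNumber("latest").
struct NodeState {
    latest_block_timestamp: u64, // Unix seconds
    peer_count: u64,
}

/// Every check is optional: None disables it.
struct Thresholds {
    max_seconds_behind: Option<u64>,
    min_peer_count: Option<u64>,
}

/// Returns (HTTP status, body). The body lists each failed check on its own line.
fn readyz(state: &NodeState, thresholds: &Thresholds, now_unix_secs: u64) -> (u16, String) {
    let mut failed: Vec<String> = Vec::new();

    if let Some(max) = thresholds.max_seconds_behind {
        if state.latest_block_timestamp == 0 {
            // Report a zero timestamp explicitly rather than as a huge age.
            failed.push("max_seconds_behind: latest block timestamp is 0".to_string());
        } else {
            let age = now_unix_secs.saturating_sub(state.latest_block_timestamp);
            if age > max {
                failed.push(format!("max_seconds_behind: {}s > {}s", age, max));
            }
        }
    }

    if let Some(min) = thresholds.min_peer_count {
        if state.peer_count < min {
            failed.push(format!("min_peer_count: {} < {}", state.peer_count, min));
        }
    }

    if failed.is_empty() {
        (200, "ok".to_string())
    } else {
        (503, failed.join("\n"))
    }
}

/// Liveness only says the process can answer at all.
fn livez() -> u16 {
    200
}
```

With no thresholds configured, `readyz` always returns 200, which matches the "all optional" intent: nothing gates traffic unless the operator asks for it.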
Prior art
Erigon does this with X-ERIGON-HEALTHCHECK headers (max_seconds_behind, min_peer_count, synced, check_block). It also has a known bug, erigontech/erigon#9357, where the check read the timestamp from an internal source and got 0, which is why I mention reading it from the normal path above.
ChainSafe Forest added /livez and /readyz with a ?verbose flag that lists each check: ChainSafe/forest#3949.
Where it should live
I think this belongs in core reth's RPC layer as a generic, chain-agnostic feature, so downstreams on the Reth SDK (op-reth included) pick it up through the normal port process instead of carrying a separate patch. Any downstream-specific wiring, like exposing an enable flag in op-reth's CLI, can be added there at port time; the core only gets written once.
One limitation to call out: for rollups the EL can't know on its own how far it is behind the sequencer's unsafe head, since that lives in the consensus layer (op-node's optimism_syncStatus). So this endpoint can cover block-age and peer-count freshness, but a full "caught up to the sequencer" check is out of scope here.
Scope
- generic and chain-agnostic in core reth, no chain-specific freshness constants
- thresholds operator-configurable, not opinionated defaults that gate traffic
- no consensus-layer / sequencer-head state in core reth
- rollup-specific cross-checks (e.g. optimism_syncStatus) stay downstream
Questions
Is there appetite for this in core Reth, or do you consider readiness gating an operator-side (sidecar) concern?
If there's interest: single /health with header-driven checks, or /livez + /readyz with startup-configured thresholds?
Where in the RPC layer should it sit, and should it be behind its own flag or port (e.g., --http.health)?
Happy to implement it if there's interest and a rough agreement on the shape.