Binary file added docs/assets/emerald-syncing.png
110 changes: 110 additions & 0 deletions docs/specs/Syncing.md
@@ -0,0 +1,110 @@
# Syncing

## Implementation Overview

### Reth Sync Overview

When a Reth node falls behind other Reth nodes while the consensus client is not advancing, Reth continues to receive new blocks through the P2P networking layer (`crates/net/`). Other peers announce new blocks via `NewBlockHashes` and `NewBlock` messages, which Reth can then download and validate locally.

Reth waits for a command from the consensus client, issued through Engine API method calls, before advancing the canonical chain. This ensures that the execution layer stays synchronized with the values decided by the consensus layer (CL).

### Malachite Sync Overview

[Documentation](https://github.com/informalsystems/malachite/tree/main/specs/synchronization)

When a node is behind on the consensus layer, Malachite triggers the sync protocol.

The peer node receives an `AppMsg::GetDecidedValue` message in the host app (Emerald).

The receiving node processes the message and returns a response, which the requesting node then handles via `AppMsg::ProcessSyncedValue`.
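
The two messages can be pictured as a request/reply pair handled by the host app. The following is a minimal sketch, not Malachite's actual API: the `AppMsg` enum here is a simplified local stand-in with only the two sync-related variants, and `RawDecidedValue`, `handle_sync_msg`, the `lookup` and `validate` callbacks, and the `bool` reply are hypothetical names and simplifications introduced for illustration.

```rust
use std::sync::mpsc::Sender;

/// Hypothetical, simplified form of what the serving node returns.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct RawDecidedValue {
    pub value_bytes: Vec<u8>,
    pub certificate: Vec<u8>,
}

/// Simplified stand-in for the sync-related messages sent to the host app.
/// The real enum has more variants and richer types.
pub enum AppMsg {
    /// Received by the serving node: a peer asks for the value decided at `height`.
    GetDecidedValue {
        height: u64,
        reply: Sender<Option<RawDecidedValue>>,
    },
    /// Received by the requesting node: a peer's response must be processed.
    ProcessSyncedValue {
        height: u64,
        round: u32,
        proposer: String,
        value_bytes: Vec<u8>,
        /// Simplification: reports only whether the value was accepted.
        reply: Sender<bool>,
    },
}

/// Routes a sync message to the matching host-side callback and sends the reply.
pub fn handle_sync_msg(
    msg: AppMsg,
    lookup: impl Fn(u64) -> Option<RawDecidedValue>,
    validate: impl Fn(&[u8]) -> bool,
) {
    match msg {
        AppMsg::GetDecidedValue { height, reply } => {
            // A dropped receiver means the requester went away; ignore the error.
            let _ = reply.send(lookup(height));
        }
        AppMsg::ProcessSyncedValue { value_bytes, reply, .. } => {
            let _ = reply.send(validate(&value_bytes));
        }
    }
}
```

Passing the storage lookup and the validation step as closures keeps the sketch independent of any concrete storage or Engine API client.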

### Emerald Sync Overview

<img src="../assets/emerald-syncing.png" width="800" />

## Sync Request Handling

The sync request contains the height, while the expected response includes the `value_bytes` and the commit `certificate`.

When the middleware (Emerald) receives the `AppMsg::GetDecidedValue` message, it processes it as follows:

1. Retrieve the earliest height from storage.
   This represents the earliest height for which the node can provide a payload (block). See the _Minimal Node Height_ section.
2. Validate the requested height range:
   - If the requested height is above the node’s current height or below the earliest available height, return `None`.
   - Otherwise, continue.
3. Retrieve the earliest unpruned height from storage.
   This is the earliest height for which the full block is available locally (no need to query Reth).
4. Fetch block data:
   - If the requested height is above the earliest unpruned height, return the decided value directly from storage.
   - Otherwise, fetch the missing block data from Reth using the Engine API method `engine_getPayloadBodiesByRange`.

To support this logic, block headers and certificates are stored for all blocks a node can provide to peers. This is necessary because `engine_getPayloadBodiesByRange` returns only payload bodies (transactions and withdrawals), not the header metadata or consensus data required for full block reconstruction.
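
The height checks above can be sketched as a small pure function. This is an illustration under stated assumptions, not Emerald's code: `StoreHeights`, `ValueSource`, and `locate_decided_value` are hypothetical names, and the sketch assumes that the earliest unpruned height itself is served from local storage, which follows from its definition but is not spelled out in the steps.

```rust
/// Where a requested decided value can be served from (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ValueSource {
    /// The full block is available in Emerald's own storage.
    LocalStorage,
    /// Header and certificate are local; the payload body must come from Reth
    /// via `engine_getPayloadBodiesByRange`.
    RethPayloadBody,
}

/// Hypothetical snapshot of the heights the handler reads from storage.
pub struct StoreHeights {
    /// Earliest height for which the node can provide a payload at all.
    pub earliest: u64,
    /// Earliest height for which the full block is stored locally.
    pub earliest_unpruned: u64,
    /// Latest height decided by this node.
    pub current: u64,
}

/// Returns `None` when the height is outside the servable range,
/// otherwise the place the block data has to be read from.
pub fn locate_decided_value(requested: u64, heights: &StoreHeights) -> Option<ValueSource> {
    // Step 2: reject heights the node cannot serve.
    if requested > heights.current || requested < heights.earliest {
        return None;
    }
    // Step 4: choose between local storage and Reth.
    // Assumption: the boundary height counts as locally available.
    if requested >= heights.earliest_unpruned {
        Some(ValueSource::LocalStorage)
    } else {
        Some(ValueSource::RethPayloadBody)
    }
}
```

For example, with `earliest = 1`, `earliest_unpruned = 96`, and `current = 100`, a request for height 98 is served from storage, a request for height 50 needs the payload body from Reth, and a request for height 101 yields `None`.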

## Sync Response Handling

Upon receiving a response from a peer, we get the `height`, `round`, `proposer`, and `value_bytes`.

The response is processed as follows:

1. Reconstruct and validate the block using the Engine API method `engine_newPayload`.
   This validation ensures that the provided value is consistent with the execution layer’s rules before passing it back to Malachite.
2. Handle the validation response:
   - If the execution client returns a `SYNCING` status, the node retries validation.
   - The retry mechanism resends the validation request until the execution layer returns either `VALID` or `INVALID`.
   - After each `SYNCING` response, the node waits for a configurable sleep delay before retrying.

   The retry was added to ensure proper sync in scenarios where both the consensus and execution layers are recovering from a crash.

3. Return the reconstructed proposal to Malachite once validation succeeds.
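
The retry behaviour in step 2 can be sketched as a bounded loop. This is a minimal illustration, not the actual implementation: `validate_with_retry`, `ValidationOutcome`, and the parameter names are hypothetical, the `new_payload` closure stands in for the real `engine_newPayload` call, and only the three statuses named in this document are modelled. The overall timeout is an assumption about how a configurable syncing timeout could bound the loop.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

/// The three payload statuses discussed in this document.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PayloadStatus {
    Valid,
    Invalid,
    Syncing,
}

/// Result of the retry loop (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub enum ValidationOutcome {
    Valid,
    Invalid,
    /// The execution client was still syncing when the timeout expired.
    TimedOut,
}

/// Calls `new_payload` until it stops reporting `Syncing` or the timeout
/// would be exceeded, sleeping `retry_delay` between attempts.
pub fn validate_with_retry(
    mut new_payload: impl FnMut() -> PayloadStatus,
    retry_delay: Duration,
    timeout: Duration,
) -> ValidationOutcome {
    let deadline = Instant::now() + timeout;
    loop {
        match new_payload() {
            PayloadStatus::Valid => return ValidationOutcome::Valid,
            PayloadStatus::Invalid => return ValidationOutcome::Invalid,
            PayloadStatus::Syncing => {
                // Give up if the next attempt would start after the deadline.
                if Instant::now() + retry_delay > deadline {
                    return ValidationOutcome::TimedOut;
                }
                sleep(retry_delay);
            }
        }
    }
}
```

Checking the deadline before sleeping keeps the loop from overshooting the timeout by a full retry delay.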

> Note:
> In the current Malachite implementation, there is no timeout during validation of syncing values.
> A configurable syncing timeout has been introduced as part of the `MalakethConfig` to address this.
>

## Minimal Node Height

The minimal node height corresponds to the certificate with the lowest available height, since certificates are stored for all supported values.

Currently, there is no direct way to determine the oldest height supported by the execution client (Reth).

Depending on the node type, behavior differs:

- Archival Node - Reth stores all blocks from genesis. The minimal height can directly reflect this.
- Full Node - Reth stores approximately the last 10,064 blocks, pruning older ones. Logic could be added so that Malachite prunes in parallel with Reth.
- Custom Node - If Reth uses a custom pruning policy (defined in `reth.toml`), the middleware would need to either:
  - Follow the same custom pruning rules, or
  - Restrict itself to providing only data available locally.
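
If the middleware were to mirror Reth's pruning, the oldest height Reth can still serve could be estimated from the node type and the current tip. The sketch below is purely illustrative: `RethPruning`, `earliest_reth_height`, and the `retained_blocks` field are hypothetical, the constant reuses the approximate full-node figure quoted above, and a real custom policy in `reth.toml` is more fine-grained than a single block count.

```rust
/// Hypothetical description of how the paired Reth node prunes blocks.
pub enum RethPruning {
    /// All blocks from genesis are kept.
    Archival,
    /// Roughly the most recent `FULL_NODE_RETAINED_BLOCKS` blocks are kept.
    Full,
    /// Simplification: a custom policy reduced to a single retained-block count.
    Custom { retained_blocks: u64 },
}

/// Approximate number of recent blocks a full node keeps (figure from this document).
pub const FULL_NODE_RETAINED_BLOCKS: u64 = 10_064;

/// Estimates the oldest height Reth can still serve when its tip is at `tip`.
pub fn earliest_reth_height(mode: &RethPruning, tip: u64) -> u64 {
    let retained = match mode {
        RethPruning::Archival => return 0,
        RethPruning::Full => FULL_NODE_RETAINED_BLOCKS,
        RethPruning::Custom { retained_blocks } => *retained_blocks,
    };
    // Keeping `retained` blocks ending at `tip` means the oldest one is
    // `tip - (retained - 1)`, clamped at genesis for short chains.
    tip.saturating_sub(retained.saturating_sub(1))
}
```

The middleware could then avoid advertising any height below this estimate, which corresponds to the "restrict itself to locally available data" option.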

> Note:
> Emerald currently supports only archival Reth nodes.

Review comment (Contributor): This is no longer true.

Suggested change for the note above:

> In order for a node to be able to sync, there has to be at least one archival node in the network that can provide historical data. We plan to add snapshot syncing to remove this constraint.
>

## Emerald Storage Overview

In order to provide a response to the `AppMsg::GetDecidedValue` message for height `h`, a node requires both the appropriate certificate and the corresponding payload. Payload bodies can be retrieved using the Engine API method `engine_getPayloadBodiesByRange`, which returns the transactions and withdrawals contained within a payload but does not include the remaining metadata. The `decided_block_headers` were therefore added to storage to support the syncing protocol.

The `block_header` type is based on `ExecutionPayloadV3`, but with transactions and withdrawals set to empty vectors (code [ref](https://github.com/informalsystems/malaketh-layered-private/blob/main/app/src/state.rs#L524C1-L543C2)).
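
The split between a stored header and a body fetched later can be sketched as follows. This is an illustration under simplifying assumptions: `Payload` models only a few fields, whereas the real `ExecutionPayloadV3` carries many more, and `PayloadBody`, `to_block_header`, and `reconstruct` are hypothetical names rather than the functions in the referenced code.

```rust
/// Heavily reduced stand-in for an execution payload (assumption: only
/// a few representative fields are modelled).
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Payload {
    pub block_number: u64,
    pub block_hash: [u8; 32],
    pub transactions: Vec<Vec<u8>>,
    pub withdrawals: Vec<Vec<u8>>,
}

/// The part returned by `engine_getPayloadBodiesByRange`: transactions and withdrawals.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct PayloadBody {
    pub transactions: Vec<Vec<u8>>,
    pub withdrawals: Vec<Vec<u8>>,
}

/// Builds the value kept in storage: the payload with its body emptied out.
pub fn to_block_header(payload: &Payload) -> Payload {
    Payload {
        block_number: payload.block_number,
        block_hash: payload.block_hash,
        transactions: Vec::new(),
        withdrawals: Vec::new(),
    }
}

/// Rebuilds the full payload from a stored header and a body fetched from Reth.
pub fn reconstruct(header: Payload, body: PayloadBody) -> Payload {
    Payload {
        transactions: body.transactions,
        withdrawals: body.withdrawals,
        ..header
    }
}
```

Because the header reuses the payload type, reconstruction only has to fill in the two emptied vectors.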

An alternative approach would have been to use the `eth_getBlockByNumber` method and store `block_number` instead. However, `engine_getPayloadBodiesByRange` was chosen because it was designed specifically for syncing purposes and allows for future batching optimizations.

## Example Flow

Consider a scenario where the entire node crashes and falls behind. In this case, Reth will detect from its peers that it is lagging, and Malachite will also trigger its syncing protocol through status exchanges.

On the Malachite side, data needs to be retrieved from its host (in our setup, Emerald + Reth) to provide information to peers. When we receive the `AppMsg::GetDecidedValue` message, several situations are possible:

1. Data is available locally in Emerald - this applies only to the last few heights (5).
2. Metadata is available, but the full decided value is missing - we need to query Reth for the missing data.
3. No data is available at all.

Suppose metadata is available, but the payload bodies for the corresponding block heights must be retrieved from Reth. In this case, the decided value is reconstructed and returned to Malachite, which then forwards it to the syncing peer.

When the peer receives the decided value, it must validate it via the `engine_newPayload` API call.
If Reth is still syncing and does not yet have the required data for validation, the call will return `PayloadStatus::SYNCING`.

Review comment (Contributor): Please add references to the config flags (their names and default values).

In that case, Malachite will retry until the operation either succeeds or times out. Once Reth returns `VALID` or `INVALID`, the peer can proceed accordingly.

A similar flow occurs when a new node joins the network. Suppose emr0 runs with reth0 and emr1 runs with reth1, and a new node pair, emr2 and reth2, wants to join.

For example, emr1 (Emerald 1) can provide a decided value to emr2 (Emerald 2), while reth2 queries its peers to retrieve the blocks it is missing. If emr2 reaches the `engine_newPayload` call before reth2 has fully synchronized, it must wait until reth2 completes syncing.