
Range sync stuck due to insufficient peers on data column subnets when triggered #6895

Open
jimmygchen opened this issue Jan 31, 2025 · 3 comments
Labels: das (Data Availability Sampling), fulu (Required for the upcoming Fulu hard fork), peerdas-devnet-4, syncing

@jimmygchen (Member)

Issue

When a head chain sync starts, if there aren't any peers on the data column subnets, it does not progress and gets stuck until some external event triggers it to resume (e.g. a new peer being added).

This scenario is easily reproducible with PeerDAS, because we don't immediately know the peer's custody_group_count until we get a metadata response back from them.

The sequence of events I observed:

  1. Connected to multiple peers with advanced sync state, so we start a new head chain sync (log New chain added to sync, New head chain started syncing)
  2. Immediately afterwards, sync thinks there aren't any peers on its custody data column subnets, because we haven't obtained their metadata yet and don't know their custody counts, so no batch gets sent (log Waiting for peers to be available on custody column subnets).
  3. Later, we obtain their metadata (log Obtained peer's metadata), but this doesn't re-trigger range sync, so we're stuck until finalized sync kicks in.

A few possible solutions (not mutually exclusive):

  1. Handle this in range request decoupling properly (Improve range sync with PeerDAS #6258)
  2. Compute peer_info.custody_subnets when a peer is connected, using the minimum custody requirement, since every peer must serve at least the minimum required column count - this way we're likely to have some peers on data column subnets even before obtaining metadata (see the sketch after this list).
  3. Update sync when we obtain a peer's metadata, and trigger a resume.
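
For illustration, a rough sketch of option 2 with simplified stand-in types (the constant and struct below are placeholders, not Lighthouse's actual data structures; the minimum value mirrors the spec's CUSTODY_REQUIREMENT but should come from the chain spec in practice):

    // Sketch only: assume the spec-minimum custody requirement for a peer
    // until its metadata (custody_group_count) is known.
    const MIN_CUSTODY_REQUIREMENT: u64 = 4; // illustrative; read from the chain spec in practice

    struct PeerInfo {
        /// `None` until we receive the peer's metadata response.
        custody_group_count: Option<u64>,
    }

    impl PeerInfo {
        /// Effective custody group count used by sync: fall back to the minimum,
        /// so the peer still counts towards some column subnets before metadata arrives.
        fn effective_custody_group_count(&self) -> u64 {
            self.custody_group_count.unwrap_or(MIN_CUSTODY_REQUIREMENT)
        }
    }

    fn main() {
        let peer = PeerInfo { custody_group_count: None };
        assert_eq!(peer.effective_custody_group_count(), MIN_CUSTODY_REQUIREMENT);
    }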

Additional Info

Range sync currently relies on the check syncing_chain.good_peers_on_sampling_subnets before requesting batches from peers. The three call sites are:

  1. if !self.good_peers_on_sampling_subnets(epoch, network) {

  2. } else if !self.good_peers_on_sampling_subnets(self.processing_target, network) {
         // This is to handle the case where no batch was sent for the current processing
         // target when there is no sampling peers available. This is a valid state and should not
         // return an error.
         return Ok(KeepChain);

  3. if !self.good_peers_on_sampling_subnets(self.to_be_downloaded, network) {
         debug!(
             self.log,
             "Waiting for peers to be available on custody column subnets"
         );
         return None;
     }

This is a workaround to avoid sending out excessive block requests, because block and data column requests are currently coupled. In the case where we request a batch and there are no peers on the required column subnets, the blocks-by-range request is sent but the data-columns-by-range request isn't, and it fails with RpcRequestSendError::NoCustodyPeers. This triggers a retry, and the node ends up sending excessive blocks-by-range requests to peers without making progress. The longer-term solution is to decouple the ByRange requests (#6258).
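
As an aside, the intent of the check above can be illustrated with a small self-contained sketch (simplified types, not the actual Lighthouse implementation): a batch is only requested once every sampling column subnet has at least one usable peer.

    // Simplified illustration of the guard's intent; not Lighthouse code.
    use std::collections::{HashMap, HashSet};

    type SubnetId = u64;
    type PeerId = u64;

    struct PeerPool {
        /// Peers known to serve each data column subnet.
        peers_per_subnet: HashMap<SubnetId, HashSet<PeerId>>,
    }

    impl PeerPool {
        /// True only if every sampling subnet has at least one peer.
        fn good_peers_on_sampling_subnets(&self, sampling_subnets: &[SubnetId]) -> bool {
            sampling_subnets.iter().all(|subnet| {
                self.peers_per_subnet
                    .get(subnet)
                    .map_or(false, |peers| !peers.is_empty())
            })
        }
    }

    fn main() {
        let pool = PeerPool { peers_per_subnet: HashMap::new() };
        // No peers on any subnet: the guard holds the batch back, which is the
        // "Waiting for peers to be available on custody column subnets" state.
        assert!(!pool.good_peers_on_sampling_subnets(&[0, 1, 2]));
    }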

@dapplion (Collaborator)

What if we change the AddPeer event to

    AddPeer(PeerId, SyncInfo, CustodyColumns),

So the peer manager waits until it knows the peer's columns before emitting it. Then sync doesn't need to rely on the network globals to compute who is custodying what.
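
Something along these lines, with placeholder types (the shape below is only a sketch of the idea, not the existing SyncMessage definition):

    // Placeholder types sketching the suggested event shape. The peer manager
    // would emit AddPeer only once both the status and metadata responses are
    // known, so the custody columns come with the event.
    type PeerId = u64;
    type ColumnIndex = u64;

    struct SyncInfo {
        head_slot: u64,
        finalized_epoch: u64,
    }

    enum SyncMessage {
        /// Peer is ready for sync: status known and custody columns resolved.
        AddPeer(PeerId, SyncInfo, Vec<ColumnIndex>),
    }

    fn main() {
        let msg = SyncMessage::AddPeer(
            1,
            SyncInfo { head_slot: 100, finalized_epoch: 2 },
            vec![0, 7, 42, 99], // custody columns derived from the peer's metadata
        );
        let SyncMessage::AddPeer(peer, info, columns) = msg;
        println!(
            "peer {peer} (head slot {}, finalized epoch {}) custodies {} columns",
            info.head_slot,
            info.finalized_epoch,
            columns.len()
        );
    }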

@jimmygchen (Member, Author)

Currently AddPeer is triggered on a peer status response, and then the event handler updates the peer's sync status & sync state. If we wait for the metadata response, it delays these from happening - do you see any issues with this?

Or is it worth adding another event for the metadata response, and just re-triggering column requests the same way you're re-triggering by-root requests here:

    // Try to make progress on custody requests that are waiting for peers
    for (id, result) in self.network.continue_custody_by_root_requests() {
        self.on_custody_by_root_result(id, result);
    }
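
A by-range analogue could look roughly like the sketch below (the struct and method names here are hypothetical, just to show the flow, not existing Lighthouse APIs):

    // Hypothetical sketch: on a metadata response, recompute the peer's custody
    // columns and nudge range sync to retry chains that were waiting on custody
    // column peers. Names are made up for illustration.
    struct SyncManager {
        stalled_chains: Vec<u64>, // ids of chains waiting on custody column peers
    }

    impl SyncManager {
        fn on_peer_metadata(&mut self, peer_id: u64) {
            // custody_group_count is now known, so the peer's custody columns can
            // be computed (omitted) and stalled chains re-evaluated.
            for chain_id in self.stalled_chains.drain(..) {
                println!("resuming chain {chain_id} after metadata from peer {peer_id}");
            }
        }
    }

    fn main() {
        let mut sync = SyncManager { stalled_chains: vec![7] };
        sync.on_peer_metadata(1);
    }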

Can we also start with the minimal custody requirement before we get the metadata response back?

This seems to be the main issue with syncing a new node on devnet-4 right now, and would be good to fix ASAP.

@dapplion (Collaborator)

> If we wait for metadata response, it delays these from happening - do you see any issues with this?

Hmm not really, RPC requests complete very fast and I don't see how this can be a bottleneck
