
Range sync stuck due to insufficient peers on data column subnets when triggered #6895

Open
jimmygchen opened this issue Jan 31, 2025 · 3 comments
Labels: das (Data Availability Sampling), fulu (Required for the upcoming Fulu hard fork), peerdas-devnet-4, syncing

@jimmygchen (Member)

Issue

When a head chain sync starts, if there aren't any peers on the data column subnets, it does not progress and gets stuck until some external event triggers it to resume (e.g. a new peer being added).

This scenario is easily reproducible with PeerDAS, because we don't immediately know the peer's custody_group_count until we get a metadata response back from them.

The sequence of events I observed:

  1. Connected to multiple peers with advanced sync state, so we start a new head chain sync (log New chain added to sync, New head chain started syncing)
  2. Immediately afterwards, sync thinks there aren't any peers on its custody data column subnets, because we haven't obtained their metadata yet and don't know their custody counts, so no batch gets sent (log Waiting for peers to be available on custody column subnets).
  3. Later, we obtain their metadata (log Obtained peer's metadata), but this doesn't re-trigger range sync, so we're stuck until finalized sync kicks in.

A few possible solutions (not mutually exclusive):

  1. Handle this in range request decoupling properly (Improve range sync with PeerDAS #6258)
  2. Compute peer_info.custody_subnets when a peer is connected, using the minimum custody requirement, since every peer must serve at least the minimum required column count - this way we're likely to have some peers on data column subnets even before obtaining metadata (see the sketch after this list).
  3. Update sync when we obtain a peer's metadata, and trigger a resume.
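
For illustration, a rough sketch of option 2 with simplified stand-in types (the constant and struct below are placeholders, not Lighthouse's actual data structures; the minimum value mirrors the spec's CUSTODY_REQUIREMENT but should come from the chain spec in practice):

    // Sketch only: assume the spec-minimum custody requirement for a peer
    // until its metadata (custody_group_count) is known.
    const MIN_CUSTODY_REQUIREMENT: u64 = 4; // illustrative; read from the chain spec in practice

    struct PeerInfo {
        /// `None` until we receive the peer's metadata response.
        custody_group_count: Option<u64>,
    }

    impl PeerInfo {
        /// Effective custody group count used by sync: fall back to the minimum,
        /// so the peer still counts towards some column subnets before metadata arrives.
        fn effective_custody_group_count(&self) -> u64 {
            self.custody_group_count.unwrap_or(MIN_CUSTODY_REQUIREMENT)
        }
    }

    fn main() {
        let peer = PeerInfo { custody_group_count: None };
        assert_eq!(peer.effective_custody_group_count(), MIN_CUSTODY_REQUIREMENT);
    }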

Additional Info

Range sync currently relies on the check syncing_chain.good_peers_on_sampling_subnets before requesting batches from peers. The three call sites are:

  1. if !self.good_peers_on_sampling_subnets(epoch, network) {

  2. } else if !self.good_peers_on_sampling_subnets(self.processing_target, network) {
         // This is to handle the case where no batch was sent for the current processing
         // target when there is no sampling peers available. This is a valid state and should not
         // return an error.
         return Ok(KeepChain);

  3. if !self.good_peers_on_sampling_subnets(self.to_be_downloaded, network) {
         debug!(
             self.log,
             "Waiting for peers to be available on custody column subnets"
         );
         return None;
     }

This is a workaround to avoid sending out excessive block requests, because block and data column requests are currently coupled. In the case where we request a batch and there are no peers on the required column subnets, the blocks-by-range request is sent but the data-columns-by-range request isn't, and it fails with RpcRequestSendError::NoCustodyPeers. This triggers a retry, and the node ends up sending excessive blocks-by-range requests to peers without making progress. The longer-term solution is to decouple the ByRange requests (#6258).
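
As an aside, the intent of the check above can be illustrated with a small self-contained sketch (simplified types, not the actual Lighthouse implementation): a batch is only requested once every sampling column subnet has at least one usable peer.

    // Simplified illustration of the guard's intent; not Lighthouse code.
    use std::collections::{HashMap, HashSet};

    type SubnetId = u64;
    type PeerId = u64;

    struct PeerPool {
        /// Peers known to serve each data column subnet.
        peers_per_subnet: HashMap<SubnetId, HashSet<PeerId>>,
    }

    impl PeerPool {
        /// True only if every sampling subnet has at least one peer.
        fn good_peers_on_sampling_subnets(&self, sampling_subnets: &[SubnetId]) -> bool {
            sampling_subnets.iter().all(|subnet| {
                self.peers_per_subnet
                    .get(subnet)
                    .map_or(false, |peers| !peers.is_empty())
            })
        }
    }

    fn main() {
        let pool = PeerPool { peers_per_subnet: HashMap::new() };
        // No peers on any subnet: the guard holds the batch back, which is the
        // "Waiting for peers to be available on custody column subnets" state.
        assert!(!pool.good_peers_on_sampling_subnets(&[0, 1, 2]));
    }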

@dapplion (Collaborator)

What if we change the AddPeer event to

    AddPeer(PeerId, SyncInfo, CustodyColumns),

So the peer manager waits until it knows the peer's columns before emitting it. Then sync doesn't need to rely on the network globals to compute who is custodying what.
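
Something along these lines, with placeholder types (the shape below is only a sketch of the idea, not the existing SyncMessage definition):

    // Placeholder types sketching the suggested event shape. The peer manager
    // would emit AddPeer only once both the status and metadata responses are
    // known, so the custody columns come with the event.
    type PeerId = u64;
    type ColumnIndex = u64;

    struct SyncInfo {
        head_slot: u64,
        finalized_epoch: u64,
    }

    enum SyncMessage {
        /// Peer is ready for sync: status known and custody columns resolved.
        AddPeer(PeerId, SyncInfo, Vec<ColumnIndex>),
    }

    fn main() {
        let msg = SyncMessage::AddPeer(
            1,
            SyncInfo { head_slot: 100, finalized_epoch: 2 },
            vec![0, 7, 42, 99], // custody columns derived from the peer's metadata
        );
        let SyncMessage::AddPeer(peer, info, columns) = msg;
        println!(
            "peer {peer} (head slot {}, finalized epoch {}) custodies {} columns",
            info.head_slot,
            info.finalized_epoch,
            columns.len()
        );
    }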

@jimmygchen (Member, Author)

Currently AddPeer is triggered on a peer status response, and then the event handler updates the peer's sync status & sync state. If we wait for the metadata response, it delays these from happening - do you see any issues with this?

Or is it worth adding another event for the metadata response, and just re-triggering column requests the same way you're re-triggering by-root requests here:

    // Try to make progress on custody requests that are waiting for peers
    for (id, result) in self.network.continue_custody_by_root_requests() {
        self.on_custody_by_root_result(id, result);
    }
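
A by-range analogue could look roughly like the sketch below (the struct and method names here are hypothetical, just to show the flow, not existing Lighthouse APIs):

    // Hypothetical sketch: on a metadata response, recompute the peer's custody
    // columns and nudge range sync to retry chains that were waiting on custody
    // column peers. Names are made up for illustration.
    struct SyncManager {
        stalled_chains: Vec<u64>, // ids of chains waiting on custody column peers
    }

    impl SyncManager {
        fn on_peer_metadata(&mut self, peer_id: u64) {
            // custody_group_count is now known, so the peer's custody columns can
            // be computed (omitted) and stalled chains re-evaluated.
            for chain_id in self.stalled_chains.drain(..) {
                println!("resuming chain {chain_id} after metadata from peer {peer_id}");
            }
        }
    }

    fn main() {
        let mut sync = SyncManager { stalled_chains: vec![7] };
        sync.on_peer_metadata(1);
    }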

Can we also start with the minimal custody requirement before we get the metadata response back?

This seems to be the main issue with syncing a new node on devnet-4 right now, and would be good to fix ASAP.

@dapplion (Collaborator)

> If we wait for metadata response, it delays these from happening - do you see any issues with this?

Hmm not really, RPC requests complete very fast and I don't see how this can be a bottleneck
