Bound retry loops and detect reorgs to prevent sync stalls#1491
Open
Map Avalanche "cannot query unfinalized data" RPC errors to ErrBlockNotFound instead of swallowing them. This prevents failed trace calls from being treated as successful empty results, which caused trace length mismatches during sync.
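The mapping described above might look roughly like the following sketch. The helper name `mapTraceError`, the `rpcError` type, and the locally declared `ErrBlockNotFound` (standing in for `bchain.ErrBlockNotFound`) are assumptions made so the example compiles on its own; the actual change may match the error differently.

```go
package main

import (
	"errors"
	"fmt"
	"strings"
)

// ErrBlockNotFound is a local stand-in for bchain.ErrBlockNotFound.
var ErrBlockNotFound = errors.New("block not found")

// rpcError is a hypothetical stand-in for the error value returned by the RPC client.
type rpcError string

func (e rpcError) Error() string { return string(e) }

// mapTraceError (hypothetical) turns the Avalanche "unfinalized data" error
// into ErrBlockNotFound so the caller retries instead of treating the failed
// trace call as a successful empty result.
func mapTraceError(err error) error {
	if err == nil {
		return nil
	}
	if strings.Contains(err.Error(), "cannot query unfinalized data") {
		return ErrBlockNotFound
	}
	// Every other error is passed through unchanged.
	return err
}

func main() {
	fmt.Println(mapTraceError(rpcError("cannot query unfinalized data")) == ErrBlockNotFound)
	fmt.Println(mapTraceError(rpcError("connection refused")) == ErrBlockNotFound)
}
```

Matching on the message text is fragile; if the RPC client exposes a structured error code, comparing that would be the sturdier choice.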
Refactor the sync coordinator and workers to handle persistent backend failures and chain reorgs more robustly. Previously, the sync process could get stuck in infinite retry loops if the chain tip moved or a block became temporarily unavailable.
- Abstract sync logic into helpers (getBlockHashForSync, waitForBlockWorkers) to reduce duplication between parallel and bulk sync modes.
- Implement deadlock prevention by using non-blocking selects on channel sends to workers.
- Add reorg detection that triggers a sync restart (errResync) when the requested height is found to be past the current chain tip.
- Enhance signal handling to ensure clean exits during worker synchronization.
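As a rough illustration of the deadlock-prevention point, the coordinator can wrap each send to a worker in a `select` that also watches an abort channel, so it never blocks on a worker that has already stopped. The names `sendHeight`, `errAborted`, and the channel layout below are hypothetical and not taken from the PR.

```go
package main

import (
	"errors"
	"fmt"
)

// errAborted is a hypothetical sentinel reported when a worker has given up.
var errAborted = errors.New("worker aborted")

// sendHeight hands a block height to a worker, but returns early if the
// abort channel is closed. A bare `work <- height` would block forever once
// no worker is left to receive.
func sendHeight(work chan<- uint32, abort <-chan struct{}, height uint32) error {
	select {
	case work <- height:
		return nil
	case <-abort:
		return errAborted
	}
}

func main() {
	work := make(chan uint32)    // unbuffered and nobody is receiving
	abort := make(chan struct{}) // closed below to simulate an aborted worker
	close(abort)
	fmt.Println(sendHeight(work, abort, 100)) // prints "worker aborted"
}
```

The same shape extends to shutdown signals: adding further cases for a terminating channel or an OS-signal channel lets the coordinator exit cleanly from the same `select`.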
Pull request overview
This PR hardens the DB sync coordinator/worker logic to avoid sync stalls caused by unbounded retries when the chain tip moves (reorg/rollback) or the backend temporarily can’t serve a block/hash.
Changes:
- Added bounded retry + “tip fell below requested height” detection in `GetBlockHash` coordination to trigger `errResync` instead of retrying indefinitely.
- Refactored worker-coordination error handling via `recordBlockWorkerAbort` and `waitForBlockWorkers` to prevent coordinator deadlocks and improve shutdown behavior.
- Added/expanded regression tests for reorg detection, worker abort propagation, and `EthereumRPC.GetBlockHash` error mapping to `bchain.ErrBlockNotFound`.
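The bounded retry with tip detection summarized in the first bullet can be sketched as follows. This is not the code from `db/sync.go`: the `chain` interface, `hashWithBoundedRetry`, `fakeChain`, and the locally declared sentinels are assumptions for illustration, and a real loop would also sleep between attempts and watch for shutdown.

```go
package main

import (
	"errors"
	"fmt"
)

var (
	// ErrBlockNotFound is a local stand-in for bchain.ErrBlockNotFound.
	ErrBlockNotFound = errors.New("block not found")
	// errResync is a local stand-in for the sentinel that restarts the sync.
	errResync = errors.New("resync")
)

// chain is a hypothetical minimal view of the backend.
type chain interface {
	GetBlockHash(height uint32) (string, error)
	GetBestBlockHeight() (uint32, error)
}

// hashWithBoundedRetry tries at most maxAttempts times. If the block is
// missing and the tip is now below the requested height, it reports
// errResync instead of retrying.
func hashWithBoundedRetry(c chain, height uint32, maxAttempts int) (string, error) {
	var lastErr error
	for attempt := 0; attempt < maxAttempts; attempt++ {
		hash, err := c.GetBlockHash(height)
		if err == nil {
			return hash, nil
		}
		lastErr = err
		if errors.Is(err, ErrBlockNotFound) {
			if tip, tipErr := c.GetBestBlockHeight(); tipErr == nil && tip < height {
				return "", errResync
			}
		}
	}
	return "", fmt.Errorf("height %d: gave up after %d attempts: %w", height, maxAttempts, lastErr)
}

// fakeChain is a test double whose tip can sit below the requested height.
type fakeChain struct{ tip uint32 }

func (f fakeChain) GetBlockHash(height uint32) (string, error) {
	if height > f.tip {
		return "", ErrBlockNotFound
	}
	return fmt.Sprintf("hash-%d", height), nil
}

func (f fakeChain) GetBestBlockHeight() (uint32, error) { return f.tip, nil }

func main() {
	_, err := hashWithBoundedRetry(fakeChain{tip: 99}, 100, 3)
	fmt.Println(errors.Is(err, errResync)) // prints "true"
}
```

Returning a distinct sentinel lets the caller tell "restart the sync from a lower height" apart from "the backend is persistently failing", which a plain retry counter cannot express.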
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| db/sync.go | Adds helper functions for bounded hash retries, reorg detection, and abort-aware worker waiting; updates parallel/bulk connect loops to use them. |
| db/sync_test.go | Adds unit regression tests covering resync signaling, abort handling, and deadlock prevention paths. |
| bchain/coins/eth/ethrpc_blockhash_test.go | Verifies EthereumRPC.GetBlockHash maps “not found” errors to bchain.ErrBlockNotFound for sync retry logic. |
Comments suppressed due to low confidence (1)
db/sync.go:695
`getBlockWorker` unconditionally increments `w.metrics.IndexResyncErrors` on retryable `GetBlock` errors. Since `SyncWorker.metrics` is a pointer and constructors don’t enforce it being non-nil, this can panic in callers/tests that create a worker without metrics when a retryable error occurs. Either guard this increment with `if w.metrics != nil` (similar to `getBlockHashForSync`) or ensure `metrics` is always initialized before starting workers.
w.metrics.IndexResyncErrors.With(common.Labels{"error": "failure"}).Inc()
select {
case <-terminating:
	return
case <-w.chanOsSignal:
	return
case <-time.After(cfg.RetryDelay):
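A minimal sketch of the nil guard suggested in this comment follows. The `counter`, `metrics`, and `worker` types and the `recordResyncError` method are simplified, hypothetical stand-ins; the real code uses a labeled counter on `SyncWorker.metrics`.

```go
package main

import "fmt"

// counter is a simplified stand-in for a labeled metrics counter.
type counter struct{ n int }

func (c *counter) Inc() { c.n++ }

// metrics and worker are hypothetical, trimmed-down versions of the real types.
type metrics struct{ IndexResyncErrors *counter }

type worker struct{ metrics *metrics }

// recordResyncError increments the counter only when metrics were provided,
// so a worker built without metrics (as in some tests) does not panic.
func (w *worker) recordResyncError() {
	if w.metrics == nil {
		return
	}
	w.metrics.IndexResyncErrors.Inc()
}

func main() {
	bare := &worker{}
	bare.recordResyncError() // no metrics configured: safely skipped

	m := &metrics{IndexResyncErrors: &counter{}}
	w := &worker{metrics: m}
	w.recordResyncError()
	fmt.Println(m.IndexResyncErrors.n) // prints "1"
}
```

The alternative named in the comment, always initializing `metrics` before starting workers, removes the need for a guard at each call site but puts the burden on every constructor and test.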
Refactored the sync coordinator and workers to handle persistent backend failures and chain reorgs more robustly. Previously, the sync process could get stuck in infinite retry loops if the chain tip moved or a block became temporarily unavailable.
This solution won over #1490