Skip to content

Bound retry loops and detect reorgs to prevent sync stalls#1491

Open
pragmaxim wants to merge 4 commits intomasterfrom
fix/stuck-syncing-2
Open

Bound retry loops and detect reorgs to prevent sync stalls#1491
pragmaxim wants to merge 4 commits intomasterfrom
fix/stuck-syncing-2

Conversation

@pragmaxim
Copy link
Copy Markdown
Contributor

@pragmaxim pragmaxim commented May 2, 2026

Refactored the sync coordinator and workers to handle persistent backend failures and chain reorgs more robustly. Previously, the sync process could get stuck in infinite retry loops if the chain tip moved or a block became temporarily unavailable.

  • Abstract sync logic into helpers (getBlockHashForSync, waitForBlockWorkers) to reduce duplication between parallel and bulk sync modes.
  • Implement deadlock prevention by using non-blocking selects on channel sends to workers.
  • Add reorg detection that triggers a sync restart (errResync) when the requested height is found to be past the current chain tip.
  • Enhance signal handling to ensure clean exits during worker synchronization.

This solution won over #1490

Map Avalanche "cannot query unfinalized data" RPC errors to ErrBlockNotFound
instead of swallowing them. This prevents failed trace calls from being treated
as successful empty results, which caused trace length mismatches during sync.
@pragmaxim pragmaxim force-pushed the fix/stuck-syncing-2 branch from 90b63c8 to 6020aa5 Compare May 2, 2026 05:51
@pragmaxim pragmaxim changed the title fix(stuck-syncing): bound retry loops that could stall chain-tip resync fix(stuck-syncing) V2: bound retry loops that could stall chain-tip resync May 2, 2026
Refactor the sync coordinator and workers to handle persistent backend
failures and chain reorgs more robustly. Previously, the sync process
could get stuck in infinite retry loops if the chain tip moved or a
block became temporarily unavailable.

- Abstract sync logic into helpers (getBlockHashForSync, waitForBlockWorkers) to reduce duplication between parallel and bulk sync modes.
- Implement deadlock prevention by using non-blocking selects on channel sends to workers.
- Add reorg detection that triggers a sync restart (errResync) when the requested height is found to be past the current chain tip.
- Enhance signal handling to ensure clean exits during worker synchronization.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the DB sync coordinator/worker logic to avoid sync stalls caused by unbounded retries when the chain tip moves (reorg/rollback) or the backend temporarily can’t serve a block/hash.

Changes:

  • Added bounded retry + “tip fell below requested height” detection in GetBlockHash coordination to trigger errResync instead of retrying indefinitely.
  • Refactored worker-coordination error handling via recordBlockWorkerAbort and waitForBlockWorkers to prevent coordinator deadlocks and improve shutdown behavior.
  • Added/expanded regression tests for reorg detection, worker abort propagation, and EthereumRPC.GetBlockHash error mapping to bchain.ErrBlockNotFound.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
db/sync.go Adds helper functions for bounded hash retries, reorg detection, and abort-aware worker waiting; updates parallel/bulk connect loops to use them.
db/sync_test.go Adds unit regression tests covering resync signaling, abort handling, and deadlock prevention paths.
bchain/coins/eth/ethrpc_blockhash_test.go Verifies EthereumRPC.GetBlockHash maps “not found” errors to bchain.ErrBlockNotFound for sync retry logic.
Comments suppressed due to low confidence (1)

db/sync.go:695

  • getBlockWorker unconditionally increments w.metrics.IndexResyncErrors on retryable GetBlock errors. Since SyncWorker.metrics is a pointer and constructors don’t enforce it being non-nil, this can panic in callers/tests that create a worker without metrics when a retryable error occurs. Either guard this increment with if w.metrics != nil (similar to getBlockHashForSync) or ensure metrics is always initialized before starting workers.
				w.metrics.IndexResyncErrors.With(common.Labels{"error": "failure"}).Inc()
				select {
				case <-terminating:
					return
				case <-w.chanOsSignal:
					return
				case <-time.After(cfg.RetryDelay):

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

@pragmaxim pragmaxim changed the title fix(stuck-syncing) V2: bound retry loops that could stall chain-tip resync Bound retry loops and detect reorgs to prevent sync stalls May 2, 2026
Base automatically changed from fix/avax-retry-unfinalized-responses to master May 4, 2026 04:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants