-
Notifications
You must be signed in to change notification settings - Fork 837
Shutdown chain router before waiting for chain creator to exit when shutting down #4633
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
…hutting down
Currently, when the X-chain executes transactions while bootstrapping, it cannot halt until it finishes executing the current batch of transactions,
because as seen in the stack trace below, the X-chain's bootstrapping is done from within the chain creation goroutine.
In order to stop the bootstrapper of the X-chain, the chain router needs to be stopped which in turn stops the various chains.
However, before stopping the chain router, the chain manager's shutdown procedure waits for the chain creation procedure to finish, which
cannot finish before the bootstrapping is done.
Fortunately, it's safe to close the chainc creation queue first, which prevents new chains from forming, and then proceed to stop the chain router
and only afterwards to wait for the chain creation to finish.
The reason it's safe is because the chain router simply calls Stop() on all chains, and each chain's stop procedure simply calls thread safe methods.
When compiling the node with race detector and shutting down the chain, it doesn't complain:
[12-01|19:37:36.630] INFO <X Chain> queue/jobs.go:202 executing operations {"numExecuted": 4157, "numToExecute": 248432, "eta": "29m23s"}
^C[12-01|19:37:42.065] INFO node/node.go:1841 shutting down node {"exitCode": 0}
[12-01|19:37:42.065] INFO health/worker.go:115 registered new check and initialized its state to failing {"name": "health", "name": "shuttingDown", "tags": ["application"]}
[12-01|19:37:42.066] INFO chains/manager.go:1526 shutting down chain manager
[12-01|19:37:42.066] INFO router/chain_router.go:374 shutting down chain router
[12-01|19:37:42.067] INFO <P Chain> bootstrap/bootstrapper.go:805 shutting down bootstrapper
[12-01|19:37:42.068] INFO <P Chain> router/chain_router.go:398 chain shutdown {"shutdownDuration": "1.737125ms"}
[12-01|19:37:42.069] INFO <X Chain> queue/jobs.go:137 interrupted execution {"numExecuted": 4956}
[12-01|19:37:42.069] INFO <X Chain> bootstrap/bootstrapper.go:307 shutting down Bootstrapper
[12-01|19:37:42.070] INFO <X Chain> router/chain_router.go:398 chain shutdown {"shutdownDuration": "3.656708ms"}
[12-01|19:37:42.070] INFO network/network.go:1142 shutting down the p2p networking
[12-01|19:37:42.079] INFO nat/nat.go:193 Unmapped all ports
[12-01|19:37:42.079] INFO node/node.go:1895 cleaning up plugin runtimes
[12-01|19:37:42.090] INFO node/node.go:1923 finished node shutdown
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/snow/engine/avalanche/bootstrap/bootstrapper.go:589 +0x4e0
github.com/ava-labs/avalanchego/snow/engine/avalanche/bootstrap.(*Bootstrapper).Start(0x140049bd6c0, {0x104324158, 0x10617cc80}, 0x0)
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/snow/engine/avalanche/bootstrap/bootstrapper.go:365 +0x608
github.com/ava-labs/avalanchego/snow/networking/handler.(*handler).Start(0x140069abce0, {0x104324158, 0x10617cc80}, 0x0)
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/snow/networking/handler/handler.go:257 +0x2b4
github.com/ava-labs/avalanchego/chains.(*manager).createChain(0x14000774008, {{0xab, 0x68, 0xeb, 0x1e, 0xe1, 0x42, 0xa0, 0x5c, 0xfe, ...}, ...})
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/chains/manager.go:471 +0x1504
github.com/ava-labs/avalanchego/chains.(*manager).dispatchChainCreator(0x14000774008)
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/chains/manager.go:1517 +0xcc
created by github.com/ava-labs/avalanchego/chains.(*manager).StartChainCreator in goroutine 1
/Users/avalancheuser/go/src/github.com/ava-labs/avalanchego/chains/manager.go:1494 +0x108
Signed-off-by: Yacov Manevich <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes a shutdown race condition where the chain manager would deadlock waiting for chain creation to complete before stopping the chain router, but chain creation could be blocked by bootstrapping chains that needed the router to stop first.
Key changes:
- Reordered shutdown sequence to stop chain router before waiting for chain creator goroutine to exit
- This allows bootstrapping chains to be interrupted, preventing the deadlock scenario
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
StephenButtolph
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If there is currently a chain being created. Won't the chain router shutdown race with adding the chain to the chain router?
Since the chain router's methods are all synchronized, it's mutually exclusive with the shutdown. If the addition happens first, then we're fine because the chain router will stop the chain. |
Why this should be merged
Currently, when the X-chain executes transactions while bootstrapping, it cannot halt until it finishes executing the current batch of transactions, because as seen in the stack trace below, the X-chain's bootstrapping is done from within the chain creation goroutine. In order to stop the bootstrapper of the X-chain, the chain router needs to be stopped which in turn stops the various chains. However, before stopping the chain router, the chain manager's shutdown procedure waits for the chain creation procedure to finish, which cannot finish before the bootstrapping is done.
How this works
Fortunately, it's safe to close the chain creation queue first, which prevents new chains from forming, and then proceed to stop the chain router and only afterwards to wait for the chain creation to finish.
The reason it's safe is because the chain router simply calls Stop() on all chains, and each chain's stop procedure simply calls thread safe methods.
How this was tested
When compiling a bootstrapping Fuji node with race detector and shutting down the chain, it doesn't complain:
Need to be documented in RELEASES.md?