Tree-sync friendly lookup sync tests #8592
base: unstable
Conversation
// removed from the da_checker. Note that ALL components are removed from the da_checker
// so when we re-download and process the block we get the error
// MissingComponentsAfterAllProcessed and get stuck.
lookup.reset_requests();
Bug found while testing: lookups may get stuck given this sequence of events.
// sending retry requests to the disconnecting peer.
for sync_request_id in self.network.peer_disconnected(peer_id) {
    self.inject_error(*peer_id, sync_request_id, RPCError::Disconnected);
}
Minor bug: we need to remove the peer from the sync states (e.g. self.block_lookups) and then inject the disconnect events. Otherwise we may send requests to peers that are already disconnected. I don't think there's a risk of sync getting stuck if libp2p rejects sending messages to disconnected peers, but it deserves a fix anyway.
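For illustration only, here is a minimal sketch of that ordering using stub types. This is not the real SyncManager code; the field names and types below are placeholders for the per-peer sync state and in-flight requests.

```rust
// Placeholder types; the real code operates on SyncManager state such as
// block_lookups and injects RPCError::Disconnected.
struct Sync {
    peers: Vec<u64>,            // stand-in for per-peer sync state
    in_flight: Vec<(u64, u64)>, // (peer, request_id) pairs
}

impl Sync {
    fn on_peer_disconnected(&mut self, peer: u64) {
        // 1) Remove the peer from the sync data structures first, so retries
        //    triggered below cannot select it again.
        self.peers.retain(|p| *p != peer);

        // 2) Only then fail its in-flight requests; each failure re-dispatches
        //    the request to a peer that is still connected.
        let failed: Vec<u64> = self
            .in_flight
            .iter()
            .filter(|(p, _)| *p == peer)
            .map(|(_, id)| *id)
            .collect();
        self.in_flight.retain(|(p, _)| *p != peer);
        for _request_id in failed {
            // inject_error(peer, request_id, Disconnected) in the real code
        }
    }
}
```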
This pull request has merge conflicts. Could you please resolve them @dapplion? 🙏
.get_blinded_block(block_root)
.unwrap()
.unwrap_or_else(|| {
    panic!("block root does not exist in external harness {block_root:?}")
This isn't always the "external" harness.
Suggested change:
- panic!("block root does not exist in external harness {block_root:?}")
+ panic!("block root does not exist in harness {block_root:?}")
Done
#[cfg(test)]
#[derive(Debug)]
/// Tuple of `SingleLookupId`, requested block root, awaiting parent block root (if any),
Please update the doc - it's no longer a tuple and the awaiting-parent block root was removed.
Done
beacon_node/network/Cargo.toml
k256 = "0.13.4"
kzg = { workspace = true }
matches = "0.1.8"
paste = "1.0.15"
Suggested change:
- paste = "1.0.15"
+ paste = { workspace = true }
RECENT_FORKS_BEFORE_GLOAS=electra fulu

# List of all recent hard forks. This list is used to set env variables for http_api tests
# Include phase0 to test the code paths in sync that are pre blobs
We already have nightly-tests that run prior fork tests:
#8319
But I just realised it hasn't been activated on the sigp fork because GitHub only runs scheduled workflows from the main branch (stable). We can either wait until the release or open a separate PR to stable to activate this.
Made a PR to activate these nightly tests:
#8636
Could we keep this for network tests only? It's just one extra fork and makes it easy to debug and catch errors. For sync tests we should keep only the forks that add new objects and run only:
- phase0, deneb, fulu
/// Beacon chain harness
harness: BeaconChainHarness<EphemeralHarnessType<E>>,
/// External beacon chain harness to produce blocks that are not imported
external_harness: BeaconChainHarness<EphemeralHarnessType<E>>,
Any reason to have this as a field on TestRig? I see that it's only used in build_chain
Good find! Moved to build_chain
// Inject a Disconnected error on all requests associated with the disconnected peer
// to retry all batches/lookups. Only after removing the peer from the data structures to
// sending retry requests to the disconnecting peer.
Missing word, I think.
Suggested change:
- // sending retry requests to the disconnecting peer.
+ // avoid sending retry requests to the disconnecting peer.
Some required checks have failed. Could you please take a look @dapplion? 🙏
None
// Network / external peers simulated behaviour

async fn simulate(&mut self, complete_strategy: CompleteStrategy) {
Would be good to add some brief docs here on what it does and how to use it.
The names simulate and CompleteStrategy are a bit generic. I thought about NetworkResponseStrategy or PeerBehaviourStrategy, but realised the struct also has configuration that isn't peer-behaviour related, like process_result_conditional and block_imported_while_processing. I can't think of a better name right now though.
Maybe SimulateConfig?
    self
}

fn process_result<F>(mut self, f: F) -> Self
I think we can improve this name to make it obvious that this is implementing the builder pattern rather than actually processing results; we could use the standard with_ prefix.
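As an illustration of the suggested with_ naming, here is a minimal, self-contained sketch; the field and closure types are assumptions, not the actual CompleteStrategy definition.

```rust
// Hypothetical sketch; the real CompleteStrategy fields and closure signature may differ.
struct CompleteStrategy {
    process_result: Option<Box<dyn Fn(usize) -> bool>>,
}

impl CompleteStrategy {
    /// Builder-style setter: stores the closure and returns `self` for chaining,
    /// rather than performing any processing itself.
    fn with_process_result<F>(mut self, f: F) -> Self
    where
        F: Fn(usize) -> bool + 'static,
    {
        self.process_result = Some(Box::new(f));
        self
    }
}
```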
    self
}

fn block_imported_while_processing(mut self, block_root: Hash256) -> Self {
same as above
    &self.validator_keypairs[*validator_index].sk.sign(message),
);
// If disable_crypto is true keep the attestation signature as infinity
if self.chain.config.test_config.disable_crypto {
I think we're missing a ! here?
    &self.validator_keypairs[*validator_index].sk.sign(message),
);
// If disable_crypto is true keep the attestation signature as infinity
if self.chain.config.test_config.disable_crypto {
missing a ! here?
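A toy illustration of the logic flip both comments point at (not Lighthouse code; it just shows why the condition needs the negation to match the quoted comment's intent):

```rust
// Stand-in values: 0 represents the "infinity" signature, `real_sig` a computed one.
fn attestation_signature(disable_crypto: bool, real_sig: u64) -> u64 {
    let infinity = 0;
    // Without the `!`, the real signature would only be applied when crypto is
    // disabled, which is the opposite of the stated intent.
    if !disable_crypto { real_sig } else { infinity }
}
```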
/// Lookup's Id
id: Id,
block_root: Hash256,
max_seen_peers: HashSet<PeerId>,
I think max_ isn't necessary here, or does it mean something other than seen peers?
// Trigger the request
rig.trigger_unknown_block_from_attestation(block_hash, peer_id);
let id = rig.expect_block_lookup_request(block_hash);
/// Assert that sync completes from a GossipUnknownParentBlob / UknownDataColumnParent
typo
Suggested change:
- /// Assert that sync completes from a GossipUnknownParentBlob / UknownDataColumnParent
+ /// Assert that sync completes from a GossipUnknownParentBlob / UnknownDataColumnParent
/// Test added in https://github.com/sigp/lighthouse/commit/84c7d8cc7006a6f1f1bb5729ab222b9f85f72727
/// TODO: This test was added on a very old version of lookup sync. It's unclear if the situation
/// it wants to recreate is possible or problematic in current code. Skipping.
#[ignore]
Is this test failing?
    chain_hash,
    BlockProcessingResult::Ok(AvailabilityProcessingStatus::Imported(chain_hash)),
);
fn reset_metrics(&mut self) {
reset_metrics seems a bit misleading here, perhaps something like capture_metrics_baseline is more suitable?
let id = self.find_single_lookup_for(self.find_oldest_parent_lookup(chain_hash));
self.single_blob_component_processed(id, result);
fn completed_lookups(&self) -> usize {
    // Substract initial value to allow resetting metrics mid test
typo
Suggested change:
- // Substract initial value to allow resetting metrics mid test
+ // Subtract initial value to allow resetting metrics mid test
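A minimal sketch of the baseline idea behind both comments (stub type, not the actual TestRig code): instead of resetting a global metric, capture its current value and subtract it from later reads.

```rust
struct MetricsBaseline {
    completed_lookups_at_baseline: usize,
}

impl MetricsBaseline {
    /// Record the current counter value; later reads are expressed relative to
    /// this point in the test (hence the "capture_metrics_baseline" suggestion).
    fn capture(current: usize) -> Self {
        Self { completed_lookups_at_baseline: current }
    }

    /// Subtract the baseline so the result counts only lookups completed since
    /// the baseline was captured.
    fn completed_since(&self, current: usize) -> usize {
        current - self.completed_lookups_at_baseline
    }
}
```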
    self.sync_rx_queue.push(ev);
}

// Choose at random which queue to process first
I don't fully understand the purpose of deterministically choosing a random queue to process?
jimmygchen left a comment:
Hey @dapplion
I've done a round of review and added a few comments. There's a logic error in the condition, and there are CI failures that need to be fixed.
Overall I think the approach is good - offloading complexity to TestRig makes individual tests cleaner. I'm slightly concerned about the growing complexity of TestRig itself and how easy it would be to extend in the future, but I guess this may be needed as sync itself is inherently complex.
- Fix logic error: add missing `!` in attestation signing conditions
- Fix typos: Substract -> Subtract, UknownDataColumnParent -> UnknownDataColumnParent
- Rename reset_metrics to capture_metrics_baseline
- Rename max_seen_peers to seen_peers
- Use with_ prefix for builder methods (with_process_result, with_block_imported_while_processing)
- Add documentation to simulate fn and CompleteStrategy
Issue Addressed
Current lookup sync tests are written in an explicit way that assumes how the internals of lookup sync work: a test triggers the lookup and then asserts every intermediate request, response, and processing event explicitly.
This is unnecessarily verbose, and it requires a complete re-write whenever something changes in the internals of lookup sync (this has happened a few times, mostly for deneb and fulu).
What we really want to assert is just the outcome of sync, not each intermediate step.
Proposed Changes
Keep all existing tests and add new cases written in the new style described above. The logic to serve and respond to requests is in the function fn simulate (https://github.com/dapplion/lighthouse/blob/2288a3aeb11164bb1960dc803f41696c984c69ff/beacon_node/network/src/sync/tests/lookups.rs#L301). Its behaviour is configured via CompleteStrategy, where you can set, for example, "respond to BlocksByRoot requests with empty", and via TestConfig. Along the way I found a couple of bugs, which I documented on the diff.
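A hedged sketch of the contrast between the two styles, using a stub rig: only trigger_unknown_block_from_attestation, expect_block_lookup_request, simulate, and CompleteStrategy are names that appear in this PR; the remaining helpers and all bodies are placeholders, not the real TestRig API.

```rust
// Stub rig used only to illustrate the shape of the two test styles.
struct StubRig;
struct CompleteStrategy; // in the real code this configures peer/network behaviour

impl StubRig {
    /// Old style: the test asserts every internal step of lookup sync.
    fn old_style(&mut self) {
        self.trigger_unknown_block_from_attestation();
        let _id = self.expect_block_lookup_request();
        // ...respond to that exact request, assert processing events, and so on...
    }

    /// New style: declare how peers respond, then let `simulate` drive the
    /// request/response loop until sync reaches a terminal state.
    fn new_style(&mut self) {
        self.simulate(CompleteStrategy);
        self.assert_block_imported();
    }

    fn trigger_unknown_block_from_attestation(&mut self) {}
    fn expect_block_lookup_request(&mut self) -> usize { 0 }
    fn simulate(&mut self, _strategy: CompleteStrategy) {}
    fn assert_block_imported(&mut self) {}
}
```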
Review guide
Look at lighthouse/beacon_node/network/src/sync/tests/lookups.rs directly (no diff). Other changes are very minor and should not affect production paths.