feat: High availability via RAFT #2836
base: main
Conversation
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Codecov Report ❌ Patch coverage details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2836      +/-  ##
==========================================
- Coverage   64.53%   62.05%    -2.48%
==========================================
  Files          81       88        +7
  Lines        7382     8083      +701
==========================================
+ Hits         4764     5016      +252
- Misses       2076     2490      +414
- Partials      542      577       +35
```

Flags with carried forward coverage won't be shown. View full report in Codecov.
* main:
  * build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  * build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  * build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  * build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  * build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@claude review again
Claude finished @alpe's task — Comprehensive Code Review

I've analyzed this high-availability RAFT implementation PR. Here's my detailed review across multiple dimensions:

1. Architecture & Design Patterns

✅ Strengths
* Modular RAFT Integration: the separation of concerns is well executed.
* Clean State Machine
* Factory Pattern: the use of factory functions for leader/follower initialization.
|
julienrbrt left a comment:
First walkthrough; I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go.
```go
	return nil
}

// Height returns the current height stored
```
Why do we need to know the height of the p2p (go-header) store? (I am still reviewing; this may get clearer.) We can get the app height from the evolve store.
When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.
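A minimal sketch of the catch-up check described above, assuming hypothetical store interfaces (the real `isCatchingUpState` lives on the Syncer and inspects the actual stores):

```go
package main

import "fmt"

// heightStore is a hypothetical read-only view over a block/header store.
type heightStore interface {
	Height() uint64
}

// isCatchingUp reports whether any store has seen a height beyond the
// node's current state height -- a sketch of the isCatchingUpState idea,
// not the PR's actual implementation.
func isCatchingUp(current uint64, stores ...heightStore) bool {
	for _, s := range stores {
		if s.Height() > current {
			return true
		}
	}
	return false
}

type fixedHeight uint64

func (f fixedHeight) Height() uint64 { return uint64(f) }

func main() {
	// A node at height 10 whose p2p header store already holds height 12
	// is still catching up and should hand leadership back.
	fmt.Println(isCatchingUp(10, fixedHeight(12), fixedHeight(9)))
	fmt.Println(isCatchingUp(12, fixedHeight(12), fixedHeight(9)))
}
```

In the election loop, a `true` result would trigger a leadership transfer (hashicorp/raft exposes `LeadershipTransfer()` for that) rather than starting the aggregator.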
```go
}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
```
nit: what is this for? the go doc is very light
The channel is passed in by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.
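A rough sketch of that apply-callback wiring; the `RaftApplyMsg` fields and the trimmed `Node` here are illustrative stand-ins, not the PR's actual types:

```go
package main

import "fmt"

// RaftApplyMsg sketches the message delivered when a raft log entry is
// applied; the real type in the PR may carry different fields.
type RaftApplyMsg struct {
	Index uint64
	Data  []byte
}

// Node mimics the raft node holding an optional apply callback channel.
type Node struct {
	applyCh chan<- RaftApplyMsg
}

// SetApplyCallback registers the channel the syncer listens on; applied
// entries are pushed here so followers learn state faster than via p2p.
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) { n.applyCh = ch }

// apply simulates the FSM applying one committed log entry.
func (n *Node) apply(m RaftApplyMsg) {
	if n.applyCh != nil {
		n.applyCh <- m
	}
}

func main() {
	ch := make(chan RaftApplyMsg, 1) // buffered so apply never blocks here
	n := &Node{}
	n.SetApplyCallback(ch)
	n.apply(RaftApplyMsg{Index: 1, Data: []byte("header")})
	m := <-ch
	fmt.Println(m.Index, string(m.Data))
}
```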
```go
}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
```
Unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.
Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft, as there are no aggregators.
```go
leaderFactory := func() (raftpkg.Runnable, error) {
	logger.Info().Msg("Starting aggregator-MODE")
	nodeConfig.Node.Aggregator = true
	nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
```
Not sure I understand this. Is the aggregator broadcasting to no one?
The aggregator is required to broadcast to at least one node that is part of a larger mesh; otherwise p2p will not work.
This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the raft leader gets the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.
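A toy sketch of the leader-factory behaviour discussed here (restarting as aggregator with peers cleared); the `Config` fields are invented stand-ins for the real node config:

```go
package main

import "fmt"

// Config is a trimmed, hypothetical stand-in for the node config.
type Config struct {
	Aggregator bool
	Peers      string
}

// leaderFactory sketches the factory-pattern switch: when this node wins
// the raft election it restarts in aggregator mode with an empty peer
// list, since the aggregator is dialed by the mesh rather than dialing
// out itself. Names are illustrative only.
func leaderFactory(cfg Config) Config {
	cfg.Aggregator = true
	cfg.Peers = "" // the aggregator gets called, it does not call out
	return cfg
}

func main() {
	follower := Config{Aggregator: false, Peers: "/dns/node-1/tcp/7676"}
	leader := leaderFactory(follower)
	fmt.Println(leader.Aggregator, leader.Peers == "")
}
```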
node/full.go (Outdated)
```go
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
	raftDir := nodeConfig.Raft.RaftDir
	if raftDir == "" {
		raftDir = filepath.Join(nodeConfig.RootDir, "raft")
```
nit: we should be using DefaultConfig() value if empty.
```go
	return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo: use protobuf
```
why the todo? size?
We should migrate to protobuf here. JSON will cause overhead, and at 100ms we need to minimise it as much as possible.
* main:
  * chore: fix some comments (#2874)
  * chore: bump node in evm-single (#2875)
  * refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  * refactor: use state da height as well (#2872)
  * refactor: retrieve highest da height in cache (#2870)
  * chore: change from event count to start and end height (#2871)
Overview: Speed up cache writes/loads via parallel execution. Pulled from #2836.
Overview: Minor updates to make it easier to trace errors. Extracted from #2836.
* main:
  * chore: remove extra github action yml file (#2882)
  * fix(execution/evm): verify payload status (#2863)
  * feat: fetch included da height from store (#2880)
  * chore: better output on errors (#2879)
  * refactor!: create da client and split cache interface (#2878)
  * chore!: rename `evm-single` and `grpc-single` (#2839)
  * build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  * chore: parallel cache de/serialization (#2868)
  * chore: bump blob size (#2877)
```go
// Propose block to raft to share state in the cluster
if e.raftNode != nil {
	headerBytes, err := header.MarshalBinary()
```
nit: in the flow of this function we decode this data once and encode it twice. I wonder if we can make it decode-only. This can be a follow-up, to not inflate this PR.
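One way to get to "decode only", sketched with invented types: keep the original wire bytes next to the decoded header and hand those to the raft proposal instead of re-marshalling:

```go
package main

import "fmt"

// header is a toy decoded type; the real one is the chain's signed header.
type header struct{ Height uint64 }

// decoded pairs the parsed header with the original wire bytes, so a
// later raft proposal can reuse raw instead of calling MarshalBinary
// again. Names are hypothetical.
type decoded struct {
	hdr header
	raw []byte
}

// propose reuses the bytes we already have rather than re-encoding d.hdr.
func propose(d decoded) []byte {
	return d.raw
}

func main() {
	wire := []byte{0x0A, 0x02, 0x08, 0x2A} // pretend wire encoding
	d := decoded{hdr: header{Height: 42}, raw: wire}
	fmt.Println(len(propose(d)))
}
```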
```go
	database ds.Batching,
	logger zerolog.Logger,
) (ln *LightNode, err error) {
	p2pClient, err := p2p.NewClient(conf.P2P, nodeKey.PrivKey, database, genesis.ChainID, logger, nil)
```
What is the reasoning behind moving this from the composing part to the constructor?
There is no strong reason for this other than making it consistent with the full node constructor.
With full nodes, the p2p client is set up in failover.go so that it can be reset when the sync node becomes leader and peers must be empty.
tac0turtle left a comment:
Overall looks good to me. What sort of latency does the node have for the switch?
* main:
  * build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  * refactor(block): centralize timeout in client (#2903)
  * build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  * chore: bump default timeout (#2902)
  * fix: revert default db (#2897)
  * refactor: remove obsolete // +build tag (#2899)
  * fix: da visualiser namespace (#2895)
  * refactor: omit unnecessary reassignment (#2892)
  * build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  * chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  * fix: ensure consistent network ID usage in P2P subscriber (#2884)
  * build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  * build(deps): Bump actions/checkout from 5 to 6 (#2886)
Implement failover via RAFT