feat: High availability via RAFT #2836
base: main
Conversation
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
Codecov Report ❌ Patch coverage details and impacted files:

```
@@            Coverage Diff             @@
##             main    #2836      +/-  ##
==========================================
- Coverage   64.53%   62.05%    -2.48%
==========================================
  Files          81       88        +7
  Lines        7382     8083      +701
==========================================
+ Hits         4764     5016      +252
- Misses       2076     2490      +414
- Partials      542      577       +35
```

Flags with carried forward coverage won't be shown. View full report in Codecov.
* main:
  * build(deps): Bump the go_modules group across 2 directories with 3 updates (#2846)
  * build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.7.0 to 1.8.0 in /test/e2e (#2851)
  * build(deps): Bump github.com/consensys/gnark-crypto from 0.18.0 to 0.18.1 in /test/e2e (#2844)
  * build(deps): Bump github.com/cometbft/cometbft from 0.38.17 to 0.38.19 in /test/e2e (#2843)
  * build(deps): Bump github.com/dvsekhvalnov/jose2go from 1.6.0 to 1.7.0 in /test/e2e (#2845)
(cherry picked from commit c44cd77e665f6d5d463295c6ed61c59a56d88db3)
@claude review again
Claude finished @alpe's task — Comprehensive Code Review

I've analyzed this high-availability RAFT implementation PR. Here's my detailed review across multiple dimensions:

1. Architecture & Design Patterns

✅ Strengths
* Modular RAFT Integration: the separation of concerns is well executed.
* Clean State Machine
* Factory Pattern: the use of factory functions for leader/follower initialization.
|
julienrbrt left a comment:
First walkthrough; I'll go read about github.com/hashicorp/raft and come back to review election.go and node.go.
```go
	return nil
}

// Height returns the current height stored
```
Why do we need to know the height of the p2p (go-header) store? (I am still reviewing; this may get clearer.) We can get the app height from the evolve store.
When the node switches from sync to aggregator mode, the internal state is key to preventing double signing.
The Syncer now has an isCatchingUpState method that checks the stores for any height > current.
It is called within the leader election loop to transfer leadership away in case the node is not fully synced yet.
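A minimal sketch of the catch-up check described above, assuming hypothetical store interfaces (the real `isCatchingUpState` lives on the Syncer and inspects the actual stores):

```go
package main

import "fmt"

// heightStore is a hypothetical read-only view over a block/header store.
type heightStore interface {
	Height() uint64
}

// isCatchingUp reports whether any store has seen a height beyond the
// node's current state height -- a sketch of the isCatchingUpState idea,
// not the PR's actual implementation.
func isCatchingUp(current uint64, stores ...heightStore) bool {
	for _, s := range stores {
		if s.Height() > current {
			return true
		}
	}
	return false
}

type fixedHeight uint64

func (f fixedHeight) Height() uint64 { return uint64(f) }

func main() {
	// A node at height 10 whose p2p header store already holds height 12
	// is still catching up and should hand leadership back.
	fmt.Println(isCatchingUp(10, fixedHeight(12), fixedHeight(9)))
	fmt.Println(isCatchingUp(12, fixedHeight(12), fixedHeight(9)))
}
```

In the election loop, a `true` result would trigger a leadership transfer (hashicorp/raft exposes `LeadershipTransfer()` for that) rather than starting the aggregator.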
```go
}

// SetApplyCallback sets a callback to be called when log entries are applied
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) {
```
nit: what is this for? the go doc is very light
The channel is passed in by the syncer to receive first-level state updates from within the raft cluster. This should be the fastest communication channel available.
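A rough sketch of that apply-callback wiring; the `RaftApplyMsg` fields and the trimmed `Node` here are illustrative stand-ins, not the PR's actual types:

```go
package main

import "fmt"

// RaftApplyMsg sketches the message delivered when a raft log entry is
// applied; the real type in the PR may carry different fields.
type RaftApplyMsg struct {
	Index uint64
	Data  []byte
}

// Node mimics the raft node holding an optional apply callback channel.
type Node struct {
	applyCh chan<- RaftApplyMsg
}

// SetApplyCallback registers the channel the syncer listens on; applied
// entries are pushed here so followers learn state faster than via p2p.
func (n *Node) SetApplyCallback(ch chan<- RaftApplyMsg) { n.applyCh = ch }

// apply simulates the FSM applying one committed log entry.
func (n *Node) apply(m RaftApplyMsg) {
	if n.applyCh != nil {
		n.applyCh <- m
	}
}

func main() {
	ch := make(chan RaftApplyMsg, 1) // buffered so apply never blocks here
	n := &Node{}
	n.SetApplyCallback(ch)
	n.apply(RaftApplyMsg{Index: 1, Data: []byte("header")})
	m := <-ch
	fmt.Println(m.Index, string(m.Data))
}
```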
```go
}()

// Check raft leadership if raft is enabled
if e.raftNode != nil && !e.raftNode.IsLeader() {
```
Unrelated: I wonder how this will play with different sequencers.
In #2797 you can get to that path without a node key (to sign). I suppose we'll need to add a condition for based sequencing.
Yes, I was only preparing for the single sequencer. Based sequencing would not work with raft, as there are no aggregators.
```go
leaderFactory := func() (raftpkg.Runnable, error) {
	logger.Info().Msg("Starting aggregator-MODE")
	nodeConfig.Node.Aggregator = true
	nodeConfig.P2P.Peers = "" // peers are not supported in aggregator mode
```
Not sure I understand this. Is the aggregator broadcasting to no one?
The aggregator is required to broadcast to at least one node that is part of a larger mesh; otherwise p2p will not work.
This is more about who calls whom: the aggregator gets called, not the other way around. Starting all nodes with a p2p-peer setup makes sense, though. When an HA cluster is set up, the raft leader gets the aggregator role, and I clear the peers when the p2p stack is restarted.
There is an error thrown somewhere when peers are not empty.
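A toy sketch of the leader-factory behaviour discussed here (restarting as aggregator with peers cleared); the `Config` fields are invented stand-ins for the real node config:

```go
package main

import "fmt"

// Config is a trimmed, hypothetical stand-in for the node config.
type Config struct {
	Aggregator bool
	Peers      string
}

// leaderFactory sketches the factory-pattern switch: when this node wins
// the raft election it restarts in aggregator mode with an empty peer
// list, since the aggregator is dialed by the mesh rather than dialing
// out itself. Names are illustrative only.
func leaderFactory(cfg Config) Config {
	cfg.Aggregator = true
	cfg.Peers = "" // the aggregator gets called, it does not call out
	return cfg
}

func main() {
	follower := Config{Aggregator: false, Peers: "/dns/node-1/tcp/7676"}
	leader := leaderFactory(follower)
	fmt.Println(leader.Aggregator, leader.Peers == "")
}
```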
node/full.go (Outdated)
```go
func initRaftNode(nodeConfig config.Config, logger zerolog.Logger) (*raftpkg.Node, error) {
	raftDir := nodeConfig.Raft.RaftDir
	if raftDir == "" {
		raftDir = filepath.Join(nodeConfig.RootDir, "raft")
```
nit: we should be using DefaultConfig() value if empty.
```go
	return fmt.Errorf("not leader")
}

data, err := json.Marshal(state) // todo: use protobuf
```
why the todo? size?
We should migrate to protobuf here. JSON will cause overhead, and at 100ms we need to minimise it as much as possible.
* main:
  * chore: fix some comments (#2874)
  * chore: bump node in evm-single (#2875)
  * refactor(syncer,cache): use compare and swap loop and add comments (#2873)
  * refactor: use state da height as well (#2872)
  * refactor: retrieve highest da height in cache (#2870)
  * chore: change from event count to start and end height (#2871)
Overview: Speed up cache writes/loads via parallel execution. Pulled from #2836.
Overview: Minor updates to make it easier to trace errors. Extracted from #2836.
* main:
  * chore: remove extra github action yml file (#2882)
  * fix(execution/evm): verify payload status (#2863)
  * feat: fetch included da height from store (#2880)
  * chore: better output on errors (#2879)
  * refactor!: create da client and split cache interface (#2878)
  * chore!: rename `evm-single` and `grpc-single` (#2839)
  * build(deps): Bump golang.org/x/crypto from 0.42.0 to 0.45.0 in /tools/da-debug in the go_modules group across 1 directory (#2876)
  * chore: parallel cache de/serialization (#2868)
  * chore: bump blob size (#2877)
```go
// Propose block to raft to share state in the cluster
if e.raftNode != nil {
	headerBytes, err := header.MarshalBinary()
```
nit: in the flow of this function we decode this data once and encode it twice. I wonder if we can make it decode-only. This can be a follow-up, to not inflate this PR.
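One way to get to "decode only", sketched with invented types: keep the original wire bytes next to the decoded header and hand those to the raft proposal instead of re-marshalling:

```go
package main

import "fmt"

// header is a toy decoded type; the real one is the chain's signed header.
type header struct{ Height uint64 }

// decoded pairs the parsed header with the original wire bytes, so a
// later raft proposal can reuse raw instead of calling MarshalBinary
// again. Names are hypothetical.
type decoded struct {
	hdr header
	raw []byte
}

// propose reuses the bytes we already have rather than re-encoding d.hdr.
func propose(d decoded) []byte {
	return d.raw
}

func main() {
	wire := []byte{0x0A, 0x02, 0x08, 0x2A} // pretend wire encoding
	d := decoded{hdr: header{Height: 42}, raw: wire}
	fmt.Println(len(propose(d)))
}
```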
```go
	database ds.Batching,
	logger zerolog.Logger,
) (ln *LightNode, err error) {
	p2pClient, err := p2p.NewClient(conf.P2P, nodeKey.PrivKey, database, genesis.ChainID, logger, nil)
```
What is the reasoning behind moving this from the composing part to the constructor?
There is no strong reason for this other than making it consistent with the full node constructor.
With full nodes, the p2p client is set up in failover.go so that it can be reset when the sync node becomes leader and peers must be empty.
tac0turtle left a comment:
Overall looks good to me. What sort of latency does the node have for the switch?
* main:
  * build(deps): Bump mdast-util-to-hast from 13.2.0 to 13.2.1 in /docs in the npm_and_yarn group across 1 directory (#2900)
  * refactor(block): centralize timeout in client (#2903)
  * build(deps): Bump the all-go group across 2 directories with 3 updates (#2898)
  * chore: bump default timeout (#2902)
  * fix: revert default db (#2897)
  * refactor: remove obsolete // +build tag (#2899)
  * fix: da visualiser namespace (#2895)
  * refactor: omit unnecessary reassignment (#2892)
  * build(deps): Bump the all-go group across 5 directories with 6 updates (#2881)
  * chore: fix inconsistent method name in retryWithBackoffOnPayloadStatus comment (#2889)
  * fix: ensure consistent network ID usage in P2P subscriber (#2884)
  * build(deps): Bump golangci/golangci-lint-action from 9.0.0 to 9.1.0 (#2885)
  * build(deps): Bump actions/checkout from 5 to 6 (#2886)
Implement failover via RAFT