feat: forced inclusion #2797
Conversation
The latest Buf updates on your PR. Results from workflow CI and Release / buf-check (pull_request).
Claude finished @julienrbrt's task — View job

**Comprehensive Code Review Complete**

I've completed a thorough review of the forced inclusion mechanism implementation. This is an impressive feature with strong architectural decisions. Below are my findings, organized by focus area.

## Executive Summary

**Overall Assessment:** This is a well-designed implementation of a critical censorship-resistance feature. The code demonstrates good Go practices, thoughtful error handling, and comprehensive testing. However, several areas require attention before production deployment, particularly around concurrency safety, resource management, and edge-case handling.

Key Strengths:
Critical Issues: 2 high-priority, 5 medium-priority

## 1. Code Quality & Architecture ⭐⭐⭐⭐

**Strengths**
**Issues**

🔴 CRITICAL: Non-deterministic timestamp in based sequencer

Location:

```go
Timestamp: time.Time{}, // TODO(@julienrbrt): we need to use DA block timestamp for determinism
```

Impact: This breaks consensus determinism; all nodes must produce identical blocks.

Recommendation:

```go
// Use DA block timestamp from the forced inclusion event
Timestamp: forcedTxsEvent.Timestamp, // Fetch from DA block metadata
```

This is already noted as a TODO but is blocking for production use. The DA layer should provide block timestamps.

🟡 MEDIUM: Memory leak risk in based sequencer txQueue

Location:

```go
txQueue [][]byte // Unbounded slice
```

Issue: The `txQueue` slice is unbounded and can grow without limit under sustained load.

Recommendation: Add a maximum queue size with metrics:

```go
const MaxTxQueueSize = 10000
if len(s.txQueue) > MaxTxQueueSize {
s.logger.Warn().Msg("Transaction queue full, dropping oldest transactions")
s.txQueue = s.txQueue[len(s.txQueue)-MaxTxQueueSize:]
}
```

🟡 MEDIUM: pendingForcedInclusionTxs not persisted

Issue: The in-memory pending queue is lost on restart. This could cause valid forced inclusion transactions to be forgotten.

Recommendation:

```go
// Add persistence layer
func (s *Sequencer) persistPendingTxs() error {
data, err := json.Marshal(s.pendingForcedInclusionTxs)
if err != nil {
return err
}
return s.db.Put(pendingTxsKey, data)
}
```

🟢 MINOR: Epoch calculation could be optimized

Location:

```go
epochStart, epochEnd, currentEpochNumber := types.CalculateEpochBoundaries(daHeight, r.genesis.DAStartHeight, r.daEpochSize)
```

Issue: The epoch boundaries are recalculated here even though the epoch number was already derived earlier.

Recommendation: Return the epoch number from line 54 if needed elsewhere, or accept that this minor recalculation is acceptable for code clarity.

## 2. Error Handling & Resilience ⭐⭐⭐⭐

**Strengths**
**Issues**

🟡 MEDIUM: Error handling inconsistency in forced inclusion retrieval

Location:

```go
// any error during process, need to retry at next call
if processErrs != nil {
r.logger.Warn().
// ...
Msg("Failed to retrieve DA epoch.. retrying next iteration")
return &ForcedInclusionEvent{
StartDaHeight: daHeight,
EndDaHeight: daHeight,
Txs: [][]byte{},
}, nil // Returns nil error even though processErrs != nil
}
```

Issue: This swallows errors silently; the caller cannot distinguish between "no forced txs" and "retrieval failed".

Recommendation: Return the error, or add a status field to `ForcedInclusionEvent`:

```go
type ForcedInclusionEvent struct {
StartDaHeight uint64
EndDaHeight uint64
Txs [][]byte
Partial bool // Indicates incomplete retrieval
}
```

🟢 MINOR: Context cancellation not checked in loops

Location:

```go
for epochHeight := epochStart + 1; epochHeight < epochEnd; epochHeight++ {
result := r.client.RetrieveForcedInclusion(ctx, epochHeight)
// ...
}
```

Recommendation: Add a context check for long-running epoch scans:

```go
for epochHeight := epochStart + 1; epochHeight < epochEnd; epochHeight++ {
select {
case <-ctx.Done():
return nil, ctx.Err()
default:
}
// ...
}
```

## 3. Security ⭐⭐⭐⭐⭐

**Strengths**
**Issues**

🟢 MINOR: No rate limiting on forced inclusion namespace

Location: Architecture level

Issue: While the DA layer has native spam protection, there's no application-level rate limiting or prioritization for forced inclusion transactions.

Recommendation: Consider a future enhancement that adds application-level rate limiting and prioritization.
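As a rough illustration, a minimal sketch of such a guard using `golang.org/x/time/rate`; the type, rates, and where it would hook in are assumptions for illustration, not part of this PR:

```go
package forcedinclusion

import "golang.org/x/time/rate"

// txRateLimiter is a hypothetical admission guard in front of the forced
// inclusion queue: at most perSecond txs on average, with bursts of burst.
type txRateLimiter struct {
	limiter *rate.Limiter
}

func newTxRateLimiter(perSecond float64, burst int) *txRateLimiter {
	return &txRateLimiter{limiter: rate.NewLimiter(rate.Limit(perSecond), burst)}
}

// admit reports whether a tx may be enqueued now; callers would defer or
// drop the tx when it returns false.
func (l *txRateLimiter) admit() bool {
	return l.limiter.Allow()
}
```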
This is not blocking for initial release but should be documented as a known limitation.

🟢 MINOR: Hash collision risk (theoretical)

Location:

```go
func hashTx(tx []byte) string {
hash := sha256.Sum256(tx)
return hex.EncodeToString(hash[:])
}
```

Issue: Using SHA-256 is fine for transaction hashing, but the implementation doesn't handle the theoretical risk of a collision in the deduplication logic.

Recommendation: The current implementation is acceptable; SHA-256 collision resistance is sufficient for this use case. Document that this is by design.

## 4. Performance & Resource Efficiency ⭐⭐⭐⭐

**Strengths**
**Issues**

🟡 MEDIUM: Sequential DA fetching within epochs

Location:

```go
for epochHeight := epochStart + 1; epochHeight < epochEnd; epochHeight++ {
result := r.client.RetrieveForcedInclusion(ctx, epochHeight)
err = r.processForcedInclusionBlobs(event, result, epochHeight)
processErrs = errors.Join(processErrs, err)
}
```

Issue: DA fetches are sequential, not parallel. For large epochs, this adds latency.

Recommendation: Use `errgroup` for parallel fetching:

```go
g, ctx := errgroup.WithContext(ctx)
results := make([]coreda.ResultRetrieve, epochEnd-epochStart+1)
for i, height := range heightRange {
i, height := i, height // Capture loop variables
g.Go(func() error {
results[i] = r.client.RetrieveForcedInclusion(ctx, height)
return nil
})
}
g.Wait()
```

Note: This is mentioned as a follow-up in PR comments (#2842), which is good planning.

🟢 MINOR: Redundant DA height loading

Location:

```go
s.daRetrieverHeight.Store(max(s.genesis.DAStartHeight, s.cache.DaHeight(), state.DAHeight, s.getHighestStoredDAHeight()))
```

Issue: This maximum over four sources is recomputed on every load, although it should only be needed at startup.

Recommendation: Cache the result or only compute when necessary. Consider adding a flag to skip this check after initial startup.

## 5. Testing & Reliability ⭐⭐⭐⭐

**Strengths**
**Issues**

🟡 MEDIUM: Missing edge case tests

Location: Test coverage gaps

Missing test scenarios:
Recommendation: Add these test cases before production deployment:

```go
func TestSequencer_EpochBoundaryTransition(t *testing.T) {
// Test when GetNextBatch is called at exact epoch boundary
}
func TestSequencer_ConcurrentDAHeightAccess(t *testing.T) {
// Use race detector to verify atomic access patterns
}
```

🟢 MINOR: Test cleanup

Location: Various test files

Some tests don't properly clean up resources (contexts, goroutines). Run tests with the race detector and goroutine-leak checks to catch these.

## 6. Documentation & Developer Experience ⭐⭐⭐⭐⭐

**Strengths**
**Issues**

🟢 MINOR: Missing operational guidance

Location: Documentation

Recommendation: Add an operational runbook covering:

Example addition to docs:

```markdown
## Operational Monitoring
### Metrics to Monitor
- `forced_inclusion_pending_count`: Should stay < 100 under normal operation
- `forced_inclusion_epoch_size_bytes`: Should not consistently exceed 50% of MaxBytes
- `forced_inclusion_retrieval_errors`: Should be 0 under normal operation
### Alert Thresholds
- WARN: pending_count > 100 for > 5 minutes
- CRITICAL: pending_count > 1000 or retrieval_errors > 10/hour
```

## 7. Concurrency & Race Conditions ⭐⭐⭐⭐

**Strengths**
**Issues**

🟡 MEDIUM: Potential race in pendingForcedInclusionTxs (single sequencer)

Location:

```go
type Sequencer struct {
// ...
pendingForcedInclusionTxs []pendingForcedInclusionTx // Not protected by mutex
}
func (s *Sequencer) processForcedInclusionTxs(...) {
// Reads and writes to pendingForcedInclusionTxs without synchronization
c.pendingForcedInclusionTxs = newPendingTxs
}
```

Issue: If batch production and DA retrieval run on different goroutines, reads and writes to `pendingForcedInclusionTxs` race.

Recommendation: Either:

1. Document the assumption that these paths run on a single goroutine, or
2. Protect the slice with a mutex:

```go
type Sequencer struct {
type Sequencer struct {
mu sync.Mutex
pendingForcedInclusionTxs []pendingForcedInclusionTx
}
```

Based on the architecture, option 1 is likely sufficient, but it should be documented.

## 8. Configuration & Deployment ⭐⭐⭐⭐

**Strengths**
**Issues**

🟢 MINOR: No validation for epoch size = 0

Location:

```go
if daEpochSize == 0 {
return 1 // All blocks in epoch 1
}
```

Issue: Epoch size of 0 is a special case but isn't validated at config parsing time, which could lead to confusion.

Recommendation: Add validation in genesis loading:

```go
func (g Genesis) Validate() error {
if g.DAEpochForcedInclusion == 0 {
return errors.New("DAEpochForcedInclusion must be >= 1")
}
// ...
}
```

## Summary of Recommendations

**Must Fix Before Production**
**Should Fix Soon**

**Nice to Have**
## Conclusion

This is a well-engineered feature that demonstrates strong architectural thinking and good Go practices. The forced inclusion mechanism provides genuine censorship resistance while maintaining performance through clever epoch-based fetching. The implementation is nearly production-ready, with a few important fixes needed:

Once these issues are addressed and the follow-up tasks completed (async fetching, smoothing refinements), this will be a robust and valuable feature for the ev-node framework.

Recommendation: Address the critical timestamp issue and add persistence for pending transactions, then merge. The remaining improvements can be handled in follow-up PRs, as already planned by the team.

Great work on this complex feature! 👏
alpe
left a comment
Very nice start!
Can you elaborate on why you decided on a separate namespace for the forced inclusion TXs? The consumers have to read both namespaces anyway to stay up to date.
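For context, a hedged two-line sketch of the consumer's view; `Retrieve` for the batch namespace is an assumed API, while `RetrieveForcedInclusion` is the helper quoted later in this thread:

```go
// A syncing node scans both namespaces at each DA height.
batches := r.client.Retrieve(ctx, daHeight)               // regular batch namespace (assumed API)
forced := r.client.RetrieveForcedInclusion(ctx, daHeight) // forced inclusion namespace
```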
```go
event.StartDaHeight = epochHeight
event.Txs = append(event.Txs, result.Data...)
}
```
We need to prepare for malicious content. Let's exit the loop early when a tx size threshold is reached. This can be a multiple of common.DefaultMaxBlobSize used by the executor.
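A sketch of that early exit, under the stated assumption that the threshold is a multiple of `common.DefaultMaxBlobSize`; the 4x factor and variable names are illustrative:

```go
// Illustrative threshold: a multiple of the executor's default max blob size.
maxEpochBytes := uint64(4 * common.DefaultMaxBlobSize)

var totalBytes uint64
for epochHeight := epochStart + 1; epochHeight < epochEnd; epochHeight++ {
	result := r.client.RetrieveForcedInclusion(ctx, epochHeight)
	for _, blob := range result.Data {
		totalBytes += uint64(len(blob))
	}
	if totalBytes > maxEpochBytes {
		break // stop early on oversized/malicious content; retry the rest later
	}
	event.Txs = append(event.Txs, result.Data...)
}
```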
Makes sense for the height check, yes! However, I was thinking of doing no other checks and letting the execution client deal with gibberish data (this is why I added that as a requirement in the execution interface description).
If we want to keep raw TX data in the namespace, there is not much we can do here to validate, indeed. A size check is an easy win but more would require extending the executor interface for a checkTX.
I agree, and this may actually be required to avoid congestion issues and losing txs.
This was a suggestion. Personally I think it makes sense, as we are filtering what comes up in that namespace at the fetching level, directly in ev-node. What is posted in the forced inclusion namespace is handled directly by the execution client; ev-node only passes down bytes.
**Codecov Report**

❌ Patch coverage is

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main    #2797      +/-   ##
==========================================
+ Coverage   64.53%   65.43%   +0.90%
==========================================
  Files          81       85       +4
  Lines        7370     7838     +468
==========================================
+ Hits         4756     5129     +373
- Misses       2072     2151      +79
- Partials      542      558      +16
```

Flags with carried forward coverage won't be shown.

☔ View full report in Codecov by Sentry.
The latest Buf updates on your PR. Results from workflow CI / buf-check (pull_request).
List of improvements to do in follow-ups:
We discussed the above in the standup (#2797 (comment)), and a few ideas came up. 1–2: when making the call async, we need to make sure the executor and full node stay in sync within an epoch. This can be done easily by making an epoch a few blocks behind the actual DA height.
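A small sketch of that idea, with an assumed `epochLag` constant (the value is illustrative): nodes target an epoch slightly behind the DA tip so everyone fetches already-final heights.

```go
// Hypothetical lag: trail the DA tip so the executor and full nodes agree
// on a fully available epoch regardless of when they fetch.
const epochLag = 3 // illustrative value

func safeEpochTarget(daTipHeight, daStartHeight uint64) uint64 {
	if daTipHeight < daStartHeight+epochLag {
		return daStartHeight
	}
	return daTipHeight - epochLag
}
```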
alpe
left a comment
Thanks for answering all my questions and comments.
There is still the TODO in the code to store unprocessed direct TXs when the max block size is reached.
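For illustration, a minimal sketch of that carry-over; `maxBlockBytes`, `forcedTxs`, and `carryOver` are hypothetical names, not from the PR. Oversized txs are deferred in order rather than dropped:

```go
// Hypothetical carry-over: include forced txs until the block budget is
// exhausted, then keep the rest (in order) for the next block.
remaining := maxBlockBytes
included := make([][]byte, 0, len(forcedTxs))
var carryOver [][]byte
for i, tx := range forcedTxs {
	if uint64(len(tx)) > remaining {
		carryOver = forcedTxs[i:] // preserve ordering; retry in the next block
		break
	}
	remaining -= uint64(len(tx))
	included = append(included, tx)
}
// carryOver would then be queued/persisted (cf. pendingForcedInclusionTxs above).
```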
We decided to remove the sequencer go.mod, as ev-node can directly provide the sequencer implementation (sequencers/single was already depending on ev-node anyway). This means no go.mod needs to be added for the new based sequencers in #2797.
Once this PR is merged, we should directly after:
In the meantime, I have disabled the feature so it can be merged (0d790ef).
FYI the upgrade test will fail until tastora is updated.
sequencers/based/sequencer.go
Outdated
```go
return &coresequencer.GetNextBatchResponse{
	Batch:     batch,
	Timestamp: time.Now(),
```
This is not deterministic for all nodes
This isn't really an issue, as every node is the sequencer.
This timestamp is used for the headerTime of the next block, which will lead to a different hash for the block. The other thing is that app logic on the chain may use this value in its decision tree or store it. State could diverge across nodes, which makes it hard to recover later.
I see. Then we need to use the time of the DA block, as the block production time of a based sequencer can never be in sync across all nodes.
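A sketch of the agreed direction, assuming the forced inclusion event exposes the DA block's timestamp (mirroring the `forcedTxsEvent.Timestamp` recommendation in the review above):

```go
return &coresequencer.GetNextBatchResponse{
	Batch: batch,
	// Derive the header time from the DA block, not local wall-clock time,
	// so every based-sequencer node computes the same block hash.
	Timestamp: forcedTxsEvent.Timestamp, // assumption: DA block time on the event
}, nil
```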
Some of the changes were going to be tackled as follow-ups (the congestion issue, async fetching, commands), as it was getting hard to review this. This is why the feature cannot be enabled yet. There's still code missing in the execution client as well to get it all working. I'll check the other comments.
To recap everything that needs to happen in follow-ups:
Most of them are small and contained.
alpe
left a comment
Thanks for your comments and follow up task list.
Let's bring this into main when CI is happy again
ref: #1914
A choice has been made to put this logic in the executor and avoid extending the reaper and the sequencer.
This is because updating the reaper means passing down the last fetched DA height across all components.
It adds a lot of complexity otherwise. Adding it in the sequencer may be preferable, but this makes the inclusion in a sync node less straightforward. This is what is being investigated.
Compared to the previous implementation, a forced transaction does not have any structure. It should be in the raw format of the execution client. This keeps ev-node knowing nothing about the transaction: no signature checks, no validation of correctness. The execution client must make sure to reject gibberish transactions.
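To illustrate the contract described above — ev-node passes raw bytes, the execution client filters — a hedged sketch with hypothetical helpers; `decodeTx` and `ValidateBasic` are not from the PR:

```go
// Execution-client-side handling (hypothetical). Forced-inclusion txs arrive
// as opaque bytes; anything that fails to decode or pass stateless checks is
// dropped rather than failing the whole block.
func filterForcedTxs(rawTxs [][]byte) [][]byte {
	valid := make([][]byte, 0, len(rawTxs))
	for _, raw := range rawTxs {
		tx, err := decodeTx(raw) // hypothetical decoder for the chain's tx format
		if err != nil {
			continue // gibberish data: reject silently
		}
		if err := tx.ValidateBasic(); err != nil { // hypothetical stateless checks
			continue
		}
		valid = append(valid, raw)
	}
	return valid
}
```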
---- for later, won't be included in this pr (ref #2797 (comment))