perf(l1): optimize storage layer for block execution #6207
Conversation
Hold the `RwLockReadGuard` across read transactions instead of re-acquiring it per `get()`. Also replace `Mutex` with `RwLock` for `trie_cache` and `last_computed_flatkeyvalue` to allow concurrent readers. Combined, these changes reduce block execution time by ~9.3% (57ms → 52ms) on block 24443168 (442 txs, 37.5M gas), benchmarked on a 32-core AMD server.
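For context, a minimal standalone sketch (not code from this PR) of why the `Mutex` → `RwLock` switch helps a read-heavy path: many readers can hold the lock at once, while the rare writer still gets exclusive access:

```rust
use std::sync::{Arc, RwLock};
use std::thread;

fn main() {
    // Stand-in for a read-heavy cache like trie_cache.
    let cache = Arc::new(RwLock::new(vec![0u64; 1024]));

    // Several readers proceed concurrently under read() guards.
    let readers: Vec<_> = (0..8)
        .map(|_| {
            let cache = Arc::clone(&cache);
            thread::spawn(move || cache.read().unwrap().iter().sum::<u64>())
        })
        .collect();
    for r in readers {
        r.join().unwrap();
    }

    // The background worker takes the write lock only a few times per block.
    cache.write().unwrap()[0] = 42;
}
```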
Use a pre-acquired read view in `BackendTrieDB` for the entire trie traversal. `InMemoryBackend` now wraps its `Database` in `Arc`, so `begin_read()` clones the `Arc` (O(1)) and releases the lock immediately. `InMemoryReadTx` and `InMemoryLocked` hold an owned `Arc<Database>` snapshot — all subsequent gets are lock-free `HashMap` lookups with no `RwLock` contention. `BackendTrieDB` now acquires a single read view in its constructor and reuses it for all `get()` calls during the trie traversal. This eliminates the per-node-lookup `Box` allocation and lock acquisition that previously happened ~8000+ times per block. `StorageReadView` gains `Send + Sync` bounds and `begin_read()` returns a `'static` view, enabling `BackendTrieDB` to own the read view.
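A minimal sketch of the snapshot-based read path described above, using plain `std` types (`HashMap` in place of `FxHashMap`) and simplified struct definitions; the real implementation goes through the `StorageReadView` trait:

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

type Table = HashMap<Vec<u8>, Vec<u8>>;
type Database = HashMap<&'static str, Table>;

struct InMemoryBackend {
    // Outer RwLock guards the pointer swap; the inner Arc is an immutable snapshot.
    inner: Arc<RwLock<Arc<Database>>>,
}

// Owned snapshot: all reads are plain HashMap lookups, no lock held.
struct InMemoryReadTx {
    snapshot: Arc<Database>,
}

impl InMemoryBackend {
    fn begin_read(&self) -> InMemoryReadTx {
        // Clone the inner Arc (an atomic refcount bump) and drop the guard immediately.
        let snapshot = self.inner.read().unwrap().clone();
        InMemoryReadTx { snapshot }
    }
}

impl InMemoryReadTx {
    fn get(&self, table: &'static str, key: &[u8]) -> Option<Vec<u8>> {
        // Lock-free lookup against the owned snapshot.
        self.snapshot.get(table)?.get(key).cloned()
    }
}
```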
Change begin_read() to return Arc<dyn StorageReadView> instead of Box, allowing the read view to be shared across multiple BackendTrieDB instances. In get_storage_at_root (the SLOAD hot path), pre-acquire the read view, trie cache, and last_written once and share them between the state trie and storage trie opens. This eliminates per-query duplicate RwLock acquisitions and Arc allocations.
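Roughly, the sharing pattern looks like the following sketch (the trait and function bodies here are simplified assumptions, not the actual API):

```rust
use std::sync::Arc;

// Hypothetical minimal trait standing in for StorageReadView (the real trait has more methods).
trait StorageReadView: Send + Sync {
    fn get(&self, table: &'static str, key: &[u8]) -> Option<Vec<u8>>;
}

struct BackendTrieDB {
    // One pre-acquired view, reused for every node lookup during the traversal.
    read_view: Arc<dyn StorageReadView>,
}

fn get_storage_at_root(read_view: Arc<dyn StorageReadView>) {
    // Acquire once, then share the same snapshot between the state trie and the storage trie.
    let state_trie_db = BackendTrieDB { read_view: read_view.clone() };
    let storage_trie_db = BackendTrieDB { read_view };
    // ... open the two tries with state_trie_db / storage_trie_db ...
    let _ = (state_trie_db, storage_trie_db);
}
```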
apply_prefix was called on every trie node lookup, creating 3 Vec allocations each time (from_bytes + append_new + concat). Pre-computing the prefix nibbles once in the TrieWrapper constructor reduces this to 1 allocation (just concat) per lookup. Also adds TrieWrapper::new() constructor to encapsulate this.
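A rough illustration of the change, with a stand-in `Nibbles` type since the real one lives in the trie crate (the method names mirror those mentioned above, but the bodies and the `[u8; 32]` prefix type are assumptions):

```rust
/// Stand-in for the trie crate's Nibbles type, just enough to show the allocation pattern.
struct Nibbles(Vec<u8>);

impl Nibbles {
    fn from_bytes(b: &[u8]) -> Self {
        Nibbles(b.iter().flat_map(|&x| [x >> 4, x & 0x0f]).collect())
    }
    fn append_new(mut self, n: u8) -> Self {
        self.0.push(n);
        self
    }
    fn concat(&self, other: &Nibbles) -> Nibbles {
        let mut v = self.0.clone();
        v.extend_from_slice(&other.0);
        Nibbles(v)
    }
}

struct TrieWrapper {
    // Computed once in the constructor instead of on every lookup.
    prefix_nibbles: Option<Nibbles>,
}

impl TrieWrapper {
    fn new(prefix: Option<[u8; 32]>) -> Self {
        let prefix_nibbles = prefix.map(|p| Nibbles::from_bytes(&p).append_new(17));
        TrieWrapper { prefix_nibbles }
    }

    fn make_key(&self, key: Nibbles) -> Nibbles {
        match &self.prefix_nibbles {
            Some(prefix) => prefix.concat(&key), // single allocation per lookup
            None => key,
        }
    }
}
```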
🤖 Kimi Code Review

Review Summary

This PR introduces significant performance optimizations to the storage layer, primarily through:
Issues Found

1. Critical: Atomicity Violation in InMemoryBackend (in_memory.rs:33-34, 153-154, 168-169)

The use of `Arc::make_mut` in the write path:

```rust
// Problem: Each write creates a new Arc, breaking atomicity
let db_mut = Arc::make_mut(&mut *db);
```

Fix: Use …

2. Performance: Unnecessary Arc Cloning (in_memory.rs:44-48)

The in-memory backend clones the entire database on every read transaction, which is O(n) complexity. This defeats the purpose of the optimization:

```rust
let snapshot = self.inner.read()?.clone(); // Clones entire database
```

Fix: Return a reference-based read view that holds the RwLock guard, or use a proper MVCC implementation.

3. Memory Safety: Lifetime Mismatch (api/mod.rs:37)

The trait signature change from `Box<dyn StorageReadView>` to `Arc<dyn StorageReadView>` …

4. Code Duplication (store.rs:2422-2599)

Multiple nearly identical methods …

5. Error Handling Inconsistency (layering.rs:195-210)

The …

Recommendations
The RocksDB backend changes look correct as RocksDB handles its own MVCC.

Automated review by Kimi (Moonshot AI) · custom prompt
🤖 Codex Code Review

Findings
Notes / Questions
Positive
If you want, I can scan for any callers that depend on deterministic prefix iteration order to confirm whether Point 1 is a real behavioral change.

Automated review by OpenAI Codex · custom prompt
🤖 Claude Code Review

Now I have enough context to provide a comprehensive code review. Let me compile my findings:

PR Review: Performance Optimization of Storage Layer (-15.5%)

Summary

This PR implements four incremental optimizations to the storage hot path, targeting lock contention and allocation overhead identified through profiling. The changes are well-motivated and demonstrate clear performance wins (15.5% speedup, 18.2% throughput increase).

Critical Issues

1. Type inconsistency in InMemoryBackend (crates/storage/backend/in_memory.rs:11)

The outer map type still uses the standard library `HashMap`:

```rust
type Table = FxHashMap<Vec<u8>, Vec<u8>>;
type Database = HashMap<&'static str, Table>; // ← Standard HashMap, not FxHashMap
```

Issue: The outer `Database` alias is inconsistent with the `FxHashMap` switch. Suggested fix:

```rust
type Database = FxHashMap<&'static str, Table>;
```

Impact: Minor performance regression for table name lookups, though table count is small so impact is minimal in practice.

2. Potential Arc clone overhead in get_storage_at_root (crates/storage/store.rs:2104-2106, 2118-2120)

The code clones:

```rust
read_view.clone(),
cache.clone(),
last_written.clone(), // ← This is Vec<u8>, not Arc!
```

Issue: `last_written` is a `Vec<u8>`, so this clone copies the bytes rather than bumping a reference count.
Recommendation: Change `last_written` to be shared behind an `Arc` (or passed by reference) so the clone stays cheap.
Impact: Moderate - each `get_storage_at_root` call performs an extra byte-vector copy.

3. Missing Send + Sync bounds verification (crates/storage/api/mod.rs:58)

The trait now requires:

```rust
pub trait StorageReadView: Send + Sync {
```

Verification needed: Ensure all concrete implementations (e.g. `InMemoryReadTx`, `InMemoryLocked`, the RocksDB read view) actually satisfy `Send + Sync`.
Status: Appears correct, but worth confirming with a compile-time assertion or test if not already verified.

Performance Concerns

4. Arc::make_mut potential COW overhead (crates/storage/backend/in_memory.rs:33, 153, 169)

The RCU pattern uses:

```rust
let db_mut = Arc::make_mut(&mut *db);
```

Concern: If any reader still holds a reference to the inner `Arc<Database>`, `Arc::make_mut` will deep-clone the entire database before the write can proceed.
Analysis: The design assumes writes are infrequent (background worker only, 2-3 times per block). Readers clone the inner Arc and release the RwLock immediately, so the window for contention is small. However, if a reader holds the snapshot for a long time while a write occurs, the entire DB gets cloned.
Recommendation: Add a comment documenting this behavior and the assumption that reads are short-lived. Consider profiling to verify COW isn't triggered frequently in practice.

5. Prefix nibble computation still allocates (crates/storage/layering.rs:207-208, 233-236, 241-244)

The optimization pre-computes prefix nibbles, but the application still allocates:

```rust
let key = match &self.prefix_nibbles {
    Some(prefix) => prefix.concat(&key), // ← Still allocates a new Nibbles
    None => key,
};
```

Issue: Each prefixed lookup still performs one allocation for the concatenated key.
Potential optimization: If the trie lookup API could take the prefix and key separately (or iterate over nibbles without materializing them), the remaining `concat` allocation could be avoided.
Impact: Lower priority - already reduced by 67%, further optimization requires trie library changes.

Code Quality Issues

6. Redundant field in BackendTrieDB (crates/storage/trie.rs:14-15)

The struct now holds both the backend and the read view:

```rust
/// Reference to the storage backend (used only for writes)
db: Arc<dyn StorageBackend>,
/// Pre-acquired read view held for the lifetime of this struct.
read_view: Arc<dyn StorageReadView>,
```

Question: Is the `db` field still needed now that reads go through `read_view`?
Looking at usage: the doc comment indicates it is retained for the write path only.

7. Removed apply_prefix usage not fully deleted (crates/storage/trie.rs:6, 111)

The import and function are still present but largely unused:

```rust
use crate::layering::apply_prefix; // Line 6
// ...
fn make_key(&self, path: Nibbles) -> Vec<u8> {
    apply_prefix(self.address_prefix, path).into_vec() // Line 111
}
```
Correctness

8. RwLock poisoning behavior (crates/storage/store.rs:2093-2094)

Lock errors map to `StoreError::LockError`:

```rust
.read()
    .map_err(|_| StoreError::LockError)?
```

Question: What happens if a thread panics while holding a write lock? The RwLock will be poisoned, and all subsequent operations will fail. Is this the desired behavior, or should poisoned locks be recovered?
Analysis: For blockchain storage, fail-fast on poisoned locks is likely correct - if a storage write panicked, the database is potentially in an inconsistent state. The current behavior seems appropriate.

9. Trie cache RCU consistency (crates/storage/store.rs:2759-2760)

During trie updates, the cache is updated with RCU:

```rust
let trie = Arc::new(trie_mut);
*trie_cache.write().map_err(|_| StoreError::LockError)? = trie.clone();
```

Observation: After the write lock is released, readers will see the new cache state. There's a small window where readers might miss cache entries if they grabbed the old cache before the update but perform lookups after the database is updated.
Analysis: This is inherent to the RCU pattern and acceptable - readers will simply hit the database instead of the cache, which is correct (slower but not incorrect). The tradeoff is worth it for lock-free reads.

Minor Issues

10. Missing type documentation (crates/storage/backend/in_memory.rs:10-11)

The type aliases should document why FxHashMap is chosen:

```rust
// Using FxHashMap for O(1) lookups vs BTreeMap's O(log n).
// FxHash is faster than SipHash for small keys (byte vectors, table names).
type Table = FxHashMap<Vec<u8>, Vec<u8>>;
type Database = FxHashMap<&'static str, Table>; // Fixed: use FxHashMap here too
```

11. TrieWrapper::new could take ownership (crates/storage/layering.rs:201-215)

The constructor currently takes owned parameters:

```rust
pub fn new(
    state_root: H256,
    inner: Arc<TrieLayerCache>,
    db: Box<dyn TrieDB>,
    prefix: Option<H256>,
) -> Self {
    let prefix_nibbles = prefix.map(|p| Nibbles::from_bytes(p.as_bytes()).append_new(17));
    // ...
}
```
This is fine.

Positive Observations

Recommendations

Must fix:
Should fix:
Nice to have:
Future optimization:

Security Assessment

No security vulnerabilities identified. The changes maintain the same correctness guarantees as before:

Verdict

APPROVE with minor fixes requested (Item 1 and optionally Item 2). The performance gains are significant and well-justified. The implementation is sound with only minor consistency issues to address.

Automated review by Claude (Anthropic) · custom prompt
Greptile Overview

Greptile Summary

This PR implements four incremental storage layer optimizations that collectively achieve a 15.5% reduction in block execution time through targeted elimination of lock contention, allocation overhead, and algorithmic complexity.

Key Changes:
Performance Impact:
Confidence Score: 4/5
| Filename | Overview |
|---|---|
| crates/storage/backend/in_memory.rs | Implemented RCU pattern with Arc<RwLock<Arc<Database>>>, replaced BTreeMap with FxHashMap for O(1) lookups, snapshots now clone inner Arc for lock-free reads |
| crates/storage/layering.rs | Pre-computes prefix nibbles in TrieWrapper::new() constructor to avoid repeated allocations on every trie node lookup, reducing allocations from 3 to 1 per lookup |
| crates/storage/store.rs | Changed trie_cache and last_computed_flatkeyvalue from Mutex to RwLock, added shared read view pattern in get_storage_at_root, new *_with_view methods enable resource sharing across multiple trie opens |
| crates/storage/trie.rs | Added read_view field to BackendTrieDB to hold pre-acquired read view, eliminating per-lookup allocations, new *_with_view constructors support sharing a single read view across multiple trie instances |
Sequence Diagram
```mermaid
sequenceDiagram
participant App as Block Executor
participant Store as Store
participant Backend as StorageBackend
participant TrieDB as BackendTrieDB
participant Cache as TrieLayerCache
Note over App,Cache: Optimized get_storage_at_root flow
App->>Store: get_storage_at_root(state_root, address, storage_key)
Note over Store: Pre-acquire shared resources (optimization #2 & #3)
Store->>Backend: begin_read()
Backend-->>Store: Arc<StorageReadView> (cloneable snapshot)
Store->>Cache: trie_cache.read() (RwLock instead of Mutex)
Cache-->>Store: Arc<TrieLayerCache> clone
Store->>Store: last_written() (RwLock.read())
Note over Store: Open state trie with shared resources
Store->>TrieDB: new_for_accounts_with_view(backend, read_view, last_written)
Note over TrieDB: Holds read_view for entire traversal (no per-lookup allocation)
Store->>TrieDB: state_trie.get(account_hash)
Note over TrieDB: Pre-computed prefix nibbles (optimization #4)
TrieDB->>Cache: inner.get(state_root, prefixed_key)
Cache-->>TrieDB: cache hit/miss
alt cache miss
TrieDB->>Backend: read_view.get() (lock-free with RCU)
Backend-->>TrieDB: trie node data
end
TrieDB-->>Store: encoded_account
Note over Store: Open storage trie reusing same read_view
Store->>TrieDB: new_for_storages_with_view(backend, read_view.clone(), last_written)
Store->>TrieDB: storage_trie.get(hashed_key)
Note over TrieDB: Reuses held read_view (no new snapshot)
TrieDB->>Cache: inner.get(state_root, prefixed_key)
alt cache miss
TrieDB->>Backend: read_view.get() (same snapshot)
Backend-->>TrieDB: storage value
end
TrieDB-->>Store: storage_value
Store-->>App: U256 result
Note over App,Cache: Key optimizations:<br/>1. BTreeMap→FxHashMap (O(log n)→O(1))<br/>2. RCU pattern (lock-free reads)<br/>3. Shared read_view (no duplicate snapshots)<br/>4. Pre-computed prefix (1 alloc vs 3)
```
Last reviewed commit: d4555a6
crates/storage/backend/in_memory.rs:153

```rust
.map_err(|_| StoreError::Custom("Failed to acquire write lock".to_string()))?;
let table_ref = db.entry(table).or_insert_with(Table::new);
let db_mut = Arc::make_mut(&mut *db);
```
Arc::make_mut will clone the entire database if any read snapshots still hold references to the inner Arc<Database>. Under high concurrency (many concurrent get_storage_at_root calls holding read views), this could cause unexpected memory spikes and copy overhead on writes. Consider documenting this trade-off or monitoring memory usage in production.
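To make the copy-on-write concern concrete, here is a tiny standalone demonstration (not code from this PR) of `Arc::make_mut` cloning the underlying data when another reference is still alive:

```rust
use std::sync::Arc;

fn main() {
    let mut db = Arc::new(vec![1u8, 2, 3]);

    // A read snapshot holding a second strong reference.
    let snapshot = Arc::clone(&db);

    // Because snapshot is alive, make_mut deep-clones the Vec before mutating.
    Arc::make_mut(&mut db).push(4);

    assert_eq!(*snapshot, vec![1, 2, 3]); // the snapshot still sees the old data
    assert_eq!(*db, vec![1, 2, 3, 4]);    // the writer sees the new data
}
```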
crates/storage/store.rs:2104-2106

```rust
read_view.clone(),
cache.clone(),
last_written.clone(),
```
Good optimization - cloning Arc is cheap (atomic increment) while avoiding duplicate RwLock acquisitions and snapshot creation.
crates/storage/layering.rs:207-208

```rust
let prefix_nibbles =
    prefix.map(|p| Nibbles::from_bytes(p.as_bytes()).append_new(17));
```
Excellent optimization - pre-computing the prefix once eliminates 3 allocations (from_bytes + append_new + concat) on every get() call, keeping only the final concat.
Pull request overview
This PR optimizes the storage layer for block execution through four targeted performance improvements, achieving a 15.5% reduction in execution time. The optimizations target bottlenecks identified via profiling: data structure access patterns (BTreeMap→FxHashMap), lock contention (Mutex→RwLock), per-lookup allocations (RCU pattern + held read views), and repeated prefix computations.
Changes:
- Replaced `Mutex` with `RwLock` for read-heavy `trie_cache` and `last_computed_flatkeyvalue` to enable concurrent readers
- Implemented RCU pattern in `InMemoryBackend` with `Arc<RwLock<Arc<Database>>>` for lock-free reads via snapshots
- Modified storage API to return `Arc<dyn StorageReadView>` instead of `Box`, enabling cheap sharing of read views across multiple trie operations
- Added constructors for `BackendTrieDB` that accept pre-acquired shared read views to eliminate per-trie-open allocations
- Pre-computed prefix nibbles in `TrieWrapper` constructor to reduce allocations from 3 to 1 per trie node lookup
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| crates/storage/api/mod.rs | Changed StorageBackend::begin_read() to return Arc instead of Box, added Send + Sync bounds to StorageReadView trait |
| crates/storage/backend/rocksdb.rs | Updated begin_read() to return Arc<dyn StorageReadView> |
| crates/storage/backend/in_memory.rs | Implemented RCU pattern with Arc<RwLock<Arc<Database>>>, switched Table from BTreeMap to FxHashMap, updated begin_read() for snapshot-based lock-free reads |
| crates/storage/trie.rs | Added BackendTrieDB constructors with shared read view support (*_with_view variants), changed read view field to Arc for sharing |
| crates/storage/layering.rs | Added TrieWrapper::new() constructor with pre-computed prefix nibbles, replaced apply_prefix calls with direct concat operations |
| crates/storage/store.rs | Converted trie_cache and last_computed_flatkeyvalue from Mutex to RwLock, added *_shared trie opening methods, updated get_storage_at_root to pre-acquire and share resources, updated all TrieWrapper instantiations to use new constructor |
Motivation
Profiling block execution on mainnet revealed that lock contention, per-lookup allocations, and O(log n) data structure access in the storage layer accounted for a significant portion of execution time. The storage hot path (`get_storage_at_root` → trie lookups) was hitting multiple bottlenecks on every trie node access.

Description
Four incremental optimizations to the storage layer, each targeting a specific bottleneck identified via `perf` profiling on a 32-core AMD server:

1. BTreeMap → FxHashMap + Mutex → RwLock
- Replace `BTreeMap<Vec<u8>, Vec<u8>>` with `FxHashMap` for in-memory storage tables — O(1) lookups instead of O(log n)
- Replace `Mutex` with `RwLock` for `trie_cache` and `last_computed_flatkeyvalue` — allows concurrent readers (only the background worker writes, 2-3 times per block)

2. RCU pattern + held read view
- `InMemoryBackend`: wrap the database as `Arc<RwLock<Arc<Database>>>` so readers clone the inner `Arc` (O(1)) and read entirely lock-free
- `BackendTrieDB`: acquire the read view once at construction and reuse it for all trie node lookups, eliminating ~8000+ `Box` allocations per block

3. Shared read view across trie opens
- Change `begin_read()` to return `Arc<dyn StorageReadView>` instead of `Box<dyn StorageReadView>`, enabling cheap `Arc::clone()` sharing
- In `get_storage_at_root`, pre-acquire the read view, trie cache, and last_written value once, then share them across both the state trie and storage trie opens — eliminates duplicate `RwLock` acquisitions
- Add the `StorageReadView: Send + Sync` bound (required for `Arc` sharing)

4. Pre-compute prefix nibbles in TrieWrapper
- `apply_prefix` was called on every trie node lookup, creating 3 `Vec` allocations per call (`from_bytes` + `append_new` + `concat`)
- Pre-compute the prefix nibbles once in the `TrieWrapper` constructor, reducing to 1 allocation (just `concat`) per lookup
- Add a `TrieWrapper::new()` constructor to encapsulate prefix pre-computation

Benchmark Results
30 runs each, back-to-back on same server (32-core AMD, ethrex-office-4):
14-15% improvement on heavy blocks, 8% on light blocks. Lock contention dropped from 8.1% to 1.4% of block executor time.
How to Test
- `cargo test -p ethrex-storage`
- `cargo test -p ethrex-blockchain`
- `cargo clippy -p ethrex-storage -p ethrex-blockchain -- -D warnings`