
perf(storage): use two-level index with partitioned filters (RocksDB) #6196

Draft
azteca1998 wants to merge 1 commit into main from add-two-level-index

Conversation

@azteca1998
Contributor

Enable TwoLevelIndexSearch with partitioned filters on all column families. This splits large index and filter blocks into smaller 4 KB partitions with a top-level index, so only the partitions needed for a lookup are loaded into cache rather than the entire index block.
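
To make the lookup path concrete, here is a minimal std-only sketch of a two-level index. It is a toy model, not RocksDB's on-disk format: `TwoLevelIndex`, `Partition`, and the `(last_key, block_id)` layout are hypothetical names chosen for illustration. The point it shows is that a lookup searches the small top level first and then touches exactly one partition.

```rust
/// One index partition: sorted (last key of a data block, data block id) pairs.
#[derive(Clone)]
struct Partition {
    entries: Vec<(String, usize)>,
}

/// Toy two-level index. The top level maps the last key covered by each
/// partition to that partition's position. Illustrative layout only.
struct TwoLevelIndex {
    top_level: Vec<(String, usize)>,
    partitions: Vec<Partition>,
}

impl TwoLevelIndex {
    /// Build from sorted (last_key, block_id) entries, `per_partition` entries each.
    fn build(entries: Vec<(String, usize)>, per_partition: usize) -> Self {
        assert!(per_partition > 0, "partition size must be positive");
        let mut top_level = Vec::new();
        let mut partitions = Vec::new();
        for chunk in entries.chunks(per_partition) {
            // `chunks` never yields an empty slice, so `last()` is always Some.
            let last_key = chunk.last().unwrap().0.clone();
            top_level.push((last_key, partitions.len()));
            partitions.push(Partition { entries: chunk.to_vec() });
        }
        Self { top_level, partitions }
    }

    /// Returns (partition touched, data block id) for `key`, or None when
    /// the key sorts after the last indexed key.
    fn lookup(&self, key: &str) -> Option<(usize, usize)> {
        // Step 1: binary search the top-level index for the first partition
        // whose last key is >= `key`.
        let p = self.top_level.partition_point(|(last, _)| last.as_str() < key);
        let (_, part_id) = self.top_level.get(p)?;
        // Step 2: binary search inside that single partition only.
        let part = &self.partitions[*part_id];
        let e = part.entries.partition_point(|(last, _)| last.as_str() < key);
        part.entries.get(e).map(|(_, block)| (*part_id, *block))
    }
}
```

In this model every other partition stays untouched during a lookup, which corresponds to the partitions that never need to enter the block cache.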

Configuration applied to all CFs via a shared helper closure:

  • TwoLevelIndexSearch index type
  • Partitioned filters enabled
  • 4 KB metadata block size
  • Index/filter blocks cached in block cache (bounded memory)
  • Top-level index and L0 filter/index blocks pinned in cache

This significantly reduces memory overhead for large SST files (256-512 MB targets in this config) and improves block cache efficiency by avoiding cache pollution from oversized index blocks.
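
A back-of-the-envelope estimate helps put the memory claim in numbers. The sketch below is only arithmetic under stated assumptions: the data block size and the bytes per index entry are illustrative inputs, not values measured in this PR, and compression is ignored.

```rust
/// Rough index-size estimate for one SST file. All inputs are assumptions
/// chosen for illustration, not measurements.
struct IndexEstimate {
    monolithic_index_bytes: u64,
    partitions: u64,
    top_level_entries: u64,
}

/// Integer division rounding up.
fn ceil_div(a: u64, b: u64) -> u64 {
    assert!(b > 0, "divisor must be positive");
    (a + b - 1) / b
}

fn estimate_index(
    sst_bytes: u64,
    data_block_bytes: u64,
    index_entry_bytes: u64,
    metadata_block_bytes: u64,
) -> IndexEstimate {
    // One index entry per data block.
    let data_blocks = ceil_div(sst_bytes, data_block_bytes);
    let monolithic_index_bytes = data_blocks * index_entry_bytes;
    // The index is cut into partitions of roughly `metadata_block_bytes`,
    // and the top-level index holds one entry per partition.
    let partitions = ceil_div(monolithic_index_bytes, metadata_block_bytes);
    IndexEstimate {
        monolithic_index_bytes,
        partitions,
        top_level_entries: partitions,
    }
}
```

For a 256 MiB file with 16 KiB data blocks and 40-byte index entries (both assumed), this gives a 640 KiB monolithic index, or 160 partitions of 4 KiB. A point lookup then needs the 160-entry top level plus one 4 KiB partition instead of the whole 640 KiB block.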

No resync required — RocksDB applies the new index format to SST files produced by subsequent compactions.

Motivation

Description

Checklist

  • Updated STORE_SCHEMA_VERSION (crates/storage/lib.rs) if the PR includes breaking changes to the Store requiring a re-sync.

@github-actions bot added the performance label (Block execution throughput and performance in general) on Feb 12, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

This PR introduces partitioned index/filter blocks for RocksDB to improve performance with large datasets. The changes are well-structured and follow good practices.

Issues Found

1. Potential Performance Regression (Lines 87-92)
The set_partitioned_index function applies the same configuration to all column families, but some CFs may not benefit from partitioned indexes. Specifically:

  • Receipts and Transactions tables (lines 165-175, 177-187) use blob storage - partitioned indexes might add overhead for blob-index lookups
  • Headers table (lines 103-112) has large 32KB blocks which may not benefit from partitioning

2. Missing Configuration Validation
The PR doesn't validate if the hardware has sufficient memory for the additional caching enabled by:

  • set_cache_index_and_filter_blocks(true)
  • set_pin_top_level_index_and_filter(true)
  • set_pin_l0_filter_and_index_blocks_in_cache(true)

These settings increase memory usage and could cause OOM on memory-constrained systems.

3. Magic Numbers Without Documentation

  • Line 89: set_metadata_block_size(4096) - This 4KB metadata block size should be documented as it's critical for performance tuning
  • The choice of 4096 bytes isn't justified in the context of Ethereum's data patterns

Suggestions

  1. Conditional Partitioning: Consider applying partitioned indexes only to CFs that will clearly benefit:

    match table {
        "State" | "Storage" | "Trie" => set_partitioned_index(&mut block_opts),
        _ => {} // Skip for blob-based tables
    }
  2. Add Configuration Comments: Document the rationale for 4096 byte metadata blocks and the memory implications of the caching settings.

  3. Consider Making Configurable: Expose these settings via configuration rather than hardcoding, allowing operators to tune based on their hardware.
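
One possible shape for suggestion 3 is a small settings struct whose defaults mirror the values in this PR. This is a hedged sketch: `IndexTuning` and its field names are hypothetical and do not exist in the codebase. The validation rule reflects the RocksDB behaviour that the two pinning options only take effect when index and filter blocks are stored in the block cache.

```rust
/// Hypothetical operator-facing settings; defaults mirror this PR.
#[derive(Debug, Clone, PartialEq)]
struct IndexTuning {
    partitioned: bool,
    metadata_block_size: usize,
    cache_index_and_filter_blocks: bool,
    pin_top_level_index_and_filter: bool,
    pin_l0_filter_and_index_blocks: bool,
}

impl Default for IndexTuning {
    fn default() -> Self {
        Self {
            partitioned: true,
            metadata_block_size: 4096,
            cache_index_and_filter_blocks: true,
            pin_top_level_index_and_filter: true,
            pin_l0_filter_and_index_blocks: true,
        }
    }
}

impl IndexTuning {
    /// Reject combinations that cannot work or would silently do nothing.
    fn validate(&self) -> Result<(), String> {
        if self.partitioned && self.metadata_block_size == 0 {
            return Err("metadata_block_size must be non-zero".to_string());
        }
        let pins = self.pin_top_level_index_and_filter || self.pin_l0_filter_and_index_blocks;
        if pins && !self.cache_index_and_filter_blocks {
            // Pinning refers to blocks held in the block cache.
            return Err("pinning requires cache_index_and_filter_blocks".to_string());
        }
        Ok(())
    }
}
```

The validated struct would then feed the existing helper that sets the block-based table options, so the hardcoded values become the defaults rather than the only choice.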

The code is otherwise well-structured and follows Rust best practices. The use of a closure to avoid repetition is clean and maintainable.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Claude Code Review

Code Review: PR #6196 - Two-Level Index with Partitioned Filters

Summary

This PR adds two-level indexing with partitioned filters to RocksDB configuration across all column families. The implementation is straightforward and the optimization is well-motivated for large SST files in blockchain storage scenarios.

✅ Positive Aspects

  1. Good abstraction: The set_partitioned_index closure avoids code duplication across all 7 column family configurations.

  2. Conservative defaults: 4KB metadata block size is a sensible choice that balances partition granularity with overhead.

  3. Correct pinning strategy: Pinning top-level index and L0 filter/index blocks in cache is appropriate for frequently accessed metadata.

  4. No schema breaking changes: Correctly noted that no resync is required since RocksDB applies this to new SST files during compaction.

  5. Clear documentation: The inline comment at lines 87-89 explains the purpose well.

⚠️ Issues and Recommendations

Critical: Missing Block Cache Configuration

Severity: High

The configuration enables set_cache_index_and_filter_blocks(true) (line 94) but no block cache is configured anywhere in this file. This means:

  • Index and filter blocks will use an unbounded default cache
  • Memory usage is unpredictable and potentially unlimited
  • The main benefit of partitioned filters (bounded memory) is partially negated

Recommendation:

// Add before line 99 (before cf_descriptors loop)
let block_cache = rocksdb::Cache::new_lru_cache(512 * 1024 * 1024); // 512MB shared cache

// Then let the set_partitioned_index closure capture the cache and attach it:
let set_partitioned_index = |block_opts: &mut BlockBasedOptions| {
    block_opts.set_block_cache(&block_cache);
    block_opts.set_index_type(BlockBasedIndexType::TwoLevelIndexSearch);
    block_opts.set_partition_filters(true);
    block_opts.set_metadata_block_size(4096);
    block_opts.set_cache_index_and_filter_blocks(true);
    block_opts.set_pin_top_level_index_and_filter(true);
    block_opts.set_pin_l0_filter_and_index_blocks_in_cache(true);
};

Location: crates/storage/backend/rocksdb.rs:90-97

Performance: Metadata Block Size Trade-offs

Severity: Medium

The 4KB metadata_block_size (line 93) creates many small partitions. For Ethereum storage with large state tries:

  • Trie nodes CFs (lines 135-147, 148-160): These have 256MB target file sizes and frequent random access patterns. Larger partitions (8-16KB) might reduce top-level index overhead while still maintaining good granularity.

  • Sequential CFs like HEADERS/BODIES (lines 114-123): 4KB might be too small for sequential scans, causing unnecessary partition switches.

Recommendation: Consider tuning per column family:

block_opts.set_metadata_block_size(16 * 1024); // For trie nodes
block_opts.set_metadata_block_size(8 * 1024);  // For headers/bodies

Location: crates/storage/backend/rocksdb.rs:93

Minor: Inconsistent Configuration

Severity: Low

Some column families already have set_bloom_filter configured (lines 132, 145, 158) while others don't. With partitioned filters, bloom filters become even more effective. Consider adding bloom filters to HEADERS, BODIES, RECEIPTS, and ACCOUNT_CODES CFs for improved point lookup performance.

Location: crates/storage/backend/rocksdb.rs:119-122, 171-174, 181-184

Code Quality: Closure Mutability

Severity: Info

The closure at line 90 takes &mut BlockBasedOptions but could potentially be called multiple times with the same instance if there are bugs in the match arms. This is currently safe but fragile.

Recommendation: Consider making it a regular function for clarity:

fn configure_partitioned_index(block_opts: &mut BlockBasedOptions) {
    // ... configuration
}

🔍 Ethereum-Specific Considerations

  1. State trie operations: The partitioned index should help with the random access patterns typical of Ethereum state lookups. ✅

  2. Sync performance: During initial sync, the cache pressure from index blocks can be significant. This change should help by not loading entire index blocks into cache. ✅

  3. Disk space: Two-level indexes add minor overhead (~1-2%) to SST file sizes. This is acceptable for the memory benefits. ✅

📊 Testing Recommendations

  1. Monitor memory usage during full sync and normal operation
  2. Benchmark random state access performance (account reads, storage reads)
  3. Verify compaction behavior with rocksdb.stats enabled
  4. Test block cache hit rates with different cache sizes
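
For recommendation 4, the hit rate over a test run can be derived from two snapshots of the cache hit and miss counters. The sketch below assumes the counters are monotonically increasing totals read before and after the run; how they are collected is left out, and `CacheCounters` is a hypothetical type.

```rust
/// Snapshot of cumulative block-cache counters (hypothetical container).
#[derive(Clone, Copy)]
struct CacheCounters {
    hits: u64,
    misses: u64,
}

/// Hit rate for the interval between two snapshots.
/// Returns None if the counters went backwards or no lookups happened.
fn interval_hit_rate(before: CacheCounters, after: CacheCounters) -> Option<f64> {
    let hits = after.hits.checked_sub(before.hits)?;
    let misses = after.misses.checked_sub(before.misses)?;
    let total = hits.checked_add(misses)?;
    if total == 0 {
        None
    } else {
        Some(hits as f64 / total as f64)
    }
}
```

Comparing this figure across cache sizes, with and without the partitioned index, gives a like-for-like view of cache efficiency.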

Verdict

Approve with modifications: The PR is well-intentioned and mostly correct, but requires adding a block cache configuration to fully realize the benefits. Without it, the memory management improvements are incomplete.

The missing block cache is a significant oversight for a blockchain storage system where unbounded memory growth is unacceptable.


Automated review by Claude (Anthropic) · custom prompt

@github-actions

🤖 Codex Code Review

Review Summary

Small, focused change. No EVM/consensus/state/transaction logic touched. Main impact is RocksDB table configuration for all CFs. Code is clean and idiomatic.

Findings

  1. Potential memory pressure from pinned index/filter blocks
    In crates/storage/backend/rocksdb.rs:87-97, set_pin_top_level_index_and_filter(true) and set_pin_l0_filter_and_index_blocks_in_cache(true) will pin blocks in cache and can grow resident memory across many CFs. If the block cache is modest or not explicitly sized elsewhere, this may cause memory bloat or eviction pressure on data blocks.
    Suggestion: either gate pinning by CF size, or make it configurable, or ensure a sufficiently large shared block cache is configured globally.
    File: crates/storage/backend/rocksdb.rs:87-97

  2. Partitioned filters enabled even when no bloom filter is configured
    set_partition_filters(true) is applied for all CFs, but some CFs do not set a bloom filter (e.g., HEADERS, BODIES, ACCOUNT_CODES, RECEIPTS, and default CFs). This is likely a no-op but adds config noise and could complicate reasoning about performance.
    Suggestion: only enable set_partition_filters(true) for CFs that also set a bloom filter, or document that it’s intentionally applied globally.
    File: crates/storage/backend/rocksdb.rs:87-97, :114-195
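
Both findings could be handled by deciding the settings per column family instead of globally. The following is a sketch under assumptions, not a proposal for the actual code: `CfProfile`, `CfPlan`, the pinned-size estimates, and the percentage budget are all hypothetical. It enables partitioned filters only where a bloom filter exists and grants pinning in order until a share of the block cache is used up.

```rust
/// Hypothetical per-CF description used to decide settings.
struct CfProfile {
    name: &'static str,
    has_bloom_filter: bool,
    /// Estimated bytes that pinning top-level and L0 metadata would hold.
    estimated_pinned_bytes: u64,
}

#[derive(Debug, PartialEq)]
struct CfPlan {
    name: &'static str,
    partition_filters: bool,
    pin_metadata: bool,
}

/// Grant pins in order until `max_pinned_percent` of the cache is reserved.
fn plan_cfs(cfs: &[CfProfile], cache_capacity: u64, max_pinned_percent: u64) -> Vec<CfPlan> {
    let budget = cache_capacity / 100 * max_pinned_percent;
    let mut used = 0u64;
    cfs.iter()
        .map(|cf| {
            let fits = used + cf.estimated_pinned_bytes <= budget;
            if fits {
                used += cf.estimated_pinned_bytes;
            }
            CfPlan {
                name: cf.name,
                // Partitioned filters only matter when a filter is built at all.
                partition_filters: cf.has_bloom_filter,
                pin_metadata: fits,
            }
        })
        .collect()
}
```

The ordering of `cfs` acts as a priority list here, so the column families with the hottest metadata would go first.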

Performance Notes

  • Two-level index + partitioned filters is usually beneficial for large datasets, but there’s a CPU hit on some workloads. Consider scoping to CFs that are known to be large and read-heavy (e.g., trie nodes), or gate behind a config flag if operators want to tune.

Security / Correctness

  • No direct security or correctness issues spotted.
  • No changes touching EVM opcodes, gas, consensus, trie logic, or RLP.

If you want, I can check for existing block cache config to see if the pinning is safe by default.


Automated review by OpenAI Codex · custom prompt

@azteca1998 linked an issue on Feb 12, 2026 that may be closed by this pull request
@github-actions

Lines of code report

Total lines added: 15
Total lines removed: 0
Total lines changed: 15

Detailed view
+------------------------------------------+-------+------+
| File                                     | Lines | Diff |
+------------------------------------------+-------+------+
| ethrex/crates/storage/backend/rocksdb.rs | 330   | +15  |
+------------------------------------------+-------+------+


Labels

performance Block execution throughput and performance in general

Projects

Status: Todo

Development

Successfully merging this pull request may close these issues.

Use two level index

1 participant