
perf(l1): enable per-level LZ4/Zstd compression on RocksDB trie and FKV column families #6176

Open
ilitteri wants to merge 3 commits into main from perf/rocksdb-compression

Conversation


@ilitteri ilitteri commented Feb 11, 2026

Motivation

CPU profiling on EXPB benchmarks (gigablocks and slow scenarios) shows RocksDB compaction consuming 36-60% of cumulative CPU time, with PosixRandomAccessFile::Read at 9-12% flat — indicating heavy disk I/O from uncompressed SST files in the four most write-heavy column families (trie nodes and flat key-values).

Description

Add per-level compression to ACCOUNT_TRIE_NODES, STORAGE_TRIE_NODES, ACCOUNT_FLATKEYVALUE, and STORAGE_FLATKEYVALUE column families:

  • L0-L1: None (recently flushed, short-lived data — compression overhead not worth it)
  • L2-L4: LZ4 (fast compression for mid-tier levels)
  • L5-L6: Zstd (best compression ratio for cold data)

RLP-encoded trie node and flat key-value data compresses 40-70%. This reduces SST file sizes, compaction I/O, and foreground read bandwidth at the cost of some CPU for compression/decompression.

Also adds the zstd feature to the workspace rocksdb dependency (it previously only had lz4).
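
For concreteness, the change amounts to one `set_compression_per_level` call per column family. A minimal sketch of the shape of the change (the helper name here is illustrative; the actual wiring lives in `crates/storage/backend/rocksdb.rs`):

```rust
use rocksdb::{DBCompressionType, Options};

// Illustrative helper: apply the per-level layout described above to a
// column family's Options before its descriptor is built.
fn set_trie_fkv_compression(cf_opts: &mut Options) {
    cf_opts.set_compression_per_level(&[
        DBCompressionType::None, // L0: freshly flushed, short-lived
        DBCompressionType::None, // L1
        DBCompressionType::Lz4,  // L2: cheap, fast compression
        DBCompressionType::Lz4,  // L3
        DBCompressionType::Lz4,  // L4
        DBCompressionType::Zstd, // L5: best ratio for cold data
        DBCompressionType::Zstd, // L6
    ]);
}
```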

No resync required

This is not a breaking change. RocksDB stores the compression type in each SST file's metadata, so it can read both compressed and uncompressed files transparently. Existing uncompressed SST files remain readable as-is — new compressed files are only written as compaction runs. The database gradually transitions to compressed format over time with no manual intervention.
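
Operators who want the smaller on-disk footprint immediately, rather than waiting for natural compaction, could force a one-off full-range compaction per affected column family. A hedged sketch using rust-rocksdb's `compact_range_cf` (the CF name is a placeholder, not ethrex's actual constant):

```rust
use rocksdb::DB;

// Optional one-off migration: compacting the full key range rewrites
// existing SST files under the new per-level compression settings.
fn recompress_cf(db: &DB, cf_name: &str) {
    if let Some(cf) = db.cf_handle(cf_name) {
        // None..None = compact the entire key range of this CF.
        db.compact_range_cf(cf, None::<&[u8]>, None::<&[u8]>);
    }
}
```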

Benchmark Results (EXPB on ethrex-office-3)

All runs had CPU profiling enabled. Fresh baselines (#63-65) and compression runs (#61-62, #66) were run back-to-back on the same machine.

Throughput

| Scenario | Main (Mgas/s) | Compression (Mgas/s) | Change |
|---|---|---|---|
| fast (200 mainnet blocks) | 420.48 | 544.21 | +29.4% |
| gigablocks (100 synthetic max-gas blocks) | 909.63 | 1232.8 | +35.5% |
| slow (5000 mainnet blocks) | 811.04 | 877.06 | +8.1% |

Latency — Fast (200 mainnet blocks)

| Metric | Main | Compression | Change |
|---|---|---|---|
| avg | 67.81 ms | 54.86 ms | -19.1% |
| median | 52.98 ms | 36.75 ms | -30.6% |
| p95 | 148.88 ms | 146.71 ms | -1.5% |
| p99 | 295.55 ms | 208.66 ms | -29.4% |

Latency — Gigablocks (100 synthetic max-gas blocks)

| Metric | Main | Compression | Change |
|---|---|---|---|
| avg | 1410 ms | 1020 ms | -27.7% |
| median | 1200 ms | 854 ms | -28.8% |
| p95 | 2760 ms | 1720 ms | -37.7% |
| p99 | 4340 ms | 2220 ms | -48.8% |

Latency — Slow (5000 mainnet blocks)

| Metric | Main | Compression | Change |
|---|---|---|---|
| avg | 29.48 ms | 26.40 ms | -10.4% |
| median | 21.79 ms | 20.22 ms | -7.2% |
| p95 | 76.03 ms | 68.29 ms | -10.2% |
| p99 | 133.54 ms | 119.07 ms | -10.8% |

CPU Profile Comparison — Gigablocks

| Function | Main (#64) | Compression (#61) | Change |
|---|---|---|---|
| Wall time | 230.28s | 183.38s | -20.4% |
| Total CPU samples | 49833ms (21.64%) | 48792ms (26.61%) | -2% abs, +23% util |
| CompactionJob::Run | 18174ms (36.47%) | 22475ms (46.06%) | +24% abs (includes compression) |
| CompressBlock (NEW) | n/a | 6566ms (13.46%) | Compression overhead |
| VM::execute | 12370ms (24.82%) | 9708ms (19.90%) | -21.5% abs |
| rayon::join (merkle) | 10573ms (21.22%) | 8936ms (18.31%) | -15.5% abs |
| PosixFile::Read (disk I/O) | 5854ms (11.75%) | 5069ms (10.39%) | -13.4% abs |

CPU Profile Comparison — Slow

| Function | Main (#65) | Compression (#62) | Change |
|---|---|---|---|
| Wall time | 1826.40s | 1818.78s | -0.4% |
| Total CPU samples | 127137ms (6.96%) | 158414ms (8.71%) | +24.6% abs, +25% util |
| CompactionJob::Run | 76185ms (59.92%) | 109357ms (69.03%) | +43.5% abs (includes compression) |
| CompressBlock (NEW) | n/a | 30059ms (18.97%) | Compression overhead |
| VM::execute | 13244ms (10.42%) | 12224ms (7.72%) | -7.7% abs |
| PosixFile::Read (disk I/O) | 11178ms (8.79%) | 11602ms (7.32%) | +3.8% abs |
| apply_trie_updates | 10368ms (8.15%) | 9875ms (6.23%) | -4.8% abs |
| WriteBatch::Iterate | 7893ms (6.21%) | 7414ms (4.68%) | -6.1% abs |

Key takeaway: Compression trades ~13-19% of CPU samples for reduced I/O. The gigablocks scenario benefits most — wall time dropped 20% because compaction processes smaller (compressed) SST files, reducing write stall pressure during heavy block execution. The slow scenario shows moderate improvement with block latency dropping 10%.

How to Test

  1. Run EXPB benchmarks: fast, gigablocks, and slow scenarios with CPU profiling enabled
  2. Verify Hoodi snap sync completes without errors (tested on ethrex-mainnet-test-2 — synced to chain tip in ~20 min)
  3. Verify node follows head normally after sync
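
To confirm compression is actually taking effect during these steps, one option is to dump RocksDB's built-in per-level statistics for an affected column family and watch data accumulate compressed at L2 and below. A hedged sketch (the `rocksdb.levelstats` property is standard RocksDB; the CF name is a placeholder, not ethrex's actual constant):

```rust
use rocksdb::DB;

// Spot-check: print per-level file counts/sizes for a CF. Shrinking lower
// levels over time indicate compaction is rewriting data compressed.
fn print_level_stats(db: &DB, cf_name: &str) {
    if let Some(cf) = db.cf_handle(cf_name) {
        if let Ok(Some(stats)) = db.property_value_cf(cf, "rocksdb.levelstats") {
            println!("{cf_name}:\n{stats}");
        }
    }
}
```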

perf(l1): enable per-level LZ4/Zstd compression on RocksDB trie and FKV column families

Add per-level compression to ACCOUNT_TRIE_NODES, STORAGE_TRIE_NODES,
ACCOUNT_FLATKEYVALUE, and STORAGE_FLATKEYVALUE column families:
- L0-L1: None (recently flushed, short-lived data)
- L2-L4: LZ4 (fast compression for mid-tier levels)
- L5-L6: Zstd (best compression ratio for cold data)

These four CFs are the most write-heavy and were previously uncompressed.
RLP-encoded trie node and flat key-value data is highly compressible
(40-70% size reduction), so enabling compression reduces SST file sizes,
compaction I/O, and foreground read bandwidth. Expected 5-15% total CPU
reduction across workloads.

The per-level compression config uses Zstd for L5-L6, but the rocksdb
crate was only built with LZ4 support. Add the "zstd" feature to the
workspace dependency so Zstd compression is linked into the binary.
Copilot AI review requested due to automatic review settings February 11, 2026 14:52
@ilitteri ilitteri requested a review from a team as a code owner February 11, 2026 14:52
@ilitteri ilitteri added performance Block execution throughput and performance in general L1 Ethereum client labels Feb 11, 2026
@ethrex-project-sync ethrex-project-sync bot moved this to In Review in ethrex_l1 Feb 11, 2026
@github-actions

🤖 Kimi Code Review

Review Summary

This PR adds zstd compression support to RocksDB and configures per-level compression strategies for different column families. The changes are well-intentioned but have some issues that need addressing.

Issues Found

1. Missing Error Handling (Critical)

File: crates/storage/backend/rocksdb.rs
Lines: 131-144 and 156-169

The set_compression_per_level method can fail if the compression type is not supported. This should be handled:

```rust
// Current code (lines 131-144):
cf_opts.set_compression_per_level(&[
    rocksdb::DBCompressionType::None,
    // ... other levels
]);

// Should check if zstd is actually available (illustrative pseudocode —
// verify that the rust-rocksdb crate exposes such a check):
if !rocksdb::DBCompressionType::Zstd.is_supported() {
    warn!("Zstd compression not supported, falling back to LZ4");
    // Use alternative compression strategy
}
```

2. Configuration Duplication (Maintainability)

File: crates/storage/backend/rocksdb.rs
Lines: 125-169

The compression configuration is duplicated for both column families. Extract this into a helper function:

```rust
fn configure_compression_options(opts: &mut Options) {
    let compression_levels = if rocksdb::DBCompressionType::Zstd.is_supported() {
        vec![
            rocksdb::DBCompressionType::None,
            rocksdb::DBCompressionType::None,
            rocksdb::DBCompressionType::Lz4,
            rocksdb::DBCompressionType::Lz4,
            rocksdb::DBCompressionType::Lz4,
            rocksdb::DBCompressionType::Zstd,
            rocksdb::DBCompressionType::Zstd,
        ]
    } else {
        // Fallback configuration
        vec![/* ... */]
    };

    opts.set_compression_per_level(&compression_levels);
}
```

3. Missing Feature Gate (Compatibility)

File: Cargo.toml

The zstd feature should be conditional to avoid breaking builds on systems without zstd:

```toml
[dependencies]
rocksdb = { version = "0.24.0", default-features = false, features = [
  "bindgen-runtime",
  "lz4",
  "zstd",  # Consider making this optional
] }
```

4. Performance Considerations (Moderate)

The compression strategy assumes 7 levels (L0-L6), which matches RocksDB's default number of levels. Still, the assumption should be explicitly verified:

```rust
// Add verification
assert_eq!(opts.get_num_levels(), 7, "Unexpected number of levels");
```

5. Documentation (Minor)

File: crates/storage/backend/rocksdb.rs
Lines: 127-130, 152-155

The comments about compression ratios (40-70%) should cite sources or be marked as estimates:

```rust
// TODO: Verify compression ratios through benchmarking
// Preliminary tests suggest RLP-encoded trie nodes compress 40-70%
```

Security Considerations

  • No immediate security vulnerabilities found
  • Compression can potentially enable compression oracle attacks, but this is mitigated by the fact that blockchain data is public anyway

Recommendations

  1. Add runtime checks for compression support
  2. Extract duplicate configuration into helper functions
  3. Add proper error handling for compression configuration
  4. Consider making zstd an optional feature
  5. Add metrics to monitor compression effectiveness

The PR is generally well-structured but needs these improvements for production readiness.


Automated review by Kimi (Moonshot AI) · custom prompt

@github-actions

🤖 Codex Code Review

Adds per-level compression and the zstd dependency only. Mostly fine; a few things to consider:

  • Potential runtime crash if RocksDB is built without zstd support: the zstd feature is enabled, but if the dynamic library is missing in some environments, opening the DB may fail. Consider documenting the new dependency or adding a startup check/log. crates/storage/backend/rocksdb.rs (around lines 131–147, 144–159).
  • Assumed level count hard-coded to 7 entries. If column family is configured with a different num_levels, RocksDB will still accept, but extra levels won’t use intended compression. Consider setting cf_opts.set_num_levels(7) to match the array or using DB::set_options/config to ensure consistency. crates/storage/backend/rocksdb.rs (around lines 131–147, 144–159).
  • Performance tradeoff: Zstd on L5–L6 increases CPU on compaction/read; ok for cold data but can impact sync/initial import. Consider making this configurable or gated by a “compression level” config. Same locations as above.
  • No tests or metrics changes: consider adding a benchmark or noting expected compression ratios in docs instead of code comments. (Comment claims “40–70%” without source.) crates/storage/backend/rocksdb.rs (around lines 128–139, 141–152).

If you want, I can check for config hooks or write a small runtime guard to log missing zstd support.


Automated review by OpenAI Codex · custom prompt

@github-actions

🤖 Claude Code Review


Code Review - PR #6176: Per-Level RocksDB Compression

Summary

This PR adds per-level compression configuration to RocksDB for trie and flat key-value column families, targeting a significant performance improvement with impressive benchmark results (+35.5% throughput on gigablocks, -48.8% p99 latency).


✅ Strengths

  1. Well-motivated change: CPU profiling data clearly identifies the bottleneck (36-60% time in compaction, 9-12% in disk I/O)
  2. Excellent benchmark results: Significant performance improvements across all metrics
  3. Good compression strategy: The tiered approach (None → LZ4 → Zstd) is sensible for LSM tree levels
  4. Code clarity: Comments explain the rationale inline at crates/storage/backend/rocksdb.rs:128-130 and crates/storage/backend/rocksdb.rs:153-155

🔍 Findings

1. Potential Mismatch with RocksDB Level Count (Medium Priority)

Location: crates/storage/backend/rocksdb.rs:131-139 and crates/storage/backend/rocksdb.rs:156-164

The compression configuration specifies 7 levels (L0-L6), but RocksDB's actual level count depends on runtime configuration. With level_compaction_dynamic_level_bytes(true) (line 47), RocksDB dynamically adjusts which levels are active.

Concern: If RocksDB uses more than 7 levels in production (possible with large datasets), levels beyond L6 will have undefined compression behavior. The rocksdb crate documentation indicates that set_compression_per_level will apply the last specified compression type to all higher levels if they exist.

Recommendation:

  • Consider extending the array to cover more levels (e.g., L7-L9) with Zstd
  • Or explicitly document the assumption that 7 levels is sufficient for expected dataset sizes
  • Add a test or assertion to verify level count assumptions
```rust
// Suggested improvement:
cf_opts.set_compression_per_level(&[
    rocksdb::DBCompressionType::None, // L0
    rocksdb::DBCompressionType::None, // L1
    rocksdb::DBCompressionType::Lz4,  // L2
    rocksdb::DBCompressionType::Lz4,  // L3
    rocksdb::DBCompressionType::Lz4,  // L4
    rocksdb::DBCompressionType::Zstd, // L5
    rocksdb::DBCompressionType::Zstd, // L6
    rocksdb::DBCompressionType::Zstd, // L7+ (if they exist)
]);
```

2. Code Duplication (Low Priority)

Location: crates/storage/backend/rocksdb.rs:131-139 and crates/storage/backend/rocksdb.rs:156-164

The exact same compression configuration is duplicated for trie and FKV column families. This creates a maintenance burden if the compression strategy needs to be adjusted.

Recommendation: Extract to a constant or helper:

```rust
const TRIE_AND_FKV_COMPRESSION_LEVELS: [rocksdb::DBCompressionType; 7] = [
    rocksdb::DBCompressionType::None, // L0
    rocksdb::DBCompressionType::None, // L1
    rocksdb::DBCompressionType::Lz4,  // L2
    rocksdb::DBCompressionType::Lz4,  // L3
    rocksdb::DBCompressionType::Lz4,  // L4
    rocksdb::DBCompressionType::Zstd, // L5
    rocksdb::DBCompressionType::Zstd, // L6
];

// Then use:
cf_opts.set_compression_per_level(&TRIE_AND_FKV_COMPRESSION_LEVELS);
```

3. Migration Consideration (Informational)

Impact: Existing databases

When users upgrade, existing SST files won't be automatically recompressed. The new compression only applies to:

  • Newly flushed memtables
  • Files created during future compactions

Question: Should the PR description mention this migration behavior? Users with large existing databases might not see immediate benefits until natural compaction occurs, or they might want to trigger manual compaction.

4. Dependency Version Verification (Low Priority)

Location: Cargo.toml:130-134

The PR adds the zstd feature to rocksdb 0.24.0.

Verification needed: Confirm that rocksdb 0.24.0's Zstd implementation is production-ready and doesn't have known issues. (This is likely fine, but worth a quick check of the rocksdb-rs changelog).

5. Alignment with Global Compression Settings (Minor)

Location: crates/storage/backend/rocksdb.rs:65

The global default is DBCompressionType::None (line 65), but this is overridden for compressible tables and now for trie/FKV tables with per-level compression. This is correct, but the interaction between global and per-CF settings could be documented.


🔒 Security Considerations

No security issues identified. Compression algorithms (LZ4, Zstd) are well-vetted and don't introduce security vulnerabilities in this context.


⚡ Performance Considerations

  1. CPU vs I/O tradeoff: The PR explicitly trades CPU for I/O reduction, which is appropriate given the profiling data showing I/O as the bottleneck.

  2. Decompression overhead on reads: Compressed data must be decompressed on read. The benchmark results show this is net positive, but consider:

    • Read-heavy workloads might have different characteristics
    • Cold storage reads (L5-L6 with Zstd) will have higher decompression cost
    • The benchmark results already validate this tradeoff is beneficial
  3. Compression ratio assumption: The 40-70% compression ratio mentioned in comments (line 130) should ideally be verified periodically as data patterns evolve.


🦀 Rust Best Practices

All good. The code follows idiomatic Rust patterns:

  • Proper use of RocksDB API
  • Clear ownership semantics
  • No unsafe code introduced

📝 Additional Suggestions

  1. Consider logging compression stats: RocksDB can expose compression statistics. Consider enabling periodic logging to verify actual compression ratios in production:

```rust
// Around line 76-77, when statistics are enabled:
opts.enable_statistics();
opts.set_stats_dump_period_sec(600);
```

  2. Documentation: Consider adding a comment about the CPU overhead tradeoff at the function level (line 30) for future maintainers.

✅ Conclusion

This is a high-quality PR with excellent motivation, clear implementation, and strong benchmark validation. The changes are well-targeted and appropriate for the identified bottleneck.

Recommendation: Approve with minor suggestions

The main suggestion is to extend the compression level array to cover more than 7 levels as a safety measure for large production datasets. The code duplication is a minor maintenance issue but not blocking. Everything else is solid.

The 35% throughput improvement and 48% p99 latency reduction speak for themselves: this is a significant performance win for Ethereum state storage operations.


Automated review by Claude (Anthropic) · custom prompt

@github-actions

Lines of code report

Total lines added: 18
Total lines removed: 0
Total lines changed: 18

Detailed view
| File | Lines | Diff |
|---|---|---|
| ethrex/crates/storage/backend/rocksdb.rs | 333 | +18 |

@greptile-apps

greptile-apps bot commented Feb 11, 2026

Greptile Overview

Greptile Summary

This PR enables the zstd feature on the workspace rocksdb dependency and configures per-level compression for the most write-heavy trie and flat key-value column families in RocksDBBackend::open.

Specifically, it keeps L0–L1 uncompressed to avoid CPU overhead on short-lived data, uses LZ4 for mid-tier levels (L2–L4), and Zstd for colder levels (L5–L6). This integrates cleanly with the existing per-column-family options initialization in crates/storage/backend/rocksdb.rs, without changing any read/write logic or table schemas.

Confidence Score: 5/5

  • This PR is safe to merge with minimal risk.
  • Changes are limited to enabling the RocksDB zstd feature and setting per-level compression on specific column families; no functional storage semantics or schemas were altered. The lockfile updates are consistent with the feature addition.
  • No files require special attention

Important Files Changed

| Filename | Overview |
|---|---|
| Cargo.toml | Enables the RocksDB crate's zstd feature alongside existing lz4 support. |
| Cargo.lock | Adds zstd-sys to the lockfile as a transitive dependency from enabling RocksDB zstd support. |
| crates/storage/backend/rocksdb.rs | Configures per-level compression (None/LZ4/Zstd) for trie and flat key-value column families via set_compression_per_level. |

Sequence Diagram

```mermaid
sequenceDiagram
    participant Node as Ethrex Node
    participant Backend as RocksDBBackend::open
    participant Rocks as RocksDB

    Node->>Backend: open(path)
    Backend->>Rocks: list_cf(opts, path)
    Rocks-->>Backend: existing CF names
    Backend->>Backend: build CF descriptors
    Note over Backend: For ACCOUNT_TRIE_NODES/STORAGE_TRIE_NODES
    Backend->>Backend: set_compression_per_level([None,None,Lz4,Lz4,Lz4,Zstd,Zstd])
    Note over Backend: For ACCOUNT_FLATKEYVALUE/STORAGE_FLATKEYVALUE
    Backend->>Backend: set_compression_per_level([None,None,Lz4,Lz4,Lz4,Zstd,Zstd])
    Backend->>Rocks: open_cf_descriptors(opts, path, descriptors)
    Rocks-->>Backend: DB handle
    Backend-->>Node: RocksDBBackend
    Note over Rocks: New compactions write SSTs compressed per level
```

Copilot AI (Contributor) left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@github-actions

github-actions bot commented Feb 11, 2026

Benchmark Block Execution Results Comparison Against Main

| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| base | 66.320 ± 0.525 | 65.655 | 67.055 | 1.01 ± 0.01 |
| head | 65.713 ± 0.562 | 64.944 | 66.471 | 1.00 |

```rust
cf_opts.set_target_file_size_base(256 * 1024 * 1024); // 256MB
cf_opts.set_memtable_prefix_bloom_ratio(0.2); // Bloom filter

// Per-level compression: same strategy as trie CFs.
```

nit: Since this is identical to the trie CFs block above, you could extract a shared constant to avoid the duplication:

```rust
const COMPRESSION_PER_LEVEL: [DBCompressionType; 7] = [
    DBCompressionType::None, DBCompressionType::None,
    DBCompressionType::Lz4, DBCompressionType::Lz4, DBCompressionType::Lz4,
    DBCompressionType::Zstd, DBCompressionType::Zstd,
];
```

Not a big deal with only 2 copies though — fine either way.

