
fix: pruner OOM — skip input fetching, fix slice race, auto-configure GOMEMLIMIT#538

Open
freemans13 wants to merge 1 commit into bsv-blockchain:main from freemans13:fix/pruner-oom

Conversation

@freemans13 (Collaborator)

Summary

  • Skip inputs bin + parsing when skipParentUpdates=true: The existing skipParentUpdates setting already skipped the final flush of parent updates, but the pruner was still fetching the inputs bin from Aerospike, parsing every input into bt.Input objects, and accumulating them into allParentUpdates — only to discard everything. For billion-record prune cycles at 1.3M records/sec, this means billions of wasted allocations and significant network I/O for the largest per-record bin. Now the setting also skips the upstream fetch/parse/accumulate, eliminating the allocation pressure at its source.

  • Fix slice reuse data race in partitionWorker: chunk = chunk[:0] resets the slice length but keeps the same backing array, which the main loop then overwrites via append while goroutines are still reading from it. Replaced with chunk = make([]*aerospike.Result, 0, s.chunkSize) to allocate a fresh backing array per chunk.

  • Auto-configure GOMEMLIMIT from cgroup limits: Reads cgroup v2/v1 memory limits at daemon startup and sets debug.SetMemoryLimit to 90% of the container limit. Combined with the existing GOGC=200, this lets Go allocate fast but tighten GC as it approaches the soft limit — preventing OOM kills while maintaining throughput. No-op on local dev (no cgroup files). Benefits all services, not just the pruner.
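The slice-reuse race in the second bullet can be reproduced in miniature. The sketch below is not the pruner's actual code — `processChunks` and plain `int` items stand in for `partitionWorker` and `*aerospike.Result` — but the `chunk[:0]` vs. `make(...)` distinction is exactly the one the fix makes:

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// processChunks batches items into fixed-size chunks and hands each chunk
// to a goroutine, mirroring the partitionWorker pattern with illustrative
// stand-ins (ints instead of *aerospike.Result).
func processChunks(items []int, chunkSize int) int64 {
	var (
		wg    sync.WaitGroup
		total int64
	)
	dispatch := func(c []int) {
		wg.Add(1)
		go func() { // worker reads c concurrently with the main loop
			defer wg.Done()
			for _, v := range c {
				atomic.AddInt64(&total, int64(v))
			}
		}()
	}

	chunk := make([]int, 0, chunkSize)
	for _, it := range items {
		chunk = append(chunk, it)
		if len(chunk) == chunkSize {
			dispatch(chunk)
			// Buggy version: chunk = chunk[:0] keeps the same backing
			// array, so subsequent appends race with the worker's reads.
			// Fixed version: allocate a fresh backing array per chunk.
			chunk = make([]int, 0, chunkSize)
		}
	}
	if len(chunk) > 0 {
		dispatch(chunk) // flush any partial final chunk
	}
	wg.Wait()
	return total
}

func main() {
	fmt.Println(processChunks([]int{1, 2, 3, 4, 5, 6, 7}, 3)) // 28
}
```

Swapping the `make` back to `chunk = chunk[:0]` and running under `go test -race` (as the test plan does) flags the append in the main loop racing with the worker's reads.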
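The GOMEMLIMIT bullet can be sketched roughly as follows. The file paths are the standard cgroup v2/v1 locations, but the function name `readCgroupMemLimit` and the wiring in `main` are illustrative assumptions, not Teranode's actual startup code:

```go
package main

import (
	"fmt"
	"os"
	"runtime/debug"
	"strconv"
	"strings"
)

// readCgroupMemLimit returns the container memory limit in bytes from the
// first readable cgroup file, or 0 if none applies (local dev, or a
// cgroup v2 "max" entry meaning unlimited).
func readCgroupMemLimit(paths ...string) int64 {
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // file absent: not in a container, or other hierarchy
		}
		s := strings.TrimSpace(string(b))
		if s == "max" { // cgroup v2: no limit configured
			continue
		}
		n, err := strconv.ParseInt(s, 10, 64)
		if err != nil || n <= 0 {
			continue
		}
		return n
	}
	return 0
}

func main() {
	limit := readCgroupMemLimit(
		"/sys/fs/cgroup/memory.max",                   // cgroup v2
		"/sys/fs/cgroup/memory/memory.limit_in_bytes", // cgroup v1
	)
	if limit > 0 {
		soft := limit * 9 / 10 // 90% of the container limit, per the PR
		debug.SetMemoryLimit(soft)
		fmt.Println("GOMEMLIMIT set to", soft)
	} else {
		fmt.Println("no cgroup limit found; GOMEMLIMIT unchanged") // no-op on local dev
	}
}
```

`debug.SetMemoryLimit` (Go 1.19+) sets a soft limit: the GC runs more aggressively as the heap approaches it, which is what lets the existing GOGC=200 stay fast below the threshold.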

Test plan

  • make build-teranode compiles cleanly
  • make lint passes with 0 issues
  • go test -v -race ./services/pruner/... passes
  • go test -v -race ./daemon/... passes
  • Deploy to staging and monitor pruner memory during billion-record prune cycle — should stay within 2Gi
  • Verify the -race flag in CI reports no data races in the pruner path

🤖 Generated with Claude Code

…ce data race

When skipParentUpdates is enabled, the pruner was still fetching the inputs bin
from Aerospike, parsing every input into bt.Input objects, and accumulating
parent updates — only to discard them in flushCleanupBatches. This wastes
network I/O and causes billions of allocations during billion-record prune
cycles, contributing to OOM kills at the 2Gi container limit.

Now the existing skipParentUpdates setting also skips the upstream fetch/parse/
accumulate work. Additionally fixes a slice reuse data race where chunk[:0]
resets the length but shares the backing array with in-flight goroutines, and
adds automatic GOMEMLIMIT configuration from cgroup limits (90%) so Go's GC
can self-regulate within container memory budgets.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

github-actions bot commented Feb 27, 2026

🤖 Claude Code Review

Status: Complete


Findings:

This PR introduces three important improvements: skipping unnecessary input fetching when skipParentUpdates=true, fixing a slice data race, and auto-configuring GOMEMLIMIT from cgroup limits. The race fix and memory limit improvements look solid.

However, there is a logic bug introduced by the skipParentUpdates changes:

In processRecordChunk() around line 950, the code checks if len(inputs) > 0 to determine the file type for external files. When skipParentUpdates=true, the inputs variable is declared inside the conditional block (lines 918-919) and is never populated, causing this check to always evaluate to false.

Suggested fix: Move var inputs []*bt.Input to line 917 (before the if !s.skipParentUpdates block) so it is always initialized to an empty slice, ensuring the len(inputs) > 0 check works correctly in both modes.
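The suggested fix relies on a nil Go slice having length zero. Below is a self-contained analogue — `Input` stands in for `bt.Input`, and `fileTypeFor` is a made-up wrapper, not the real `processRecordChunk` — showing why hoisting the declaration makes the check correct in both modes:

```go
package main

import "fmt"

// Input stands in for bt.Input in this illustrative analogue.
type Input struct{ txid string }

// fileTypeFor mirrors the suggested fix: inputs is declared before the
// conditional, so it is always in scope and defaults to a nil slice.
func fileTypeFor(skipParentUpdates bool, parse func() []*Input) string {
	var inputs []*Input // nil slice: len(inputs) == 0
	if !skipParentUpdates {
		inputs = parse() // populated only when parent updates run
	}
	// len on a nil slice is 0, so this is safe and correct in both modes.
	if len(inputs) > 0 {
		return "with-inputs"
	}
	return "without-inputs"
}

func main() {
	parse := func() []*Input { return []*Input{{txid: "aa"}} }
	fmt.Println(fileTypeFor(false, parse)) // with-inputs
	fmt.Println(fileTypeFor(true, parse))  // without-inputs
}
```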

The memory limit and race fix changes are excellent and address real production issues.

@sonarqubecloud

Quality Gate failed

Failed conditions
71.9% Coverage on New Code (required ≥ 80%)

See analysis details on SonarQube Cloud
