
Update lifecycle for compaction, truncate, improve bftree file management #1780

Open
badrishc wants to merge 17 commits into dev from badrishc/bftree-compaction-flush

Conversation

@badrishc (Collaborator) commented May 8, 2026

No description provided.

Copilot AI review requested due to automatic review settings May 8, 2026 18:24
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

badrishc and others added 17 commits May 8, 2026 16:45
…WInfo.SourceAddress

Foundational Tsavorite-side changes for BfTree (RangeIndex) compaction lifecycle
integration:

- IRecordTriggers.OnFlush gains a logicalAddress parameter so applications can
  derive immutable, address-tagged names for per-flush snapshot files.
- New IRecordTriggers.PostCopyToTail(in src, srcAddr, ref dst, dstAddr) trigger
  fires from TryCopyToTail after CAS-success but BEFORE UnsealAndValidate, so dst
  is sealed during the callback (concurrent readers see SkipOnScan and retry).
  Re-introduces the post-CAS hook that PR 5168a2f removed (PostSingleWriter),
  needed by RangeIndex compaction path.
- New IRecordTriggers.OnTruncate(newBA) trigger fires from
  AllocatorBase.TruncateUntilAddressBlocking AFTER device truncation completes;
  also from the recovery-time trim path. Lets applications clean up external state
  (e.g. per-flush snapshot files) tied to log addresses below the new BeginAddress.
- RMWInfo.SourceAddress added so RIPROMOTE PostCopyUpdater can plumb the source
  address (Address gets reassigned to the destination address by the time
  CopyUpdater runs).
- PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail
  fire path to surface the source logical address. Set at three call sites:
  CompactionConditionalCopyToTail (currentAddress), InternalRead's CopyFromImmutable
  (recSrc.LogicalAddress), and ContinuePending (request.logicalAddress).
- SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the
  disk-frame branch so disk-source scan records arrive at PostCopyToTail with
  invalidated TreeHandle bytes.

Storage tier baseline preserved: BasicLockTests, SpanByteLogCompactionTests, and
BasicStorageTests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, address-tagged flush, PostCTT)

Refactors RangeIndexManager and GarnetRecordTriggers to fix three defects in BfTree
lifecycle integration:

DEFECTS FIXED
- Recovery sees post-checkpoint BfTree state (silent corruption): a single deterministic
  flush.bftree per key was overwritten on every flush, so a recover-to-prior-checkpoint
  read the wrong state. Replaced with immutable, address-tagged
  <hash>.<addr:x16>.flush.bftree files.
- Compaction's TryCopyToTail leaves dst with a dangling TreeHandle: the previous
  PostSingleWriter hook was removed in 5168a2f. Re-introduce via the new
  IRecordTriggers.PostCopyToTail trigger; for live-source the trigger transfers
  ownership (clears src.TreeHandle), for disk-source it pre-stages data.bftree from
  the per-flush snapshot and registers a pending entry.
- Shift/ShiftForced (and ad-hoc Log.Truncate) leak orphan flush files: handled by the
  new OnTruncate trigger, which deletes per-flush snapshots whose addr < newBA and
  cleans orphaned .tmp files.
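The address-tagged naming scheme that fixes the first defect can be sketched as follows. `flush_file_name` is a hypothetical helper (the real logic lives in the C# RangeIndexManager); the `<hash>.<addr:x16>.flush.bftree` shape and the 16-hex-digit address format are taken from the description above.

```python
def flush_file_name(key_hash_hex: str, logical_address: int) -> str:
    """Per-flush snapshot file: <hash>.<addr:x16>.flush.bftree."""
    return f"{key_hash_hex}.{logical_address:016x}.flush.bftree"

# Two flushes of the same key at different log addresses produce two
# distinct files, so neither overwrites the other:
f1 = flush_file_name("ab12cd34", 0x1000)
f2 = flush_file_name("ab12cd34", 0x2000)
assert f1 != f2
assert f1 == "ab12cd34.0000000000001000.flush.bftree"
```

Because the address is part of the name, recovering to an older checkpoint finds the flush file that matches the stub's recorded address rather than whatever the latest flush wrote.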

LATENT BUG ALSO FIXED
- RIPROMOTE'd dst with TreeHandle=0 captured by checkpoint had no snapshot file →
  recovery to that checkpoint failed. RIPROMOTE PostCopyUpdater now branches on
  src.TreeHandle: !=0 live transfer (existing behavior); ==0 cold case pre-stages
  data.bftree from <srcAddr>.flush.bftree and registers a pending entry so the
  checkpoint snapshot captures it correctly.

KEY DESIGN CHANGES
- liveIndexes re-keyed by Guid (XxHash128 of the key bytes, same scheme as the on-disk
  filename prefix). Hot-path access only for checkpoint coordination
  (WaitForTreeCheckpoint), gated by checkpointInProgress short-circuit.
- TreeEntry.Tree is nullable: null = pending (data.bftree pre-staged but no native tree
  opened yet). Pending entries are upgraded to activated on first RestoreTree call.
- TWO roots, mirroring Tsavorite's hlog/cpr-checkpoints separation:
  riLogRoot = {LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/   (working+flush)
  cprDir/<token>/rangeindex/                                       (per-checkpoint)
  Per-checkpoint snapshots are deleted automatically when Tsavorite removes the parent
  token directory; PurgeOldCheckpointSnapshots removed.
- Flat layout (no per-key subdirectories): single Directory.CreateDirectory at startup,
  single EnumerateFiles for OnTruncate, no orphan-dir cleanup needed.
- RestoreTree (renamed from RestoreTreeFromFlush) is now trivial: opens data.bftree
  directly. Pre-staging always happened earlier (PostCopyToTail-cold,
  RIPROMOTE-cold, OnRecoverySnapshotRead).
- OnRecoverySnapshotRead pre-stages data.bftree from cpr-checkpoints/<recoveredToken>/
  rangeindex/<hash>.bftree DURING recovery (snapshot files may be deleted post-recovery).
- Storage tier optional: works without LogDir (falls back to CheckpointDir → cwd).
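The two-roots layout and the LogDir fallback chain can be sketched as below. The helper names are illustrative; the fallback order (`LogDir ?? CheckpointDir ?? cwd`) and the directory suffixes are the ones stated above.

```python
import os

def resolve_ri_log_root(log_dir, checkpoint_dir):
    """Working + per-flush files: {LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/."""
    base = log_dir or checkpoint_dir or os.getcwd()
    return os.path.join(base, "Store", "rangeindex")

def resolve_cpr_snapshot_dir(cpr_dir, token):
    """Per-checkpoint snapshots: cprDir/<token>/rangeindex/."""
    return os.path.join(cpr_dir, token, "rangeindex")

assert resolve_ri_log_root("/var/log", None) == os.path.join("/var/log", "Store", "rangeindex")
assert resolve_ri_log_root(None, "/cpr") == os.path.join("/cpr", "Store", "rangeindex")
assert resolve_cpr_snapshot_dir("/cpr", "tok123") == os.path.join("/cpr", "tok123", "rangeindex")
```

Keeping the per-checkpoint root under the token directory is what lets Tsavorite's normal token-directory removal reclaim those snapshots with no extra purge pass.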

API CHANGES (Tsavorite)
- IRecordTriggers.OnFlush now takes (ref LogRecord, long logicalAddress).
- New IRecordTriggers.PostCopyToTail<TSource>(in src, srcAddr, ref dst, dstAddr).
- New IRecordTriggers.OnTruncate(long newBeginAddress).
- RMWInfo.SourceAddress added; preserved through CopyUpdater so PostCopyUpdater can
  see the source address (rmwInfo.Address gets reassigned to dst by then).
- PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail
  fire path; set at three call sites (compaction, CopyReadsToTail, ContinuePending).
- SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the disk-frame
  branch.

API CHANGES (Garnet)
- BfTreeService.SnapshotToFileByPtr(handle, dataPath, targetPath) — static helper for
  OnFlush which has only the stub's TreeHandle, not the managed wrapper.
- RangeIndexManager.IsTreeLive removed; metrics use stub.TreeHandle != 0 directly.
- RangeIndexManager constructor signature: (enabled, riLogRoot, cprDir, removeOutdated, logger).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sk-cleanup tests for new layout

Tsavorite: new RecordTriggersExtTests (9 tests) covering OnFlush(addr), OnTruncate,
PostCopyToTail, plus default-trigger flag verification. Updates RecordLifecycleTests'
LifecycleRecordTriggers.OnFlush to the new (ref LogRecord, long) signature so the
existing CallOnFlushGatesOnFlushInvocation test fires correctly.

Garnet: updates RIDiskFileCleanupOnDeleteTest and
RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest to expect the new flat file
layout under {LogDir}/Store/rangeindex/<hash>.* instead of the old
{CheckpointDir}/checkpoints/rangeindex/<hash>/ per-key directories.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…te orphan-tmp cleanup

RIFlushFilesAreImmutablePerAddressTest: directly verifies the on-disk invariant that
fixes Defect 1. Inserts v1, force-flushes (creates <hash>.<A1>.flush.bftree), captures
file content, mutates to v2, force-flushes again. Asserts the v1 file STILL exists
with byte-identical content (proving immutability) AND that a NEW v2 file was created
at a distinct address (proving the second flush did not overwrite). Before the fix,
this test would fail: the single deterministic flush.bftree was overwritten on every flush.

RIOnTruncateRemovesOrphanTmpFilesTest: drops a bogus .data.bftree.tmp file in the
riLogRoot, triggers ShiftBeginAddress(_, truncateLog: true), and asserts the orphan
tmp file is deleted. Verifies the new IRecordTriggers.OnTruncate trigger fires post-
device-truncate and that GarnetRecordTriggers.OnTruncate cleans .tmp files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updates the IRecordTriggers Lifecycle table for the new triggers (PostCopyToTail,
OnTruncate) and the OnFlush(addr) signature change. Adds new sections:
- File Layout (two roots): riLogRoot for log-tied files, cprDir/<token>/rangeindex/
  for per-checkpoint snapshots; both flat layouts (no per-key subdirs).
- liveIndexes — Guid-keyed single dictionary; activated vs pending entries.
- liveIndexes hot-path discipline (only WaitForTreeCheckpoint, gated by short-circuit).
- Updated Lazy Promote to describe the new RIPROMOTE PostCopyUpdater cold branch.
- Updated Lazy Restore to describe the trivial RestoreTree (data.bftree direct open).
- Updated Checkpoint Consistency: SnapshotPending=1 on activated AND pending entries;
  pending → File.Copy(data.bftree → snapshot path).
- New Recovery Flow section: OnRecoverySnapshotRead pre-stages above-FUA stubs DURING
  recovery; below-FUA stubs handled lazily via RIPROMOTE-cold + per-flush snapshots.
- New Compaction Lifecycle section: PostCopyToTail live transfer vs cold pre-stage.
- New Scratchpad: Key Plumbing for PostCopyToTail (originalAddress + RMWInfo.SourceAddress).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… cleanup

The previous guard 'if (deleteFiles && stub.StorageBackend == Disk) DeleteTreeFiles(key)'
was a fragile defense that:

(a) Fails defense in depth: DeleteTreeFiles is already idempotent (enumerate-and-delete
    on a hash prefix; no-op if no files match), so the backend check adds no protection
    while creating a path-dependent leak.

(b) Forecloses future memory-snapshot support: the native bftree library exposes
    bftree_snapshot_memory / bftree_recover_memory (currently NotSupported but planned).
    When that ships, a memory-backed tree could legitimately have on-disk artifacts via
    OnFlush / PreStageAndRegisterPending. Without removing this guard, DEL on such a key
    would leak files.

(c) Fails to clean up stale prior-incarnation files: any <hash>.* file existing on disk
    when the key is re-created (or when the backend value is mutated through a future
    code path) would be permanently unreclaimable on DEL.

Drop the guard. DeleteTreeFiles always cleans up any <hash>.* files in riLogRoot
when deleteFiles=true, regardless of current StorageBackend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oundUpToPowerOf2

These were inadvertently dropped during the larger refactor of RangeIndexManager.cs.
Restoring the original docs:
- ComputeLeafPageSize: full <param>/<returns> describing the size buckets.
- ComputeLeafPageSize body: '2.5x, capped at 32KB' and 'Round up to next power of 2'
  inline comments at the bucket-computation steps.
- RoundUpToPowerOf2: <summary> describing the bit-manipulation algorithm.
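A sketch of the two helpers whose docs are being restored. The bit-manipulation round-up is the standard algorithm the `<summary>` describes; the bucket computation (scale by 2.5x, cap at 32 KB, round up to the next power of 2) follows the restored inline comments, though the exact input parameter of the real C# `ComputeLeafPageSize` is an assumption here.

```python
def round_up_to_power_of_2(v: int) -> int:
    """Round a positive int up to the next power of 2 via bit smearing."""
    v -= 1
    v |= v >> 1
    v |= v >> 2
    v |= v >> 4
    v |= v >> 8
    v |= v >> 16
    return v + 1

def compute_leaf_page_size(estimated_entry_bytes: int) -> int:
    """Hypothetical reconstruction: 2.5x, capped at 32KB, then power-of-2."""
    scaled = min(int(estimated_entry_bytes * 2.5), 32 * 1024)
    return round_up_to_power_of_2(scaled)

assert round_up_to_power_of_2(1000) == 1024
assert compute_leaf_page_size(100) == 256      # 250 rounds up to 256
assert compute_leaf_page_size(20000) == 32768  # capped at 32 KB
```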

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… File.Move pattern)

Both PreStageAndRegisterPending and RebuildFromSnapshotIfPending used a .tmp + File.Move
atomic-rename pattern that on closer inspection adds no safety:

PreStageAndRegisterPending callers (PostCopyToTail, RIPROMOTE PostCopyUpdater) run
BEFORE UnsealAndValidate on the destination record, so concurrent readers see SkipOnScan
and retry — no read race possible. Tsavorite's hash-bucket CAS serializes concurrent
writers per key (only one CopyUpdater/TryCopyToTail succeeds per CAS event), so two
PreStage calls cannot race on the same data.bftree file. OnEvict closes the prior
native BfTree before any cold-restore can happen, so no live native fd is open on
data.bftree when this runs.

RebuildFromSnapshotIfPending runs DURING recovery, which is single-threaded with no
concurrent readers. A crash mid-copy is self-healing: the next recovery attempt
re-fires OnRecoverySnapshotRead for the same stub and overwrites any partial file
before any RestoreTree can observe it.

Both functions now use direct File.Copy(src, dst, overwrite: true). Drop the
LogTmpPath helper (no longer used); keep the .tmp suffix cleanup in OnTruncateImpl
as defense in depth for any legacy or future atomic-rename code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…exclusive lock

Critical concurrency fix exposed by the question 'is RIPROMOTE running under exclusive
lock?'. The answer is no:

  • ReadRangeIndex RELEASES the shared lock before invoking PromoteToTail.
  • There is no TryPromoteSharedLock upgrade anywhere in RangeIndex code.
  • RIPROMOTE's RMW (CopyUpdater + PostCopyUpdater) runs with no rangeIndex lock held.

Combined with: CASRecordIntoChain in Helpers.cs:177 unseals dst IMMEDIATELY on CAS
success, so by the time PostCopyUpdater fires, concurrent readers can already observe
dst with TreeHandle == 0 and invoke RestoreTree, which opens data.bftree directly.

Race scenario (cold case):
1. Thread A: PostCopyUpdater starts PreStageAndRegisterPending — File.Copy partway done.
2. Thread B: ReadRangeIndex sees the unsealed dst with TreeHandle=0, releases shared
   lock, acquires exclusive lock for RestoreTree.
3. Thread B: RestoreTree opens data.bftree → reads partial file → corruption.

Atomic-rename .tmp + File.Move doesn't fix this either — Thread B might see 'file
doesn't exist' before the rename completes and return spurious NOTFOUND.

Fix: PreStageAndRegisterPending now acquires the per-key EXCLUSIVE rangeIndex lock for
the duration of File.Copy AND the pending-entry registration. This blocks RestoreTree's
exclusive acquire until pre-stage completes; Thread B then sees the fully-written
data.bftree.

A direct File.Copy is sufficient under this lock — the exclusive lock serializes
against any reader that would open data.bftree, against other concurrent
PreStageAndRegisterPending calls for the same key, and crash-mid-copy is self-healing
via OnRecoverySnapshotRead or RIPROMOTE-cold re-pre-stage.
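The shape of the fix can be sketched in a few lines. This is a minimal model, not Garnet's API: `key_locks` and `pending` are illustrative stand-ins for the per-key exclusive lock and the pending-entry table, and a plain `threading.Lock` approximates the exclusive mode of the real reader-writer lock.

```python
import shutil
import threading
from collections import defaultdict

key_locks = defaultdict(threading.Lock)   # stand-in for per-key exclusive locks
pending = {}                              # stand-in for pending-entry registration

def pre_stage_and_register_pending(key, flush_path, data_path):
    with key_locks[key]:                        # blocks RestoreTree's acquire
        shutil.copyfile(flush_path, data_path)  # direct copy; no .tmp + rename
        pending[key] = data_path                # registered under the same lock

def restore_tree(key, data_path):
    with key_locks[key]:                        # waits until pre-stage completes
        with open(data_path, "rb") as f:        # stands in for the native open
            return f.read()
```

Because `restore_tree` takes the same lock, it can only ever observe a fully written data.bftree, which is exactly the property the partial-copy race violated.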

RebuildFromSnapshotIfPending continues to use direct File.Copy without a lock —
recovery is single-threaded with no concurrent readers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lush invariant, OnTruncate)

ISSUE 1 (BLOCKER): Stale source eviction removed the live entry.

  After PostCopyToTail-live or RIPROMOTE-live, the source record's TreeHandle was
  cleared but the record otherwise stayed in the chain. When the source page later
  evicted, OnEvict→DisposeTreeUnderLock saw TreeHandle==0 and hit the early branch,
  removing the liveIndexes entry that NOW belonged to the destination — losing
  checkpoint coverage and DEL-time native-tree disposal.

  Fix: Add RangeIndexStub.FlagTransferred = 4 (uses one of the reserved Flag bits;
  no stub-layout change). Both transfer paths now set this flag on src in BOTH live
  and cold branches:
    • GarnetRecordTriggers.PostCopyToTail (live + cold).
    • RMWMethods.PostCopyUpdater (RIPROMOTE live + cold).
  DisposeTreeUnderLock checks FlagTransferred FIRST — when set, no-op (the entry
  belongs to a newer destination record). For DEL/UNLINK (deleteFiles=true) we
  still proceed with file cleanup.

  GarnetRecordTriggers.OnFlush also checks FlagTransferred — a transferred-out src
  must NOT snapshot a stale view (TreeHandle was cleared, and any snapshot from
  data.bftree would race with dst's active mutations).

  Regression test: RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest.
  Verified the test fails when the FlagTransferred check is removed.

ISSUE 2 (MAJOR): Pre-stage failures left dst in a broken state.

  PreStageAndRegisterPending logged a warning and returned silently on missing
  source flush file or copy failure. The caller (PostCopyToTail / PostCopyUpdater)
  cleared FlagFlushed unconditionally, leaving an above-FUA dst stub with
  TreeHandle=0, FlagFlushed=0, no data.bftree, and no pending entry — violating
  Invariants 4 and 5.

  Same problem in OnFlush: when TreeHandle=0 and data.bftree was missing, OnFlush
  silently skipped the snapshot but still set FlagFlushed → unrestorable state.

  Fix: PreStageAndRegisterPending logs as ERROR (not warning) on source-missing
  and explicitly does NOT fall back to any other flush file (per user direction:
  recovering the wrong tree version is worse than failing). The destination's
  RestoreTree will return NOTFOUND for the affected key, surfacing the data loss
  explicitly. OnFlush also logs as ERROR and DOES NOT set FlagFlushed when the
  source data.bftree is missing — the next access routes through RestoreTree
  which returns NOTFOUND for the key while leaving the record otherwise consistent.

ISSUE 3 (MAJOR): Cluster mode disabled all RangeIndex truncation cleanup.

  OnTruncateImpl returned early when removeOutdatedCheckpoints=false (cluster
  mode), leaking obsolete <hash>.<addr>.flush.bftree files and stale .tmp files
  indefinitely.

  Per-flush files are LOG-tied (their lifetime tracks log addresses), not
  checkpoint-tied — they are safe to delete once Tsavorite's BeginAddress passes
  their address, regardless of cluster mode. removeOutdatedCheckpoints controls
  CHECKPOINT-token-tied snapshot retention only.

  Fix: the log-tied cleanup in OnTruncateImpl now runs unconditionally,
  regardless of removeOutdatedCheckpoints.
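The log-tied reclamation rule can be sketched as a filename predicate. The regex and helper name are illustrative; the `<hash>.<addr:x16>.flush.bftree` shape and the "addr < new BeginAddress" condition come from the description above.

```python
import re

FLUSH_FILE = re.compile(r"^(?P<hash>[0-9a-f]+)\.(?P<addr>[0-9a-f]{16})\.flush\.bftree$")

def reclaimable(filename: str, new_begin_address: int) -> bool:
    """A per-flush snapshot is safe to delete once BeginAddress passes its address."""
    m = FLUSH_FILE.match(filename)
    return m is not None and int(m.group("addr"), 16) < new_begin_address

assert reclaimable("ab12.0000000000001000.flush.bftree", 0x2000)
assert not reclaimable("ab12.0000000000003000.flush.bftree", 0x2000)
assert not reclaimable("ab12.data.bftree", 0x2000)  # working file: not log-reclaimed
```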

All 61 RangeIndex tests + 35 Tsavorite tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug: DeleteTreeFiles indiscriminately deleted ALL <hash>.* files in riLogRoot,
including <hash>.<addr:x16>.flush.bftree files that may still be required to recover
an OLDER checkpoint taken BEFORE the DEL operation.

Pre-fix data-loss flow:
  T=0: RI.CREATE key K, write, flush → <hash>.<A1>.flush.bftree
  T=1: SAVE (checkpoint C1)
  T=2: more writes, more flushes → <hash>.<A2>.flush.bftree
  T=3: DEL K → ALL <hash>.* files deleted, including <A1>.flush.bftree
  T=4: crash; recover to C1 → recovered log has stub at A1 with FlagFlushed=1
  T=5: RI.GET → RIPROMOTE-cold → looks for <hash>.<A1>.flush.bftree → MISSING
  → silent data loss for the recovered key.

Fix: DeleteTreeFiles now only deletes the WORKING files:
  • <hash>.data.bftree
  • <hash>.data.bftree.tmp (defensive, in case of any legacy atomic-rename leftover)

Per-flush snapshot files (<hash>.<addr:x16>.flush.bftree) are LOG-tied: they may be
needed to recover an older checkpoint, and are reclaimed by OnTruncate only when
Tsavorite's BeginAddress passes their address — at which point no checkpoint can
recover the pre-delete state anyway.

Per-checkpoint snapshots under cpr-checkpoints/<token>/rangeindex/<hash>.bftree are
already not touched here (different root); Tsavorite removes them with the parent
token directory.

Side effect: DeleteTreeFiles no longer iterates the directory — direct File.Delete
on two known paths. The ONLY remaining directory iteration in production code is
now OnTruncateImpl (which is unavoidable: it scans for files matching addr<newBA).

Updated tests:
  • RIDiskFileCleanupOnDeleteTest now only asserts data.bftree is removed.
  • RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest now asserts that the
    flush.bftree files PRESERVE their count after DEL (not zero).
  • New RIDeletePreservesPerFlushSnapshotFilesTest explicitly asserts the
    LOG-tied lifetime contract: DEL preserves <hash>.<addr>.flush.bftree files.

All 62 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No production path produces <hash>.data.bftree.tmp files. Every file-write path
in RangeIndexManager uses direct File.Copy(overwrite: true):
  • PreStageAndRegisterPending (under per-key exclusive lock)
  • RebuildFromSnapshotIfPending (single-threaded recovery)
  • Per-checkpoint snapshot capture
Verified across libs/ that no File.Move calls exist anywhere, and the Rust
BfTree library does not produce .tmp files either.

The .tmp handling that existed was speculative cleanup for 'legacy or future
atomic-rename code paths' (per its own doc comment). There is no legacy
deployment of this format (this PR introduces it), and there is no plan to
migrate to a .tmp + Move pattern (current design relies instead on:
  - Crash-safety of partial data.bftree via re-pre-stage on next access /
    next OnRecoverySnapshotRead overwrite.
  - The per-key exclusive lock serializing readers vs the in-flight Copy.)

Changes:
  • DeleteTreeFiles: deletes only <hash>.data.bftree (one File.Delete, no
    helper closure, no enumeration). Updated xmldoc.
  • OnTruncateImpl: dropped the .data.bftree.tmp branch (~6 lines + comment);
    only enumerates / matches <hash>.<addr:x16>.flush.bftree now.
  • RebuildFromSnapshotIfPending: dropped 'cf. .tmp + File.Move' aside from
    xmldoc; kept the crash-safety rationale.
  • GarnetRecordTriggers.OnDiskRead-cold branch: dropped '(atomic via .tmp +
    File.Move)' aside.
  • DisposeTreeUnderLock xmldoc: simplified deleteFiles description (no
    longer references .tmp).
  • Test RIOnTruncateRemovesOrphanTmpFilesTest: removed (purely synthetic;
    asserted behavior of code with no production caller).
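After this simplification, the delete path reduces to a single delete on the one working file. A minimal sketch (helper name and idempotent-on-missing behavior are illustrative; the real code is one C# File.Delete):

```python
import os

def delete_tree_files(ri_log_root: str, key_hash_hex: str) -> None:
    """Delete only the working file; per-flush snapshots are log-tied
    and reclaimed by OnTruncate, never here."""
    path = os.path.join(ri_log_root, f"{key_hash_hex}.data.bftree")
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # idempotent: nothing to delete is a no-op
```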

All 61 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor RangeIndexStub flag handling to mirror Tsavorite's RecordInfo pattern:
typed properties for each bit, private backing field, centralized reset, and
explicit reservation of bits for future use.

Changes
-------
RangeIndexStub:
  • New top-of-struct comment block documenting the full bit layout:
      [Unused7][Unused6][Unused5][Pinned*][Modified*][Transferred][Recovered][Flushed]
  • Bit constants: kFlushedBitMask, kRecoveredBitMask, kTransferredBitMask
    (private const byte; mask-only form since byte is small enough that
    offset/mask split adds no value).
  • Bits 3 (Modified) and 4 (Pinned) RESERVED — masks deferred until those
    semantics actually exist (avoids inviting premature usage). Bits 5..7
    free for future expansion → ≥5 bits headroom after Modified+Pinned land.
  • IsFlushed / IsRecovered / IsTransferred are now SETTABLE properties
    (was readonly-get only) — get returns (flags & mask) != 0; set uses the
    standard ternary pattern. Inlined by JIT, byte-identical codegen vs
    the prior 'stub.Flags |= mask;' calls.
  • 'public byte Flags' renamed to 'private byte flags' — all writes must
    now go through typed properties or ResetFlags(), centralizing invariant
    control as more bits gain semantics.
  • New ResetFlags() instance method (AggressiveInlining) for the two
    legitimate full-reset callsites.
  • Old 'internal const byte FlagFlushed/FlagRecovered/FlagTransferred'
    members removed.
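An illustrative Python analogue of the pattern (private flags byte, mask constants, settable boolean properties, centralized reset). The bit values follow the documented layout, where Flushed/Recovered/Transferred occupy the low three bits (consistent with the raw-stub test setting byte 33 to 4 for Transferred); everything else here is a sketch, not the C# struct.

```python
K_FLUSHED, K_RECOVERED, K_TRANSFERRED = 0x01, 0x02, 0x04  # bits 3-4 reserved

class RangeIndexStubSketch:
    def __init__(self):
        self._flags = 0  # mirrors 'private byte flags'

    def _get(self, mask):
        return (self._flags & mask) != 0

    def _set(self, mask, value):
        # the standard ternary pattern: OR on set, AND-NOT on clear
        self._flags = (self._flags | mask) if value else (self._flags & ~mask)

    @property
    def is_flushed(self): return self._get(K_FLUSHED)
    @is_flushed.setter
    def is_flushed(self, v): self._set(K_FLUSHED, v)

    @property
    def is_transferred(self): return self._get(K_TRANSFERRED)
    @is_transferred.setter
    def is_transferred(self, v): self._set(K_TRANSFERRED, v)

    def reset_flags(self):
        self._flags = 0  # the one sanctioned full reset

s = RangeIndexStubSketch()
s.is_flushed = True
s.is_transferred = True
assert s._flags == 0x05
s.is_flushed = False
assert s._flags == 0x04
s.reset_flags()
assert s._flags == 0
```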

RangeIndexManager.Index.cs:
  • CreateIndex: 'stub.Flags = 0' → 'stub.ResetFlags()'.
  • RecreateIndex: '&= ~FlagRecovered' → 'stub.IsRecovered = false'.
  • Static span-based wrappers (SetFlushedFlag, ClearFlushedFlag,
    SetTransferredFlag, MarkRecoveredFromCheckpoint) reimplemented via the
    new properties. Helper signatures + AggressiveInlining unchanged so all
    callsites compile without modification and produce identical machine code.
  • Composite helpers (ClearTreeHandle, InvalidateStub,
    MarkRecoveredFromCheckpoint) kept as separate semantic entry points —
    their underlying writes may converge today but encode different
    invariants (ownership transfer vs disk-load staleness vs recovery
    marker).

RangeIndexManager.cs:
  • HandleRangeIndexCreateReplay (AOF replay): 'stub.Flags = 0' →
    'stub.ResetFlags()'.
  • Doc xref FlagFlushed → IsFlushed.

RangeIndexOps.cs:
  • Object initializer 'Flags = 0' line removed (struct default-initializes
    the private flags byte to 0).

RMWMethods.cs / RespRangeIndexTests.cs:
  • Comment / message references to FlagFlushed/FlagRecovered/FlagTransferred
    updated to IsFlushed/IsRecovered/IsTransferred.

Wire format
-----------
Unchanged. Byte 33 (the flags byte) keeps the same bit assignments. Existing
checkpoint snapshots, AOF entries, and on-disk records remain readable. The
test that constructs a raw 35-byte stub via 'staleStub[33] = 4' continues to
exercise the on-the-wire format directly (now with corrected boolean wording
in its assertion messages).

Perf
----
Zero. Property setters/getters are inlined; literal true/false at every
callsite folds the setter ternary at JIT time → same OR/AND/STORE as before.
'private' is purely a compile-time access modifier with no runtime cost.

All 61 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comments
--------
Make all RangeIndex comments self-contained, present-tense, and accurate:
  • Replace stale FlagFlushed/FlagRecovered/FlagTransferred references in
    comments and doc strings with the current property names IsFlushed/
    IsRecovered/IsTransferred (also normalize 'FlagX=1' to 'IsX=true').
  • Test RIFlushFilesAreImmutablePerAddressTest: rewrite xmldoc as a
    behavioral spec for the per-flush snapshot file immutability invariant
    (drop 'Defect 1', 'Pre-fix behavior', 'Post-fix behavior'); rename
    test key 'defect1key' → 'flushtestkey'; remove 'Defect 1 invariant'
    label from assertion message.
  • Test RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest: rewrite xmldoc
    and assertion messages as behavioral specs (drop 'GPT-5.5-flagged
    blocker', 'The bug: pre-fix...', 'The fix: ...'). The discriminating
    contrast (without IsTransferred → entry IS removed) remains as a
    present-tense spec check.
  • GarnetRecordTriggers.OnFlush cold branch: rephrase 'Invariant 1+5' /
    'violates Invariant 5' as 'per-flush snapshot invariant has been
    violated' and inline what the consequence is.

Format
------
Tsavorite test file RecordTriggersExtTests.cs had a trailing newline
beyond the canonical single-newline EOF; removed.

Verification
------------
  • dotnet format Garnet.slnx --verify-no-changes      → clean
  • dotnet format Tsavorite.slnx --verify-no-changes   → clean
  • dotnet build test/Garnet.test                      → 0 warnings, 0 errors
  • RespRangeIndexTests (61 tests)                     → all pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Intermittent CI failure on this PR (and any PR touching native bindings):

  Error: cargo.exe failed with exit code 101
  command: cargo metadata --all-features --format-version 1 --no-deps
  stderr: error: could not find Cargo.toml in D:\a\garnet\garnet or any parent directory

Root cause
----------
actions-rust-lang/setup-rust-toolchain@v1 defaults to cache: true with
cache-workspaces: '. -> target'. Internally it invokes Swatinem/rust-cache,
which runs 'cargo metadata' from the repo root to build the cache key. The
Garnet repo has no root Cargo.toml — the only Cargo workspace lives at
libs/native/bftree-garnet — so the metadata call exits 101.

The failure is intermittent because the cache action sometimes recovers
(falls back to the lockfile-based key it already discovered) and sometimes
surfaces the exit 101 as a step failure.

Fix
---
Pass cache-workspaces: libs/native/bftree-garnet to the action so rust-cache
runs cargo metadata from the directory that actually contains Cargo.toml.
The cache-key derivation continues to work and the spurious exit-101 error
goes away.
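A hedged sketch of the workflow change (step name and surrounding context are illustrative; the `cache-workspaces` value is the one named in the fix):

```yaml
- name: Set up Rust
  uses: actions-rust-lang/setup-rust-toolchain@v1
  with:
    # Point rust-cache's 'cargo metadata' at the directory that
    # actually contains Cargo.toml, not the repo root.
    cache-workspaces: libs/native/bftree-garnet
```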

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Regression introduced in commit 30885d3 ('RangeIndex: refactor for compaction
lifecycle'): the dev branch's OnFlush trigger delegated to
RangeIndexManager.SnapshotTreeForFlush(), which acquired the per-key exclusive lock around the
native bftree snapshot call. The refactor inlined the snapshot body directly
into the OnFlush trigger as a raw BfTreeService.SnapshotToFileByPtr(...) call
with NO lock — silently dropping the serialization guarantee.

Why this matters
----------------
At the moment OnFlush fires on a record, that record:
  - Is the chain head (no RIPROMOTE has occurred yet — IsTransferred=false).
  - Has TreeHandle pointing at the live native tree.
  - Has IsFlushed=false (about to be set TO true by OnFlush at the end).

Worker threads doing RI.SET / RI.DEL on this record use the same TreeHandle
to call native InsertByPtr / DeleteByPtr while holding the per-key SHARED
lock. Without the per-key exclusive lock in OnFlush, the native
SnapshotToFileByPtr can run concurrently with InsertByPtr on the same tree.
If the BfTree native lib does not fully serialize snapshot vs insert
internally, this races and can hang or corrupt.

This is the root cause of the rare RIConcurrentOpsWithCheckpointTest hang in
CI. The test maximizes the race window by having 4 worker threads
tight-looping RI.SET on a single key while a 5th thread takes a SAVE
checkpoint — the IN_PROGRESS → WAIT_FLUSH transition in Tsavorite's
checkpoint state machine triggers OnFlush per record, racing with workers
that entered before SetCheckpointBarrier set checkpointInProgress=true.

The IsTransferred early-return at the top of OnFlush is NOT enough on its
own. It only short-circuits re-firings of OnFlush on a stale source AFTER
the source has been transferred to a tail destination via PostCopyToTail
or RIPROMOTE PostCopyUpdater. For the FIRST OnFlush on any record,
IsTransferred=false because the transfer is triggered BY IsFlushed=true,
which OnFlush is the one to set.

Fix
---
Centralize the snapshot-for-flush logic in
RangeIndexManager.SnapshotTreeForFlush(key, valueSpan, logicalAddress) —
mirroring the dev branch pattern. Acquire the per-key exclusive lock around
the snapshot/copy body. The OnFlush trigger is now a thin delegation. This
restores the serialization guarantee against:
  - Concurrent worker InsertByPtr/DeleteByPtr on the same TreeHandle.
  - Concurrent SnapshotAllTreesForCheckpoint on the same TreeHandle (which
    already holds the same per-key exclusive lock).
  - Concurrent PreStageAndRegisterPending data.bftree rewrite (relevant to
    the cold-path File.Copy(data.bftree -> flush.bftree)).

The IsFlushed flag is now set INSIDE the lock body on success, and is
explicitly NOT set on the cold-path-data.bftree-missing path — preserving
the existing 'route through RestoreTree -> NOTFOUND' contract for
unrestorable records. (This is a small improvement over the prior code,
which set IsFlushed unconditionally after the native call returned.)
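The restored control flow can be modeled in a few lines. This is a sketch, not Garnet's API: the stub is a dict, `snapshot_to_file` stands in for the native snapshot (or cold File.Copy), and a plain `threading.Lock` approximates the exclusive mode of the per-key lock.

```python
import threading
from collections import defaultdict

key_locks = defaultdict(threading.Lock)  # stand-in for per-key exclusive locks

def snapshot_tree_for_flush(key, stub, snapshot_to_file) -> bool:
    if stub["is_transferred"]:        # transferred-out source: never snapshot stale state
        return False
    with key_locks[key]:              # serializes vs worker InsertByPtr/DeleteByPtr
        if not snapshot_to_file():    # snapshot failed (e.g. data.bftree missing)
            return False              # do NOT set IsFlushed: surface as unrestorable
        stub["is_flushed"] = True     # set inside the lock, on success only
        return True
```

The key properties the fix restores are visible in the sketch: the snapshot and the flag write happen under the same lock, and failure leaves IsFlushed unset.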

Test safety
-----------
Added [CancelAfter(120_000)] to RIConcurrentOpsWithCheckpointTest so any
remaining or unrelated hang in this stress test surfaces as an NUnit
timeout failure with a stack trace, in addition to CI's existing
--blame-hang-timeout dump.

Verification
------------
  - dotnet format Garnet.slnx --verify-no-changes -> clean
  - dotnet build test/Garnet.test                  -> 0 warnings, 0 errors
  - 61 RespRangeIndexTests                          -> all pass

Note: holding additional fixes pending the dev-vs-PR design audit
currently in progress (will land as separate commits if/when surfacing
other lost design aspects).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore three protective measures that were silently weakened during the
compaction-lifecycle refactor (commit 30885d3).

1. SnapshotTreeForFlush: don't gate memory-backend re-anchor on riLogRoot
---------------------------------------------------------------------------
The first line of SnapshotTreeForFlush early-returned on
'string.IsNullOrEmpty(riLogRoot)' BEFORE reaching the
'StorageBackend == Memory' branch that calls SetFlushedFlag. For a
memory-backed RI tree in any configuration where riLogRoot is null/empty
(unit tests instantiating RangeIndexManager(true) directly, or any future
deployment without a log dir), the memory branch was unreachable.

Without IsFlushed=true, ReadRangeIndex never triggers PromoteToTail
(RIPROMOTE) — the stub stays frozen in the read-only region. On first page
eviction, OnEvict -> DisposeTreeUnderLock disposes the live BfTree (no
IsTransferred guard fires). All subsequent accesses route through
RestoreTree -> no data.bftree (memory tree never had one) -> silent
NOTFOUND on a key that was supposed to live in memory.

In production today this is unreachable because GarnetServer.cs:321 always
computes a non-empty riLogRoot. Defense-in-depth fix: reorder so the memory
branch fires regardless of riLogRoot, and only gate disk-backed paths on
the riLogRoot check.
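The reordering described above can be sketched in a few lines. This is an illustrative, language-agnostic sketch (not Garnet's C# code); the names `storage_backend`, `ri_log_root`, `set_flushed_flag`, and `snapshot_to_file` are hypothetical stand-ins for the real members:

```python
# Sketch of the SnapshotTreeForFlush guard reordering: the memory-backend
# branch runs BEFORE the riLogRoot gate, so a memory-backed tree always gets
# IsFlushed=true even when no log directory is configured.
def snapshot_tree_for_flush(stub, ri_log_root):
    # Memory-backed trees need only the IsFlushed flag; handle them first,
    # regardless of whether a log root exists.
    if stub.storage_backend == "Memory":
        stub.set_flushed_flag()
        return

    # Only disk-backed snapshot paths require a log directory.
    if not ri_log_root:
        return

    stub.snapshot_to_file(ri_log_root)
    stub.set_flushed_flag()
```

With this ordering, the early return on a null/empty `ri_log_root` can no longer starve the memory branch, which is exactly the defense-in-depth property the fix restores.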

2. DisposeTreeUnderLock: take per-key lock around cold-eviction TryRemove
-------------------------------------------------------------------------
The cold-eviction branch (TreeHandle==0, deleteFiles=false) early-returned
with 'liveIndexes.TryRemove(KeyId(key), out _)' OUTSIDE the per-key
exclusive lock. Every other liveIndexes mutator (RegisterIndex,
UnregisterIndex, PreStageAndRegisterPending) holds the lock — this single
instance broke the discipline.

ConcurrentDictionary is thread-safe at the operation level, but invariants
spanning multiple operations need external serialization. Race scenario:
SetCheckpointBarrier marks pending entry's SnapshotPending=1, then a
concurrent cold eviction TryRemoves it before SnapshotAllTreesForCheckpoint
captures it. Result: silent checkpoint coverage gap for the affected key —
recovery from that checkpoint produces NOTFOUND.

Fix: acquire the lock first, then dispatch on TreeHandle/deleteFiles inside
the locked block. Cost: one extra lock acquisition on the cold-eviction
path, with no shared-lock contention because the page is being torn down.
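The lock-then-dispatch shape of the fix can be illustrated as follows. This is a minimal sketch, not Garnet's implementation; `TreeRegistry`, `live_indexes`, and the per-key lock map are hypothetical names standing in for the real structures:

```python
# Sketch of the DisposeTreeUnderLock fix: acquire the per-key lock
# unconditionally, then branch on (tree_handle, delete_files) INSIDE the
# locked region, so every live_indexes mutation is serialized with the
# checkpoint-barrier marking done by other lock holders.
import threading

class TreeRegistry:
    def __init__(self):
        self.live_indexes = {}   # key_id -> entry
        self.key_locks = {}      # key_id -> per-key exclusive lock

    def _lock_for(self, key_id):
        return self.key_locks.setdefault(key_id, threading.Lock())

    def dispose_tree_under_lock(self, key_id, tree_handle, delete_files):
        with self._lock_for(key_id):          # lock FIRST, then dispatch
            if tree_handle == 0 and not delete_files:
                # Cold eviction: remove the pending entry, now serialized
                # with SetCheckpointBarrier / SnapshotAllTreesForCheckpoint.
                self.live_indexes.pop(key_id, None)
                return
            # ... live-tree disposal and (optionally) file deletion here ...
```

The extra acquisition is cheap on this path because, as noted above, the page is being torn down and no shared-lock holders contend for it.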

3. RestoreTree: re-add Debug.Assert for missing data.bftree
-----------------------------------------------------------
Dev had Debug.Assert(File.Exists(restorePath), '...') for the recovered-stub
restore path. The PR's two-branch restore (checkpoint vs flush) was replaced
by a single 'always use data.bftree' path, and the assert was softened to a
LogWarning + return false.

The data.bftree pre-staging invariant ('every above-FUA stub with
TreeHandle=0 must have a valid data.bftree') is load-bearing for the lazy-
restore chain. If pre-staging breaks (in PostCopyToTail-cold, RIPROMOTE
PostCopyUpdater-cold, or OnRecoverySnapshotRead), the current code surfaces
the failure as a key silently returning NOTFOUND. The original Debug.Assert
would have caught it loudly in Debug-build CI.

Fix: restore the Debug-build assert immediately before the LogWarning.
Release builds: behavior unchanged. Debug builds (CI): assertion failure
with stack trace, immediately surfacing any pre-stage chain bug.
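The assert-then-warn ordering maps neatly onto Python's `assert` (stripped under `-O`, much as `Debug.Assert` is compiled out of Release builds). A hedged sketch, with `restore_path` handling and messages purely illustrative:

```python
# Sketch of the restored pattern: a debug-only assertion fires loudly right
# before the soft warning, so a broken pre-stage chain fails fast in CI while
# Release-like builds keep the unchanged warn-and-return-false behavior.
import os

def restore_tree(restore_path, log_warning):
    if not os.path.exists(restore_path):
        # Debug builds: fail fast with a stack trace (like Debug.Assert).
        assert os.path.exists(restore_path), \
            f"pre-staging invariant violated: missing {restore_path}"
        # Release builds (asserts stripped) fall through to the warning.
        log_warning(f"data.bftree missing at {restore_path}; cannot restore")
        return False
    # ... open and restore the tree from restore_path ...
    return True
```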

Skipped from audit
------------------
- Finding 4 (StorageBackend == Disk guard dropped from DeleteTreeFiles
  call): pure defensive-coding loss; DeleteTreeFiles guards via
  File.Exists, memory trees no-op. Not worth a fix.
- Finding 5 (PurgeOldCheckpointSnapshots dropped, cleanup delegated to
  Tsavorite): needs verification that Tsavorite's per-token directory
  cleanup correctly reaches the rangeindex/ subdir. Investigation
  deferred — not a fix.

Verification
------------
  - dotnet format Garnet.slnx --verify-no-changes  -> clean
  - dotnet build test/Garnet.test                  -> 0 warnings, 0 errors
  - 61 RespRangeIndexTests                         -> all pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/bftree-compaction-flush branch from d2adb1b to baa7b40 Compare May 8, 2026 23:47
@badrishc badrishc requested a review from Copilot May 9, 2026 00:14

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

Comment on lines 41 to 60
// Insert the new record by CAS'ing either directly into the hash entry or splicing into the readcache/mainlog boundary.
var success = CASRecordIntoChain(newLogicalAddress, ref newLogRecord, ref stackCtx);
if (success)
{
    // We don't call PostInitialWriter here so we must do the size tracking separately.
    hlogBase.logSizeTracker?.UpdateSize(in newLogRecord, add: true);

    // Fire the application-level post-CAS hook BEFORE unsealing dst, so concurrent readers
    // observing dst still see SkipOnScan and retry. The hook may mutate dst's value bytes
    // and/or the source record (similar to RIPROMOTE PostCopyUpdater for live transfers).
    if (storeFunctions.CallPostCopyToTail)
    {
        var srcLogicalAddress = stackCtx.recSrc.HasMainLogSrc
            ? stackCtx.recSrc.LogicalAddress
            : pendingContext.originalAddress;
        storeFunctions.PostCopyToTail(in inputLogRecord, srcLogicalAddress, ref newLogRecord, newLogicalAddress);
    }

    newLogRecord.InfoRef.UnsealAndValidate();
Comment on lines +46 to +69
| `OnTruncate(newBA)` | After device truncate | Delete `<hash>.<addr:x16>.flush.bftree` files where `addr < newBA` plus orphaned `.tmp` files |
| `OnCheckpoint(VersionShift)` | PREPARE→IN_PROGRESS | Set checkpoint barrier; mark all entries (activated AND pending) `SnapshotPending=1` |
| `OnCheckpoint(FlushBegin)` | WAIT_FLUSH | Snapshot trees under exclusive lock; activated → `Tree.SnapshotToFile`; pending → `File.Copy(data.bftree → snapshot path)`; clear barrier |
| `OnCheckpoint(CheckpointCompleted)` | REST | No-op — Tsavorite removes per-token snapshot dirs when `removeOutdated=true`; per-flush files cleaned by `OnTruncate` |
| `OnRecovery(token)` | Before snapshot file recovery | Store recovered checkpoint token (used by `RebuildFromSnapshotIfPending`) |
| `OnRecoverySnapshotRead` | Per record from snapshot file | Set `FlagRecovered`; pre-stage `data.bftree` from `cpr-checkpoints/<token>/rangeindex/<hash>.bftree` and register pending entry (snapshot files may be deleted post-recovery) |
| `OnDispose(Deleted)` | DEL/UNLINK | Free BfTree under exclusive lock; delete all `<hash>.*` files in the riLogRoot |

### File Layout (two roots)

RangeIndex files live under **two roots**, mirroring Tsavorite's separation of log state
(`hlog`) from checkpoint state (`cpr-checkpoints`):

**Log-tied root** — `{LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/`
Co-located with `hlog.<seg>` when storage tier is enabled. Falls back through the same chain
that Tsavorite uses for `CheckpointBaseDirectory` so RangeIndex works without storage tier.
Lifetime tracks log addresses (cleared by `OnTruncate(newBA)`).

```
{riLogRoot}/
<hash>.data.bftree # bftree's working file
<hash>.data.bftree.tmp # transient (atomic copy intermediate)
<hash>.<addr:x16>.flush.bftree # immutable per-flush snapshot
```
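The `<hash>.<addr:x16>.flush.bftree` naming scheme above makes `OnTruncate(newBA)` cleanup a pure filename scan. A hedged sketch of that scan, assuming the layout shown; `parse_flush_addr`, `files_to_delete`, and the sample names are illustrative, not the actual Garnet helpers:

```python
# Sketch of OnTruncate(newBA) cleanup: per-flush snapshots whose embedded
# 16-hex-digit flush address is below the new BeginAddress are deleted,
# along with any orphaned .tmp intermediates from interrupted atomic copies.
import re

FLUSH_RE = re.compile(r"^(?P<hash>[0-9a-f]+)\.(?P<addr>[0-9a-f]{16})\.flush\.bftree$")

def parse_flush_addr(filename):
    """Return the flush address encoded in the name, or None if not a flush file."""
    m = FLUSH_RE.match(filename)
    return int(m.group("addr"), 16) if m else None

def files_to_delete(filenames, new_begin_address):
    """Select per-flush snapshots below newBA, plus orphaned .tmp files."""
    doomed = []
    for name in filenames:
        addr = parse_flush_addr(name)
        if addr is not None and addr < new_begin_address:
            doomed.append(name)
        elif name.endswith(".bftree.tmp"):
            doomed.append(name)
    return doomed
```

Note that `<hash>.data.bftree` never matches the flush pattern, so the working file survives truncation, which matches the lifetime split described above.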
Comment on lines +2364 to +2388
// Call DisposeTreeUnderLock as OnEvict would, with deleteFiles=false (eviction path).
// With IsTransferred=true, this must no-op — NOT remove the entry that belongs to
// the live record at the tail.
var keyBytes = System.Text.Encoding.ASCII.GetBytes("transtest");
rim.DisposeTreeUnderLock(keyBytes, staleStub, deleteFiles: false);

ClassicAssert.AreEqual(1, rim.LiveIndexCount,
    "DisposeTreeUnderLock on a stale (IsTransferred=true) source must NOT remove the live entry " +
    "that now belongs to the destination at the tail.");

// Live tree should still be functional.
ClassicAssert.AreEqual("value-v1", (string)db.Execute("RI.GET", "transtest", "field-x"));

// Discriminating contrast: a stub WITHOUT IsTransferred (pure pending entry, e.g.
// evicted before activation) WOULD remove the entry. This proves the discriminating
// power of the IsTransferred check.
var pendingStub = new byte[RangeIndexManager.IndexSizeBytes];
// IsTransferred=0, TreeHandle=0 — looks like a pending entry being evicted.
// Use a DIFFERENT key for this part to avoid disturbing the live entry above.
db.Execute("RI.CREATE", "pendkey", "DISK", "CACHESIZE", "65536", "MINRECORD", "8");
ClassicAssert.AreEqual(2, rim.LiveIndexCount, "second tree created");
// Now simulate eviction of a pending-entry stub for "pendkey".
var pendKeyBytes = System.Text.Encoding.ASCII.GetBytes("pendkey");
rim.DisposeTreeUnderLock(pendKeyBytes, pendingStub, deleteFiles: false);
ClassicAssert.AreEqual(1, rim.LiveIndexCount,
Comment on lines +515 to +529
public static void SnapshotToFileByPtr(nint handle, string dataPath, string targetPath)
{
    if (handle == nint.Zero)
        throw new ArgumentException("Native handle is null.", nameof(handle));
    if (string.IsNullOrEmpty(dataPath))
        throw new ArgumentException("dataPath is required.", nameof(dataPath));
    if (string.IsNullOrEmpty(targetPath))
        throw new ArgumentException("targetPath is required.", nameof(targetPath));

    int result = NativeBfTreeMethods.bftree_snapshot(handle);
    if (result != 0)
        throw new InvalidOperationException("Failed to snapshot disk-backed BfTree before file copy.");

    System.IO.File.Copy(dataPath, targetPath, overwrite: true);
}
Comment on lines 40 to +46
| Trigger | When | What it does for RangeIndex |
|---------|------|---------------------------|
| `OnFlush(addr)` | Page moves to read-only | Branch on `stub.TreeHandle`: `!=0` → `BfTreeService.SnapshotToFileByPtr(handle, dataPath, <addr:x16>.flush.bftree)`; `==0` → `File.Copy(data.bftree → <addr>.flush.bftree)`. Set `FlagFlushed`. |
| `OnEvict` | Page evicted past HeadAddress | Free BfTree under exclusive lock via `DisposeTreeUnderLock`; data files preserved for lazy restore |
| `OnDiskRead` | Record loaded from disk | Zero `TreeHandle` (native pointer is stale); no file work |
| `PostCopyToTail` | After `TryCopyToTail` CAS, before unseal | Branch on `src.TreeHandle`: `!=0` live transfer (clear src.TreeHandle); `==0` cold pre-stage (`PreStageAndRegisterPending`). Clear `FlagFlushed` on dst. |
| `OnTruncate(newBA)` | After device truncate | Delete `<hash>.<addr:x16>.flush.bftree` files where `addr < newBA` plus orphaned `.tmp` files |