
Update lifecycle for compaction, truncate, improve bftree file management #1780

Open
badrishc wants to merge 17 commits into dev from badrishc/bftree-compaction-flush

Conversation

@badrishc (Collaborator) commented May 8, 2026

No description provided.

Copilot AI review requested due to automatic review settings May 8, 2026 18:24
Contributor

Copilot AI left a comment


Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

badrishc and others added 17 commits May 8, 2026 16:45
…WInfo.SourceAddress

Foundational Tsavorite-side changes for BfTree (RangeIndex) compaction lifecycle
integration:

- IRecordTriggers.OnFlush gains a logicalAddress parameter so applications can
  derive immutable, address-tagged names for per-flush snapshot files.
- New IRecordTriggers.PostCopyToTail(in src, srcAddr, ref dst, dstAddr) trigger
  fires from TryCopyToTail after CAS-success but BEFORE UnsealAndValidate, so dst
  is sealed during the callback (concurrent readers see SkipOnScan and retry).
  Re-introduces the post-CAS hook that PR 5168a2f removed (PostSingleWriter),
  needed by RangeIndex compaction path.
- New IRecordTriggers.OnTruncate(newBA) trigger fires from
  AllocatorBase.TruncateUntilAddressBlocking AFTER device truncation completes;
  also from the recovery-time trim path. Lets applications clean up external state
  (e.g. per-flush snapshot files) tied to log addresses below the new BeginAddress.
- RMWInfo.SourceAddress added so RIPROMOTE PostCopyUpdater can plumb the source
  address (Address gets reassigned to the destination address by the time
  CopyUpdater runs).
- PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail
  fire path to surface the source logical address. Set at three call sites:
  CompactionConditionalCopyToTail (currentAddress), InternalRead's CopyFromImmutable
  (recSrc.LogicalAddress), and ContinuePending (request.logicalAddress).
- SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the
  disk-frame branch so disk-source scan records arrive at PostCopyToTail with
  invalidated TreeHandle bytes.

Storage tier baseline preserved: BasicLockTests, SpanByteLogCompactionTests, and
BasicStorageTests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, address-tagged flush, PostCTT)

Refactors RangeIndexManager and GarnetRecordTriggers to fix three defects in BfTree
lifecycle integration:

DEFECTS FIXED
- Recovery sees post-checkpoint BfTree state (silent corruption): a single deterministic
  flush.bftree per key was overwritten on every flush, so a recover-to-prior-checkpoint
  read the wrong state. Replaced with immutable, address-tagged
  <hash>.<addr:x16>.flush.bftree files.
- Compaction's TryCopyToTail leaves dst with a dangling TreeHandle: the previous
  PostSingleWriter hook was removed in 5168a2f. Re-introduce via the new
  IRecordTriggers.PostCopyToTail trigger; for live-source the trigger transfers
  ownership (clears src.TreeHandle), for disk-source it pre-stages data.bftree from
  the per-flush snapshot and registers a pending entry.
- Shift/ShiftForced (and ad-hoc Log.Truncate) leak orphan flush files: handled by the
  new OnTruncate trigger, which deletes per-flush snapshots whose addr < newBA and
  cleans orphaned .tmp files.
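The address-tagged naming scheme that fixes the first defect can be sketched as follows. `flush_file_name` is a hypothetical helper (the real logic lives in the C# RangeIndexManager); the `<hash>.<addr:x16>.flush.bftree` shape and the 16-hex-digit address format are taken from the description above.

```python
def flush_file_name(key_hash_hex: str, logical_address: int) -> str:
    """Per-flush snapshot file: <hash>.<addr:x16>.flush.bftree."""
    return f"{key_hash_hex}.{logical_address:016x}.flush.bftree"

# Two flushes of the same key at different log addresses produce two
# distinct files, so neither overwrites the other:
f1 = flush_file_name("ab12cd34", 0x1000)
f2 = flush_file_name("ab12cd34", 0x2000)
assert f1 != f2
assert f1 == "ab12cd34.0000000000001000.flush.bftree"
```

Because the address is part of the name, recovering to an older checkpoint finds the flush file that matches the stub's recorded address rather than whatever the latest flush wrote.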

LATENT BUG ALSO FIXED
- RIPROMOTE'd dst with TreeHandle=0 captured by checkpoint had no snapshot file →
  recovery to that checkpoint failed. RIPROMOTE PostCopyUpdater now branches on
  src.TreeHandle: !=0 live transfer (existing behavior); ==0 cold case pre-stages
  data.bftree from <srcAddr>.flush.bftree and registers a pending entry so the
  checkpoint snapshot captures it correctly.

KEY DESIGN CHANGES
- liveIndexes re-keyed by Guid (XxHash128 of the key bytes, same scheme as the on-disk
  filename prefix). Hot-path access only for checkpoint coordination
  (WaitForTreeCheckpoint), gated by checkpointInProgress short-circuit.
- TreeEntry.Tree is nullable: null = pending (data.bftree pre-staged but no native tree
  opened yet). Pending entries are upgraded to activated on first RestoreTree call.
- TWO roots, mirroring Tsavorite's hlog/cpr-checkpoints separation:
  riLogRoot = {LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/   (working+flush)
  cprDir/<token>/rangeindex/                                       (per-checkpoint)
  Per-checkpoint snapshots are deleted automatically when Tsavorite removes the parent
  token directory; PurgeOldCheckpointSnapshots removed.
- Flat layout (no per-key subdirectories): single Directory.CreateDirectory at startup,
  single EnumerateFiles for OnTruncate, no orphan-dir cleanup needed.
- RestoreTree (renamed from RestoreTreeFromFlush) is now trivial: opens data.bftree
  directly. Pre-staging always happened earlier (PostCopyToTail-cold,
  RIPROMOTE-cold, OnRecoverySnapshotRead).
- OnRecoverySnapshotRead pre-stages data.bftree from cpr-checkpoints/<recoveredToken>/
  rangeindex/<hash>.bftree DURING recovery (snapshot files may be deleted post-recovery).
- Storage tier optional: works without LogDir (falls back to CheckpointDir → cwd).
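The two-roots layout and the LogDir fallback chain can be sketched as below. The helper names are illustrative; the fallback order (`LogDir ?? CheckpointDir ?? cwd`) and the directory suffixes are the ones stated above.

```python
import os

def resolve_ri_log_root(log_dir, checkpoint_dir):
    """Working + per-flush files: {LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/."""
    base = log_dir or checkpoint_dir or os.getcwd()
    return os.path.join(base, "Store", "rangeindex")

def resolve_cpr_snapshot_dir(cpr_dir, token):
    """Per-checkpoint snapshots: cprDir/<token>/rangeindex/."""
    return os.path.join(cpr_dir, token, "rangeindex")

assert resolve_ri_log_root("/var/log", None) == os.path.join("/var/log", "Store", "rangeindex")
assert resolve_ri_log_root(None, "/cpr") == os.path.join("/cpr", "Store", "rangeindex")
assert resolve_cpr_snapshot_dir("/cpr", "tok123") == os.path.join("/cpr", "tok123", "rangeindex")
```

Keeping the per-checkpoint root under the token directory is what lets Tsavorite's normal token-directory removal reclaim those snapshots with no extra purge pass.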

API CHANGES (Tsavorite)
- IRecordTriggers.OnFlush now takes (ref LogRecord, long logicalAddress).
- New IRecordTriggers.PostCopyToTail<TSource>(in src, srcAddr, ref dst, dstAddr).
- New IRecordTriggers.OnTruncate(long newBeginAddress).
- RMWInfo.SourceAddress added; preserved through CopyUpdater so PostCopyUpdater can
  see the source address (rmwInfo.Address gets reassigned to dst by then).
- PendingContext.originalAddress is now also used by TryCopyToTail's PostCopyToTail
  fire path; set at three call sites (compaction, CopyReadsToTail, ContinuePending).
- SpanByteScanIterator and ObjectScanIterator now invoke OnDiskRead in the disk-frame
  branch.

API CHANGES (Garnet)
- BfTreeService.SnapshotToFileByPtr(handle, dataPath, targetPath) — static helper for
  OnFlush which has only the stub's TreeHandle, not the managed wrapper.
- RangeIndexManager.IsTreeLive removed; metrics use stub.TreeHandle != 0 directly.
- RangeIndexManager constructor signature: (enabled, riLogRoot, cprDir, removeOutdated, logger).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…sk-cleanup tests for new layout

Tsavorite: new RecordTriggersExtTests (9 tests) covering OnFlush(addr), OnTruncate,
PostCopyToTail, plus default-trigger flag verification. Updates RecordLifecycleTests'
LifecycleRecordTriggers.OnFlush to the new (ref LogRecord, long) signature so the
existing CallOnFlushGatesOnFlushInvocation test fires correctly.

Garnet: updates RIDiskFileCleanupOnDeleteTest and
RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest to expect the new flat file
layout under {LogDir}/Store/rangeindex/<hash>.* instead of the old
{CheckpointDir}/checkpoints/rangeindex/<hash>/ per-key directories.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…te orphan-tmp cleanup

RIFlushFilesAreImmutablePerAddressTest: directly verifies the on-disk invariant that
fixes Defect 1. Inserts v1, force-flushes (creates <hash>.<A1>.flush.bftree), captures
file content, mutates to v2, force-flushes again. Asserts the v1 file STILL exists
with byte-identical content (proving immutability) AND that a NEW v2 file was created
at a distinct address (proving the second flush did not overwrite). Before the fix,
this test would fail: the single deterministic flush.bftree was overwritten on every flush.

RIOnTruncateRemovesOrphanTmpFilesTest: drops a bogus .data.bftree.tmp file in the
riLogRoot, triggers ShiftBeginAddress(_, truncateLog: true), and asserts the orphan
tmp file is deleted. Verifies the new IRecordTriggers.OnTruncate trigger fires post-
device-truncate and that GarnetRecordTriggers.OnTruncate cleans .tmp files.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Updates the IRecordTriggers Lifecycle table for the new triggers (PostCopyToTail,
OnTruncate) and the OnFlush(addr) signature change. Adds new sections:
- File Layout (two roots): riLogRoot for log-tied files, cprDir/<token>/rangeindex/
  for per-checkpoint snapshots; both flat layouts (no per-key subdirs).
- liveIndexes — Guid-keyed single dictionary; activated vs pending entries.
- liveIndexes hot-path discipline (only WaitForTreeCheckpoint, gated by short-circuit).
- Updated Lazy Promote to describe the new RIPROMOTE PostCopyUpdater cold branch.
- Updated Lazy Restore to describe the trivial RestoreTree (data.bftree direct open).
- Updated Checkpoint Consistency: SnapshotPending=1 on activated AND pending entries;
  pending → File.Copy(data.bftree → snapshot path).
- New Recovery Flow section: OnRecoverySnapshotRead pre-stages above-FUA stubs DURING
  recovery; below-FUA stubs handled lazily via RIPROMOTE-cold + per-flush snapshots.
- New Compaction Lifecycle section: PostCopyToTail live transfer vs cold pre-stage.
- New Scratchpad: Key Plumbing for PostCopyToTail (originalAddress + RMWInfo.SourceAddress).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… cleanup

The previous guard 'if (deleteFiles && stub.StorageBackend == Disk) DeleteTreeFiles(key)'
was a fragile defense that:

(a) Fails defense in depth: DeleteTreeFiles is already idempotent (enumerate-and-delete
    on a hash prefix; no-op if no files match), so the backend check adds no protection
    while creating a path-dependent leak.

(b) Forecloses future memory-snapshot support: the native bftree library exposes
    bftree_snapshot_memory / bftree_recover_memory (currently NotSupported but planned).
    When that ships, a memory-backed tree could legitimately have on-disk artifacts via
    OnFlush / PreStageAndRegisterPending. Without removing this guard, DEL on such a key
    would leak files.

(c) Fails to clean up stale prior-incarnation files: any <hash>.* file existing on disk
    when the key is re-created (or when the backend value is mutated through a future
    code path) would be permanently unreclaimable on DEL.

Drop the guard. DeleteTreeFiles always cleans up any <hash>.* files in riLogRoot
when deleteFiles=true, regardless of current StorageBackend.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…oundUpToPowerOf2

These were inadvertently dropped during the larger refactor of RangeIndexManager.cs.
Restoring the original docs:
- ComputeLeafPageSize: full <param>/<returns> describing the size buckets.
- ComputeLeafPageSize body: '2.5x, capped at 32KB' and 'Round up to next power of 2'
  inline comments at the bucket-computation steps.
- RoundUpToPowerOf2: <summary> describing the bit-manipulation algorithm.
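A sketch of the two helpers whose docs are being restored. The bit-manipulation round-up is the standard algorithm the `<summary>` describes; the bucket computation (scale by 2.5x, cap at 32 KB, round up to the next power of 2) follows the restored inline comments, though the exact input parameter of the real C# `ComputeLeafPageSize` is an assumption here.

```python
def round_up_to_power_of_2(v: int) -> int:
    """Round a positive int up to the next power of 2 via bit smearing."""
    v -= 1
    v |= v >> 1
    v |= v >> 2
    v |= v >> 4
    v |= v >> 8
    v |= v >> 16
    return v + 1

def compute_leaf_page_size(estimated_entry_bytes: int) -> int:
    """Hypothetical reconstruction: 2.5x, capped at 32KB, then power-of-2."""
    scaled = min(int(estimated_entry_bytes * 2.5), 32 * 1024)
    return round_up_to_power_of_2(scaled)

assert round_up_to_power_of_2(1000) == 1024
assert compute_leaf_page_size(100) == 256      # 250 rounds up to 256
assert compute_leaf_page_size(20000) == 32768  # capped at 32 KB
```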

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… File.Move pattern)

Both PreStageAndRegisterPending and RebuildFromSnapshotIfPending used a .tmp + File.Move
atomic-rename pattern that on closer inspection adds no safety:

PreStageAndRegisterPending callers (PostCopyToTail, RIPROMOTE PostCopyUpdater) run
BEFORE UnsealAndValidate on the destination record, so concurrent readers see SkipOnScan
and retry — no read race possible. Tsavorite's hash-bucket CAS serializes concurrent
writers per key (only one CopyUpdater/TryCopyToTail succeeds per CAS event), so two
PreStage calls cannot race on the same data.bftree file. OnEvict closes the prior
native BfTree before any cold-restore can happen, so no live native fd is open on
data.bftree when this runs.

RebuildFromSnapshotIfPending runs DURING recovery, which is single-threaded with no
concurrent readers. A crash mid-copy is self-healing: the next recovery attempt
re-fires OnRecoverySnapshotRead for the same stub and overwrites any partial file
before any RestoreTree can observe it.

Both functions now use direct File.Copy(src, dst, overwrite: true). Drop the
LogTmpPath helper (no longer used); keep the .tmp suffix cleanup in OnTruncateImpl
as defense in depth for any legacy or future atomic-rename code paths.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…exclusive lock

Critical concurrency fix exposed by the question 'is RIPROMOTE running under exclusive
lock?'. The answer is no:

  • ReadRangeIndex RELEASES the shared lock before invoking PromoteToTail.
  • There is no TryPromoteSharedLock upgrade anywhere in RangeIndex code.
  • RIPROMOTE's RMW (CopyUpdater + PostCopyUpdater) runs with no rangeIndex lock held.

Combined with: CASRecordIntoChain in Helpers.cs:177 unseals dst IMMEDIATELY on CAS
success, so by the time PostCopyUpdater fires, concurrent readers can already observe
dst with TreeHandle == 0 and invoke RestoreTree, which opens data.bftree directly.

Race scenario (cold case):
1. Thread A: PostCopyUpdater starts PreStageAndRegisterPending — File.Copy partway done.
2. Thread B: ReadRangeIndex sees the unsealed dst with TreeHandle=0, releases shared
   lock, acquires exclusive lock for RestoreTree.
3. Thread B: RestoreTree opens data.bftree → reads partial file → corruption.

Atomic-rename .tmp + File.Move doesn't fix this either — Thread B might see 'file
doesn't exist' before the rename completes and return spurious NOTFOUND.

Fix: PreStageAndRegisterPending now acquires the per-key EXCLUSIVE rangeIndex lock for
the duration of File.Copy AND the pending-entry registration. This blocks RestoreTree's
exclusive acquire until pre-stage completes; Thread B then sees the fully-written
data.bftree.

A direct File.Copy is sufficient under this lock — the exclusive lock serializes
against any reader that would open data.bftree, against other concurrent
PreStageAndRegisterPending calls for the same key, and crash-mid-copy is self-healing
via OnRecoverySnapshotRead or RIPROMOTE-cold re-pre-stage.
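The shape of the fix can be sketched in a few lines. This is a minimal model, not Garnet's API: `key_locks` and `pending` are illustrative stand-ins for the per-key exclusive lock and the pending-entry table, and a plain `threading.Lock` approximates the exclusive mode of the real reader-writer lock.

```python
import shutil
import threading
from collections import defaultdict

key_locks = defaultdict(threading.Lock)   # stand-in for per-key exclusive locks
pending = {}                              # stand-in for pending-entry registration

def pre_stage_and_register_pending(key, flush_path, data_path):
    with key_locks[key]:                        # blocks RestoreTree's acquire
        shutil.copyfile(flush_path, data_path)  # direct copy; no .tmp + rename
        pending[key] = data_path                # registered under the same lock

def restore_tree(key, data_path):
    with key_locks[key]:                        # waits until pre-stage completes
        with open(data_path, "rb") as f:        # stands in for the native open
            return f.read()
```

Because `restore_tree` takes the same lock, it can only ever observe a fully written data.bftree, which is exactly the property the partial-copy race violated.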

RebuildFromSnapshotIfPending continues to use direct File.Copy without a lock —
recovery is single-threaded with no concurrent readers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…lush invariant, OnTruncate)

ISSUE 1 (BLOCKER): Stale source eviction removed the live entry.

  After PostCopyToTail-live or RIPROMOTE-live, the source record's TreeHandle was
  cleared but the record otherwise stayed in the chain. When the source page later
  evicted, OnEvict→DisposeTreeUnderLock saw TreeHandle==0 and hit the early branch,
  removing the liveIndexes entry that NOW belonged to the destination — losing
  checkpoint coverage and DEL-time native-tree disposal.

  Fix: Add RangeIndexStub.FlagTransferred = 4 (uses one of the reserved Flag bits;
  no stub-layout change). Both transfer paths now set this flag on src in BOTH live
  and cold branches:
    • GarnetRecordTriggers.PostCopyToTail (live + cold).
    • RMWMethods.PostCopyUpdater (RIPROMOTE live + cold).
  DisposeTreeUnderLock checks FlagTransferred FIRST — when set, no-op (the entry
  belongs to a newer destination record). For DEL/UNLINK (deleteFiles=true) we
  still proceed with file cleanup.

  GarnetRecordTriggers.OnFlush also checks FlagTransferred — a transferred-out src
  must NOT snapshot a stale view (TreeHandle was cleared, and any snapshot from
  data.bftree would race with dst's active mutations).

  Regression test: RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest.
  Verified the test fails when the FlagTransferred check is removed.

ISSUE 2 (MAJOR): Pre-stage failures left dst in a broken state.

  PreStageAndRegisterPending logged a warning and returned silently on missing
  source flush file or copy failure. The caller (PostCopyToTail / PostCopyUpdater)
  cleared FlagFlushed unconditionally, leaving an above-FUA dst stub with
  TreeHandle=0, FlagFlushed=0, no data.bftree, and no pending entry — violating
  Invariants 4 and 5.

  Same problem in OnFlush: when TreeHandle=0 and data.bftree was missing, OnFlush
  silently skipped the snapshot but still set FlagFlushed → unrestorable state.

  Fix: PreStageAndRegisterPending logs as ERROR (not warning) on source-missing
  and explicitly does NOT fall back to any other flush file (per user direction:
  recovering the wrong tree version is worse than failing). The destination's
  RestoreTree will return NOTFOUND for the affected key, surfacing the data loss
  explicitly. OnFlush also logs as ERROR and DOES NOT set FlagFlushed when the
  source data.bftree is missing — the next access routes through RestoreTree
  which returns NOTFOUND for the key while leaving the record otherwise consistent.

ISSUE 3 (MAJOR): Cluster mode disabled all RangeIndex truncation cleanup.

  OnTruncateImpl returned early when removeOutdatedCheckpoints=false (cluster
  mode), leaking obsolete <hash>.<addr>.flush.bftree files and stale .tmp files
  indefinitely.

  Per-flush files are LOG-tied (their lifetime tracks log addresses), not
  checkpoint-tied — they are safe to delete once Tsavorite's BeginAddress passes
  their address, regardless of cluster mode. removeOutdatedCheckpoints controls
  CHECKPOINT-token-tied snapshot retention only.

  Fix: the log-tied cleanup in OnTruncateImpl now runs unconditionally,
  regardless of removeOutdatedCheckpoints.
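The log-tied reclamation rule can be sketched as a filename predicate. The regex and helper name are illustrative; the `<hash>.<addr:x16>.flush.bftree` shape and the "addr < new BeginAddress" condition come from the description above.

```python
import re

FLUSH_FILE = re.compile(r"^(?P<hash>[0-9a-f]+)\.(?P<addr>[0-9a-f]{16})\.flush\.bftree$")

def reclaimable(filename: str, new_begin_address: int) -> bool:
    """A per-flush snapshot is safe to delete once BeginAddress passes its address."""
    m = FLUSH_FILE.match(filename)
    return m is not None and int(m.group("addr"), 16) < new_begin_address

assert reclaimable("ab12.0000000000001000.flush.bftree", 0x2000)
assert not reclaimable("ab12.0000000000003000.flush.bftree", 0x2000)
assert not reclaimable("ab12.data.bftree", 0x2000)  # working file: not log-reclaimed
```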

All 61 RangeIndex tests + 35 Tsavorite tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Bug: DeleteTreeFiles indiscriminately deleted ALL <hash>.* files in riLogRoot,
including <hash>.<addr:x16>.flush.bftree files that may still be required to recover
an OLDER checkpoint taken BEFORE the DEL operation.

Pre-fix data-loss flow:
  T=0: RI.CREATE key K, write, flush → <hash>.<A1>.flush.bftree
  T=1: SAVE (checkpoint C1)
  T=2: more writes, more flushes → <hash>.<A2>.flush.bftree
  T=3: DEL K → ALL <hash>.* files deleted, including <A1>.flush.bftree
  T=4: crash; recover to C1 → recovered log has stub at A1 with FlagFlushed=1
  T=5: RI.GET → RIPROMOTE-cold → looks for <hash>.<A1>.flush.bftree → MISSING
  → silent data loss for the recovered key.

Fix: DeleteTreeFiles now only deletes the WORKING files:
  • <hash>.data.bftree
  • <hash>.data.bftree.tmp (defensive, in case of any legacy atomic-rename leftover)

Per-flush snapshot files (<hash>.<addr:x16>.flush.bftree) are LOG-tied: they may be
needed to recover an older checkpoint, and are reclaimed by OnTruncate only when
Tsavorite's BeginAddress passes their address — at which point no checkpoint can
recover the pre-delete state anyway.

Per-checkpoint snapshots under cpr-checkpoints/<token>/rangeindex/<hash>.bftree are
already not touched here (different root); Tsavorite removes them with the parent
token directory.

Side effect: DeleteTreeFiles no longer iterates the directory — direct File.Delete
on two known paths. The ONLY remaining directory iteration in production code is
now OnTruncateImpl (which is unavoidable: it scans for files matching addr<newBA).

Updated tests:
  • RIDiskFileCleanupOnDeleteTest now only asserts data.bftree is removed.
  • RIDiskFileCleanupOnDeleteAfterEvictionAndRestoreTest now asserts that the
    flush.bftree files PRESERVE their count after DEL (not zero).
  • New RIDeletePreservesPerFlushSnapshotFilesTest explicitly asserts the
    LOG-tied lifetime contract: DEL preserves <hash>.<addr>.flush.bftree files.

All 62 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
No production path produces <hash>.data.bftree.tmp files. Every file-write path
in RangeIndexManager uses direct File.Copy(overwrite: true):
  • PreStageAndRegisterPending (under per-key exclusive lock)
  • RebuildFromSnapshotIfPending (single-threaded recovery)
  • Per-checkpoint snapshot capture
Verified across libs/ that no File.Move calls exist anywhere, and the Rust
BfTree library does not produce .tmp files either.

The .tmp handling that existed was speculative cleanup for 'legacy or future
atomic-rename code paths' (per its own doc comment). There is no legacy
deployment of this format (this PR introduces it), and there is no plan to
migrate to a .tmp + Move pattern (current design relies instead on:
  - Crash-safety of partial data.bftree via re-pre-stage on next access /
    next OnRecoverySnapshotRead overwrite.
  - The per-key exclusive lock serializing readers vs the in-flight Copy.)

Changes:
  • DeleteTreeFiles: deletes only <hash>.data.bftree (one File.Delete, no
    helper closure, no enumeration). Updated xmldoc.
  • OnTruncateImpl: dropped the .data.bftree.tmp branch (~6 lines + comment);
    only enumerates / matches <hash>.<addr:x16>.flush.bftree now.
  • RebuildFromSnapshotIfPending: dropped 'cf. .tmp + File.Move' aside from
    xmldoc; kept the crash-safety rationale.
  • GarnetRecordTriggers.OnDiskRead-cold branch: dropped '(atomic via .tmp +
    File.Move)' aside.
  • DisposeTreeUnderLock xmldoc: simplified deleteFiles description (no
    longer references .tmp).
  • Test RIOnTruncateRemovesOrphanTmpFilesTest: removed (purely synthetic;
    asserted behavior of code with no production caller).
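After this simplification, the delete path reduces to a single delete on the one working file. A minimal sketch (helper name and idempotent-on-missing behavior are illustrative; the real code is one C# File.Delete):

```python
import os

def delete_tree_files(ri_log_root: str, key_hash_hex: str) -> None:
    """Delete only the working file; per-flush snapshots are log-tied
    and reclaimed by OnTruncate, never here."""
    path = os.path.join(ri_log_root, f"{key_hash_hex}.data.bftree")
    try:
        os.remove(path)
    except FileNotFoundError:
        pass  # idempotent: nothing to delete is a no-op
```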

All 61 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Refactor RangeIndexStub flag handling to mirror Tsavorite's RecordInfo pattern:
typed properties for each bit, private backing field, centralized reset, and
explicit reservation of bits for future use.

Changes
-------
RangeIndexStub:
  • New top-of-struct comment block documenting the full bit layout:
      [Unused7][Unused6][Unused5][Pinned*][Modified*][Transferred][Recovered][Flushed]
  • Bit constants: kFlushedBitMask, kRecoveredBitMask, kTransferredBitMask
    (private const byte; mask-only form since byte is small enough that
    offset/mask split adds no value).
  • Bits 3 (Modified) and 4 (Pinned) RESERVED — masks deferred until those
    semantics actually exist (avoids inviting premature usage). Bits 5..7
    free for future expansion → ≥5 bits headroom after Modified+Pinned land.
  • IsFlushed / IsRecovered / IsTransferred are now SETTABLE properties
    (was readonly-get only) — get returns (flags & mask) != 0; set uses the
    standard ternary pattern. Inlined by JIT, byte-identical codegen vs
    the prior 'stub.Flags |= mask;' calls.
  • 'public byte Flags' renamed to 'private byte flags' — all writes must
    now go through typed properties or ResetFlags(), centralizing invariant
    control as more bits gain semantics.
  • New ResetFlags() instance method (AggressiveInlining) for the two
    legitimate full-reset callsites.
  • Old 'internal const byte FlagFlushed/FlagRecovered/FlagTransferred'
    members removed.
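An illustrative Python analogue of the pattern (private flags byte, mask constants, settable boolean properties, centralized reset). The bit values follow the documented layout, where Flushed/Recovered/Transferred occupy the low three bits (consistent with the raw-stub test setting byte 33 to 4 for Transferred); everything else here is a sketch, not the C# struct.

```python
K_FLUSHED, K_RECOVERED, K_TRANSFERRED = 0x01, 0x02, 0x04  # bits 3-4 reserved

class RangeIndexStubSketch:
    def __init__(self):
        self._flags = 0  # mirrors 'private byte flags'

    def _get(self, mask):
        return (self._flags & mask) != 0

    def _set(self, mask, value):
        # the standard ternary pattern: OR on set, AND-NOT on clear
        self._flags = (self._flags | mask) if value else (self._flags & ~mask)

    @property
    def is_flushed(self): return self._get(K_FLUSHED)
    @is_flushed.setter
    def is_flushed(self, v): self._set(K_FLUSHED, v)

    @property
    def is_transferred(self): return self._get(K_TRANSFERRED)
    @is_transferred.setter
    def is_transferred(self, v): self._set(K_TRANSFERRED, v)

    def reset_flags(self):
        self._flags = 0  # the one sanctioned full reset

s = RangeIndexStubSketch()
s.is_flushed = True
s.is_transferred = True
assert s._flags == 0x05
s.is_flushed = False
assert s._flags == 0x04
s.reset_flags()
assert s._flags == 0
```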

RangeIndexManager.Index.cs:
  • CreateIndex: 'stub.Flags = 0' → 'stub.ResetFlags()'.
  • RecreateIndex: '&= ~FlagRecovered' → 'stub.IsRecovered = false'.
  • Static span-based wrappers (SetFlushedFlag, ClearFlushedFlag,
    SetTransferredFlag, MarkRecoveredFromCheckpoint) reimplemented via the
    new properties. Helper signatures + AggressiveInlining unchanged so all
    callsites compile without modification and produce identical machine code.
  • Composite helpers (ClearTreeHandle, InvalidateStub,
    MarkRecoveredFromCheckpoint) kept as separate semantic entry points —
    their underlying writes may converge today but encode different
    invariants (ownership transfer vs disk-load staleness vs recovery
    marker).

RangeIndexManager.cs:
  • HandleRangeIndexCreateReplay (AOF replay): 'stub.Flags = 0' →
    'stub.ResetFlags()'.
  • Doc xref FlagFlushed → IsFlushed.

RangeIndexOps.cs:
  • Object initializer 'Flags = 0' line removed (struct default-initializes
    the private flags byte to 0).

RMWMethods.cs / RespRangeIndexTests.cs:
  • Comment / message references to FlagFlushed/FlagRecovered/FlagTransferred
    updated to IsFlushed/IsRecovered/IsTransferred.

Wire format
-----------
Unchanged. Byte 33 (the flags byte) keeps the same bit assignments. Existing
checkpoint snapshots, AOF entries, and on-disk records remain readable. The
test that constructs a raw 35-byte stub via 'staleStub[33] = 4' continues to
exercise the on-the-wire format directly (now with corrected boolean wording
in its assertion messages).

Perf
----
Zero. Property setters/getters are inlined; literal true/false at every
callsite folds the setter ternary at JIT time → same OR/AND/STORE as before.
'private' is purely a compile-time access modifier with no runtime cost.

All 61 RangeIndex tests pass.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comments
--------
Make all RangeIndex comments self-contained, present-tense, and accurate:
  • Replace stale FlagFlushed/FlagRecovered/FlagTransferred references in
    comments and doc strings with the current property names IsFlushed/
    IsRecovered/IsTransferred (also normalize 'FlagX=1' to 'IsX=true').
  • Test RIFlushFilesAreImmutablePerAddressTest: rewrite xmldoc as a
    behavioral spec for the per-flush snapshot file immutability invariant
    (drop 'Defect 1', 'Pre-fix behavior', 'Post-fix behavior'); rename
    test key 'defect1key' → 'flushtestkey'; remove 'Defect 1 invariant'
    label from assertion message.
  • Test RIDisposeTreeUnderLockNoOpsOnTransferredSourceTest: rewrite xmldoc
    and assertion messages as behavioral specs (drop 'GPT-5.5-flagged
    blocker', 'The bug: pre-fix...', 'The fix: ...'). The discriminating
    contrast (without IsTransferred → entry IS removed) remains as a
    present-tense spec check.
  • GarnetRecordTriggers.OnFlush cold branch: rephrase 'Invariant 1+5' /
    'violates Invariant 5' as 'per-flush snapshot invariant has been
    violated' and inline what the consequence is.

Format
------
Tsavorite test file RecordTriggersExtTests.cs had a trailing newline
beyond the canonical single-newline EOF; removed.

Verification
------------
  • dotnet format Garnet.slnx --verify-no-changes      → clean
  • dotnet format Tsavorite.slnx --verify-no-changes   → clean
  • dotnet build test/Garnet.test                      → 0 warnings, 0 errors
  • RespRangeIndexTests (61 tests)                     → all pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Intermittent CI failure on this PR (and any PR touching native bindings):

  Error: cargo.exe failed with exit code 101
  command: cargo metadata --all-features --format-version 1 --no-deps
  stderr: error: could not find Cargo.toml in D:\a\garnet\garnet or any parent directory

Root cause
----------
actions-rust-lang/setup-rust-toolchain@v1 defaults to cache: true with
cache-workspaces: '. -> target'. Internally it invokes Swatinem/rust-cache,
which runs 'cargo metadata' from the repo root to build the cache key. The
Garnet repo has no root Cargo.toml — the only Cargo workspace lives at
libs/native/bftree-garnet — so the metadata call exits 101.

The failure is intermittent because the cache action sometimes recovers
(falls back to the lockfile-based key it already discovered) and sometimes
surfaces the exit 101 as a step failure.

Fix
---
Pass cache-workspaces: libs/native/bftree-garnet to the action so rust-cache
runs cargo metadata from the directory that actually contains Cargo.toml.
The cache-key derivation continues to work and the spurious exit-101 error
goes away.
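A hedged sketch of the workflow change (step name and surrounding context are illustrative; the `cache-workspaces` value is the one named in the fix):

```yaml
- name: Set up Rust
  uses: actions-rust-lang/setup-rust-toolchain@v1
  with:
    # Point rust-cache's 'cargo metadata' at the directory that
    # actually contains Cargo.toml, not the repo root.
    cache-workspaces: libs/native/bftree-garnet
```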

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Regression introduced in commit 30885d3 ('RangeIndex: refactor for compaction
lifecycle'): the dev branch's OnFlush trigger delegated to
RangeIndexManager.SnapshotTreeForFlush(), which acquired the per-key exclusive lock around the
native bftree snapshot call. The refactor inlined the snapshot body directly
into the OnFlush trigger as a raw BfTreeService.SnapshotToFileByPtr(...) call
with NO lock — silently dropping the serialization guarantee.

Why this matters
----------------
At the moment OnFlush fires on a record, that record:
  - Is the chain head (no RIPROMOTE has occurred yet — IsTransferred=false).
  - Has TreeHandle pointing at the live native tree.
  - Has IsFlushed=false (about to be set TO true by OnFlush at the end).

Worker threads doing RI.SET / RI.DEL on this record use the same TreeHandle
to call native InsertByPtr / DeleteByPtr while holding the per-key SHARED
lock. Without the per-key exclusive lock in OnFlush, the native
SnapshotToFileByPtr can run concurrently with InsertByPtr on the same tree.
If the BfTree native lib does not fully serialize snapshot vs insert
internally, this races and can hang or corrupt.

This is the root cause of the rare RIConcurrentOpsWithCheckpointTest hang in
CI. The test maximizes the race window by having 4 worker threads
tight-looping RI.SET on a single key while a 5th thread takes a SAVE
checkpoint — the IN_PROGRESS → WAIT_FLUSH transition in Tsavorite's
checkpoint state machine triggers OnFlush per record, racing with workers
that entered before SetCheckpointBarrier set checkpointInProgress=true.

The IsTransferred early-return at the top of OnFlush is NOT enough on its
own. It only short-circuits re-firings of OnFlush on a stale source AFTER
the source has been transferred to a tail destination via PostCopyToTail
or RIPROMOTE PostCopyUpdater. For the FIRST OnFlush on any record,
IsTransferred=false because the transfer is triggered BY IsFlushed=true,
which OnFlush is the one to set.

Fix
---
Centralize the snapshot-for-flush logic in
RangeIndexManager.SnapshotTreeForFlush(key, valueSpan, logicalAddress) —
mirroring the dev branch pattern. Acquire the per-key exclusive lock around
the snapshot/copy body. The OnFlush trigger is now a thin delegation. This
restores the serialization guarantee against:
  - Concurrent worker InsertByPtr/DeleteByPtr on the same TreeHandle.
  - Concurrent SnapshotAllTreesForCheckpoint on the same TreeHandle (which
    already holds the same per-key exclusive lock).
  - Concurrent PreStageAndRegisterPending data.bftree rewrite (relevant to
    the cold-path File.Copy(data.bftree -> flush.bftree)).

The IsFlushed flag is now set INSIDE the lock body on success, and is
explicitly NOT set on the cold-path-data.bftree-missing path — preserving
the existing 'route through RestoreTree -> NOTFOUND' contract for
unrestorable records. (This is a small improvement over the prior code,
which set IsFlushed unconditionally after the native call returned.)
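The restored control flow can be modeled in a few lines. This is a sketch, not Garnet's API: the stub is a dict, `snapshot_to_file` stands in for the native snapshot (or cold File.Copy), and a plain `threading.Lock` approximates the exclusive mode of the per-key lock.

```python
import threading
from collections import defaultdict

key_locks = defaultdict(threading.Lock)  # stand-in for per-key exclusive locks

def snapshot_tree_for_flush(key, stub, snapshot_to_file) -> bool:
    if stub["is_transferred"]:        # transferred-out source: never snapshot stale state
        return False
    with key_locks[key]:              # serializes vs worker InsertByPtr/DeleteByPtr
        if not snapshot_to_file():    # snapshot failed (e.g. data.bftree missing)
            return False              # do NOT set IsFlushed: surface as unrestorable
        stub["is_flushed"] = True     # set inside the lock, on success only
        return True
```

The key properties the fix restores are visible in the sketch: the snapshot and the flag write happen under the same lock, and failure leaves IsFlushed unset.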

Test safety
-----------
Added [CancelAfter(120_000)] to RIConcurrentOpsWithCheckpointTest so any
remaining or unrelated hang in this stress test surfaces as an NUnit
timeout failure with a stack trace, in addition to CI's existing
--blame-hang-timeout dump.

Verification
------------
  - dotnet format Garnet.slnx --verify-no-changes -> clean
  - dotnet build test/Garnet.test                  -> 0 warnings, 0 errors
  - 61 RespRangeIndexTests                          -> all pass

Note: holding additional fixes pending the dev-vs-PR design audit
currently in progress (will land as separate commits if/when surfacing
other lost design aspects).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Restore three protective measures that were silently weakened during the
compaction-lifecycle refactor (commit 30885d3).

1. SnapshotTreeForFlush: don't gate memory-backend re-anchor on riLogRoot
---------------------------------------------------------------------------
The first line of SnapshotTreeForFlush early-returned on
'string.IsNullOrEmpty(riLogRoot)' BEFORE reaching the
'StorageBackend == Memory' branch that calls SetFlushedFlag. For a
memory-backed RI tree in any configuration where riLogRoot is null/empty
(unit tests instantiating RangeIndexManager(true) directly, or any future
deployment without a log dir), the memory branch was unreachable.

Without IsFlushed=true, ReadRangeIndex never triggers PromoteToTail
(RIPROMOTE) — the stub stays frozen in the read-only region. On first page
eviction, OnEvict -> DisposeTreeUnderLock disposes the live BfTree (no
IsTransferred guard fires). All subsequent accesses route through
RestoreTree -> no data.bftree (memory tree never had one) -> silent
NOTFOUND on a key that was supposed to live in memory.

In production today this is unreachable because GarnetServer.cs:321 always
computes a non-empty riLogRoot. Defense-in-depth fix: reorder so the memory
branch fires regardless of riLogRoot, and only gate disk-backed paths on
the riLogRoot check.
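The reordering described above can be sketched in a few lines. This is an illustrative, language-agnostic sketch (not Garnet's C# code); the names `storage_backend`, `ri_log_root`, `set_flushed_flag`, and `snapshot_to_file` are hypothetical stand-ins for the real members:

```python
# Sketch of the SnapshotTreeForFlush guard reordering: the memory-backend
# branch runs BEFORE the riLogRoot gate, so a memory-backed tree always gets
# IsFlushed=true even when no log directory is configured.
def snapshot_tree_for_flush(stub, ri_log_root):
    # Memory-backed trees need only the IsFlushed flag; handle them first,
    # regardless of whether a log root exists.
    if stub.storage_backend == "Memory":
        stub.set_flushed_flag()
        return

    # Only disk-backed snapshot paths require a log directory.
    if not ri_log_root:
        return

    stub.snapshot_to_file(ri_log_root)
    stub.set_flushed_flag()
```

With this ordering, the early return on a null/empty `ri_log_root` can no longer starve the memory branch, which is exactly the defense-in-depth property the fix restores.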

2. DisposeTreeUnderLock: take per-key lock around cold-eviction TryRemove
-------------------------------------------------------------------------
The cold-eviction branch (TreeHandle==0, deleteFiles=false) early-returned
with 'liveIndexes.TryRemove(KeyId(key), out _)' OUTSIDE the per-key
exclusive lock. Every other liveIndexes mutator (RegisterIndex,
UnregisterIndex, PreStageAndRegisterPending) holds the lock — this single
instance broke the discipline.

ConcurrentDictionary is thread-safe at the operation level, but invariants
spanning multiple operations need external serialization. Race scenario:
SetCheckpointBarrier marks pending entry's SnapshotPending=1, then a
concurrent cold eviction TryRemoves it before SnapshotAllTreesForCheckpoint
captures it. Result: silent checkpoint coverage gap for the affected key —
recovery from that checkpoint produces NOTFOUND.

Fix: acquire the lock first, then dispatch on TreeHandle/deleteFiles inside
the locked block. Cost: one extra lock acquisition on the cold-eviction
path, with no shared-lock contention because the page is being torn down.
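The lock-then-dispatch shape of the fix can be illustrated as follows. This is a minimal sketch, not Garnet's implementation; `TreeRegistry`, `live_indexes`, and the per-key lock map are hypothetical names standing in for the real structures:

```python
# Sketch of the DisposeTreeUnderLock fix: acquire the per-key lock
# unconditionally, then branch on (tree_handle, delete_files) INSIDE the
# locked region, so every live_indexes mutation is serialized with the
# checkpoint-barrier marking done by other lock holders.
import threading

class TreeRegistry:
    def __init__(self):
        self.live_indexes = {}   # key_id -> entry
        self.key_locks = {}      # key_id -> per-key exclusive lock

    def _lock_for(self, key_id):
        return self.key_locks.setdefault(key_id, threading.Lock())

    def dispose_tree_under_lock(self, key_id, tree_handle, delete_files):
        with self._lock_for(key_id):          # lock FIRST, then dispatch
            if tree_handle == 0 and not delete_files:
                # Cold eviction: remove the pending entry, now serialized
                # with SetCheckpointBarrier / SnapshotAllTreesForCheckpoint.
                self.live_indexes.pop(key_id, None)
                return
            # ... live-tree disposal and (optionally) file deletion here ...
```

The extra acquisition is cheap on this path because, as noted above, the page is being torn down and no shared-lock holders contend for it.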

3. RestoreTree: re-add Debug.Assert for missing data.bftree
-----------------------------------------------------------
Dev had Debug.Assert(File.Exists(restorePath), '...') for the recovered-stub
restore path. The PR's two-branch restore (checkpoint vs flush) was replaced
by a single 'always use data.bftree' path, and the assert was softened to a
LogWarning + return false.

The data.bftree pre-staging invariant ('every above-FUA stub with
TreeHandle=0 must have a valid data.bftree') is load-bearing for the lazy-
restore chain. If pre-staging breaks (in PostCopyToTail-cold, RIPROMOTE
PostCopyUpdater-cold, or OnRecoverySnapshotRead), the current code surfaces
the failure as a key silently returning NOTFOUND. The original Debug.Assert
would have caught it loudly in Debug-build CI.

Fix: restore the Debug-build assert immediately before the LogWarning.
Release builds: behavior unchanged. Debug builds (CI): assertion failure
with stack trace, immediately surfacing any pre-stage chain bug.
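The assert-then-warn ordering maps neatly onto Python's `assert` (stripped under `-O`, much as `Debug.Assert` is compiled out of Release builds). A hedged sketch, with `restore_path` handling and messages purely illustrative:

```python
# Sketch of the restored pattern: a debug-only assertion fires loudly right
# before the soft warning, so a broken pre-stage chain fails fast in CI while
# Release-like builds keep the unchanged warn-and-return-false behavior.
import os

def restore_tree(restore_path, log_warning):
    if not os.path.exists(restore_path):
        # Debug builds: fail fast with a stack trace (like Debug.Assert).
        assert os.path.exists(restore_path), \
            f"pre-staging invariant violated: missing {restore_path}"
        # Release builds (asserts stripped) fall through to the warning.
        log_warning(f"data.bftree missing at {restore_path}; cannot restore")
        return False
    # ... open and restore the tree from restore_path ...
    return True
```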

Skipped from audit
------------------
- Finding 4 (StorageBackend == Disk guard dropped from DeleteTreeFiles
  call): pure defensive-coding loss; DeleteTreeFiles guards via
  File.Exists, memory trees no-op. Not worth a fix.
- Finding 5 (PurgeOldCheckpointSnapshots dropped, cleanup delegated to
  Tsavorite): needs verification that Tsavorite's per-token directory
  cleanup correctly reaches the rangeindex/ subdir. Investigation
  deferred — not a fix.

Verification
------------
  - dotnet format Garnet.slnx --verify-no-changes  -> clean
  - dotnet build test/Garnet.test                  -> 0 warnings, 0 errors
  - 61 RespRangeIndexTests                         -> all pass

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@badrishc badrishc force-pushed the badrishc/bftree-compaction-flush branch from d2adb1b to baa7b40 Compare May 8, 2026 23:47
@badrishc badrishc requested a review from Copilot May 9, 2026 00:14

Copilot AI left a comment


Pull request overview

Copilot reviewed 27 out of 27 changed files in this pull request and generated 8 comments.

Comment on lines 41 to 60
// Insert the new record by CAS'ing either directly into the hash entry or splicing into the readcache/mainlog boundary.
var success = CASRecordIntoChain(newLogicalAddress, ref newLogRecord, ref stackCtx);
if (success)
{
    // We don't call PostInitialWriter here so we must do the size tracking separately.
    hlogBase.logSizeTracker?.UpdateSize(in newLogRecord, add: true);

    // Fire the application-level post-CAS hook BEFORE unsealing dst, so concurrent readers
    // observing dst still see SkipOnScan and retry. The hook may mutate dst's value bytes
    // and/or the source record (similar to RIPROMOTE PostCopyUpdater for live transfers).
    if (storeFunctions.CallPostCopyToTail)
    {
        var srcLogicalAddress = stackCtx.recSrc.HasMainLogSrc
            ? stackCtx.recSrc.LogicalAddress
            : pendingContext.originalAddress;
        storeFunctions.PostCopyToTail(in inputLogRecord, srcLogicalAddress, ref newLogRecord, newLogicalAddress);
    }

    newLogRecord.InfoRef.UnsealAndValidate();
Comment on lines +46 to +69
| `OnTruncate(newBA)` | After device truncate | Delete `<hash>.<addr:x16>.flush.bftree` files where `addr < newBA` plus orphaned `.tmp` files |
| `OnCheckpoint(VersionShift)` | PREPARE→IN_PROGRESS | Set checkpoint barrier; mark all entries (activated AND pending) `SnapshotPending=1` |
| `OnCheckpoint(FlushBegin)` | WAIT_FLUSH | Snapshot trees under exclusive lock; activated → `Tree.SnapshotToFile`; pending → `File.Copy(data.bftree → snapshot path)`; clear barrier |
| `OnCheckpoint(CheckpointCompleted)` | REST | No-op — Tsavorite removes per-token snapshot dirs when `removeOutdated=true`; per-flush files cleaned by `OnTruncate` |
| `OnRecovery(token)` | Before snapshot file recovery | Store recovered checkpoint token (used by `RebuildFromSnapshotIfPending`) |
| `OnRecoverySnapshotRead` | Per record from snapshot file | Set `FlagRecovered`; pre-stage `data.bftree` from `cpr-checkpoints/<token>/rangeindex/<hash>.bftree` and register pending entry (snapshot files may be deleted post-recovery) |
| `OnDispose(Deleted)` | DEL/UNLINK | Free BfTree under exclusive lock; delete all `<hash>.*` files in the riLogRoot |

### File Layout (two roots)

RangeIndex files live under **two roots**, mirroring Tsavorite's separation of log state
(`hlog`) from checkpoint state (`cpr-checkpoints`):

**Log-tied root** — `{LogDir ?? CheckpointDir ?? cwd}/Store/rangeindex/`
Co-located with `hlog.<seg>` when storage tier is enabled. Falls back through the same chain
that Tsavorite uses for `CheckpointBaseDirectory` so RangeIndex works without storage tier.
Lifetime tracks log addresses (cleared by `OnTruncate(newBA)`).

```
{riLogRoot}/
<hash>.data.bftree # bftree's working file
<hash>.data.bftree.tmp # transient (atomic copy intermediate)
<hash>.<addr:x16>.flush.bftree # immutable per-flush snapshot
```
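The `<hash>.<addr:x16>.flush.bftree` naming scheme above makes `OnTruncate(newBA)` cleanup a pure filename scan. A hedged sketch of that scan, assuming the layout shown; `parse_flush_addr`, `files_to_delete`, and the sample names are illustrative, not the actual Garnet helpers:

```python
# Sketch of OnTruncate(newBA) cleanup: per-flush snapshots whose embedded
# 16-hex-digit flush address is below the new BeginAddress are deleted,
# along with any orphaned .tmp intermediates from interrupted atomic copies.
import re

FLUSH_RE = re.compile(r"^(?P<hash>[0-9a-f]+)\.(?P<addr>[0-9a-f]{16})\.flush\.bftree$")

def parse_flush_addr(filename):
    """Return the flush address encoded in the name, or None if not a flush file."""
    m = FLUSH_RE.match(filename)
    return int(m.group("addr"), 16) if m else None

def files_to_delete(filenames, new_begin_address):
    """Select per-flush snapshots below newBA, plus orphaned .tmp files."""
    doomed = []
    for name in filenames:
        addr = parse_flush_addr(name)
        if addr is not None and addr < new_begin_address:
            doomed.append(name)
        elif name.endswith(".bftree.tmp"):
            doomed.append(name)
    return doomed
```

Note that `<hash>.data.bftree` never matches the flush pattern, so the working file survives truncation, which matches the lifetime split described above.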
Comment on lines +2364 to +2388
// Call DisposeTreeUnderLock as OnEvict would, with deleteFiles=false (eviction path).
// With IsTransferred=true, this must no-op — NOT remove the entry that belongs to
// the live record at the tail.
var keyBytes = System.Text.Encoding.ASCII.GetBytes("transtest");
rim.DisposeTreeUnderLock(keyBytes, staleStub, deleteFiles: false);

ClassicAssert.AreEqual(1, rim.LiveIndexCount,
    "DisposeTreeUnderLock on a stale (IsTransferred=true) source must NOT remove the live entry " +
    "that now belongs to the destination at the tail.");

// Live tree should still be functional.
ClassicAssert.AreEqual("value-v1", (string)db.Execute("RI.GET", "transtest", "field-x"));

// Discriminating contrast: a stub WITHOUT IsTransferred (pure pending entry, e.g.
// evicted before activation) WOULD remove the entry. This proves the discriminating
// power of the IsTransferred check.
var pendingStub = new byte[RangeIndexManager.IndexSizeBytes];
// IsTransferred=0, TreeHandle=0 — looks like a pending entry being evicted.
// Use a DIFFERENT key for this part to avoid disturbing the live entry above.
db.Execute("RI.CREATE", "pendkey", "DISK", "CACHESIZE", "65536", "MINRECORD", "8");
ClassicAssert.AreEqual(2, rim.LiveIndexCount, "second tree created");
// Now simulate eviction of a pending-entry stub for "pendkey".
var pendKeyBytes = System.Text.Encoding.ASCII.GetBytes("pendkey");
rim.DisposeTreeUnderLock(pendKeyBytes, pendingStub, deleteFiles: false);
ClassicAssert.AreEqual(1, rim.LiveIndexCount,
Comment on lines +515 to +529
public static void SnapshotToFileByPtr(nint handle, string dataPath, string targetPath)
{
    if (handle == nint.Zero)
        throw new ArgumentException("Native handle is null.", nameof(handle));
    if (string.IsNullOrEmpty(dataPath))
        throw new ArgumentException("dataPath is required.", nameof(dataPath));
    if (string.IsNullOrEmpty(targetPath))
        throw new ArgumentException("targetPath is required.", nameof(targetPath));

    int result = NativeBfTreeMethods.bftree_snapshot(handle);
    if (result != 0)
        throw new InvalidOperationException("Failed to snapshot disk-backed BfTree before file copy.");

    System.IO.File.Copy(dataPath, targetPath, overwrite: true);
}
Comment on lines 40 to +46
| Trigger | When | What it does for RangeIndex |
|---------|------|---------------------------|
| `OnFlush(addr)` | Page moves to read-only | Branch on `stub.TreeHandle`: `!=0` → `BfTreeService.SnapshotToFileByPtr(handle, dataPath, <addr:x16>.flush.bftree)`; `==0` → `File.Copy(data.bftree → <addr>.flush.bftree)`. Set `FlagFlushed`. |
| `OnEvict` | Page evicted past HeadAddress | Free BfTree under exclusive lock via `DisposeTreeUnderLock`; data files preserved for lazy restore |
| `OnDiskRead` | Record loaded from disk | Zero `TreeHandle` (native pointer is stale); no file work |
| `PostCopyToTail` | After `TryCopyToTail` CAS, before unseal | Branch on `src.TreeHandle`: `!=0` live transfer (clear src.TreeHandle); `==0` cold pre-stage (`PreStageAndRegisterPending`). Clear `FlagFlushed` on dst. |
| `OnTruncate(newBA)` | After device truncate | Delete `<hash>.<addr:x16>.flush.bftree` files where `addr < newBA` plus orphaned `.tmp` files |