Pull request overview
Addresses two rare CI “test host process crashed” failures by making Tsavorite allocator resets epoch-safe (preventing AVEs during concurrent scans) and hardening a list stress test against unhandled thread exceptions / Redis timeouts.
Changes:
- Make `AllocatorBase.Reset()` defer page-close and page-free work via `LightEpoch.BumpCurrentEpoch`, and move allocator-specific "free all pages" loops into a new virtual `FreeAllAllocatedPages()`.
- Update `RespListTests.ListPushPopStressTest` to use one `ConnectionMultiplexer` per worker and to capture worker exceptions instead of crashing the process.
- Add a targeted regression test, `VectorCleanupVsResetRaceTests`, reproducing the VectorManager cleanup vs. Reset race.
Reviewed changes
Copilot reviewed 6 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | Reworks Reset to defer close/free work through epoch; adds FreeAllAllocatedPages() hook. |
| libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | Moves page-free loop into FreeAllAllocatedPages() and calls Initialize() after base.Reset(). |
| libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | Same as above for SpanByte allocator. |
| libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | Same as above for TsavoriteLog allocator. |
| test/Garnet.test/RespListTests.cs | Improves stress test robustness: per-thread mux, exception aggregation, timeout handling. |
| test/Garnet.test/VectorCleanupVsResetRaceTests.cs | New race reproducer test for vector cleanup iterator vs. store reset. |
Addressed all three Copilot review comments + a new finding from a GPT-5.5 review (force-pushed in fe77212):
1. …
2. GPT-5.5 follow-up, a race when a concurrent Reset already shifted HeadAddress: GPT-5.5 caught that the wait above was inside …
3. …
4. Both reproducers (…
ListPushPopStressTest host crash and VectorManager cleanup vs Reset() AVE
Two independent rare CI failures, both surfacing as `Test host process
crashed` and aborting the whole test run.
## 1. `ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync` — fatal `AccessViolationException` in `VectorManager` cleanup task
### Symptom
```
Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
at Tsavorite.core.LogRecord.get_Info()
at Tsavorite.core.LogRecord.get_AllocatedSize()
at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed
```
The AVE is a Corrupted-State Exception — `catch (Exception)` in
`RunCleanupTaskAsync` cannot suppress it; the runtime fails fast and the
test host crashes.
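A standalone illustration of that fail-fast behavior (a minimal sketch, not from the Garnet codebase; the bogus address is arbitrary):
```csharp
// On modern .NET an AccessViolationException is never delivered to managed
// handlers: the runtime fails fast before any catch block runs. (Faults in
// the first 64 KB surface as a catchable NullReferenceException instead; a
// large page offset added to a zeroed pagePointers slot lands well above
// that range, so the cleanup task's fault is the fatal kind.)
try
{
    unsafe
    {
        _ = *(int*)0x7000_0000_0000;   // arbitrary unmapped address: fatal AVE
    }
}
catch (Exception)
{
    // Never reached; the test host has already crashed.
}
```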
### Root cause
`Recovery.Reset()` → `hlogBase.Reset()` (in `AllocatorBase` and the
per-allocator overrides `SpanByte` / `Object` / `TsavoriteLog`) frees pages
by synchronously invoking `OnPagesClosed(...)` and a
`for (i in BufferSize) FreePage(i)` loop. Both paths ultimately call
`ReturnPage(index)`, which sets:
```csharp
pageArrays[index] = default;
pagePointers[index] = default; // ★ becomes 0
```
`Reset()`'s docstring promised *"WARNING: assumes that threads have drained
out at this point."* But Garnet's cluster re-attach paths invoke it on a
running store:
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100`
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136`
In both files `storeWrapper.Reset()` is called **before**
`SuspendPrimaryOnlyTasksAsync()`, and even that suspend only drains
`TaskManager` tasks — `VectorManager.cleanupTask` is independent and never
drained.
Once `pagePointers[i] = 0`, the iterator's `GetPhysicalAddress` returns
`0 + offset` — a tiny kernel-page address — and dereferencing it in
`*(RecordInfo*)physicalAddress` raises a fatal AVE.
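In sketch form, the address math that goes wrong (simplified field names; an illustrative sketch, not the verbatim Tsavorite source):
```csharp
// Simplified shape of the scan iterator's in-memory address computation.
// Once Reset() has returned the page, pagePointers[pageIndex] is 0 and the
// "physical" address degenerates to the bare page offset.
long GetPhysicalAddress(long logicalAddress)
{
    var pageIndex = (int)((logicalAddress >> LogPageSizeBits) & (BufferSize - 1));
    var offset = logicalAddress & ((1L << LogPageSizeBits) - 1);
    return pagePointers[pageIndex] + offset;   // 0 + offset after ReturnPage
}

// LogRecord then reads the record header through that value:
//   var info = *(RecordInfo*)physicalAddress;   // ☠ fatal AVE
```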
### The exact interleaving
Production scenario in `MigrateVectorSetWhileModifyingAsync`:
1. Source primary migrates a slot containing a vector set → drops the index → `CleanupDroppedIndex` queues a cleanup-task scan on the source primary.
2. The drop AOF entry replicates to the source's replica, which replays it and **also** queues a cleanup-task scan on the replica.
3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → `ReplicaDisklessSync.ReplicateAttachAsync` / `ReplicaDiskbasedSync.ReplicateAttachAsync` calls `storeWrapper.Reset()`.
4. The replica's cleanup task is still mid-iterate over the main store → AVE.
Thread-level interleaving:
```
Thread A: VectorManager cleanup task Thread B: storeWrapper.Reset()
───────────────────────────────────────── ─────────────────────────────────
loop session.Iterate(callbacks)
PushNext → ObjectScanIterator.GetNext()
epoch.Resume() ◄── enter at epoch E
headAddress = HeadAddress (still old value)
LoadPageIfNeeded(...) (cur >= head → in-mem)
physicalAddress =
pagePointers[pageIdx] + offset
Recovery.Reset()
hlogBase.Reset()
HeadAddress ← TailAddress
OnPagesClosed(...)
FreePage(p)
ReturnPage(p)
pagePointers[p] = 0 ◄── ★
// override loop:
for i in BufferSize:
FreePage(i)
ReturnPage(i)
pagePointers[i] = 0
*(RecordInfo*)physicalAddress ◄── ☠ AVE
(LogRecord.GetInfo /
LogRecord.AllocatedSize)
```
### Why epoch protection didn't catch this
Tsavorite's normal eviction path defers page-freeing through:
```csharp
epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));
```
`BumpCurrentEpoch` queues the action and only fires it after
`SafeToReclaimEpoch` has advanced past the prior epoch — i.e. after every
thread that was holding the prior epoch has either suspended or moved on.
That's why scan iterators are safe against normal eviction.
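The reader side of that contract looks roughly like this (a sketch of the iterator's epoch discipline, not the exact `GetNext` body):
```csharp
// The iterator holds the epoch for the duration of the read. Any free that
// was deferred via BumpCurrentEpoch at epoch E cannot fire until this thread
// suspends or refreshes past E, so the page behind this pointer stays mapped.
epoch.Resume();                                   // enter the current epoch
try
{
    var physicalAddress = GetPhysicalAddress(currentAddress);
    var info = *(RecordInfo*)physicalAddress;     // safe against deferred frees
    // ... materialize the record ...
}
finally
{
    epoch.Suspend();   // leave the epoch; queued deferred actions may now run
}
```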
`Reset()` skipped that mechanism in two places:
1. `AllocatorBase.Reset()` invoked `OnPagesClosed(newBeginAddress)` directly.
2. The per-allocator overrides had a `for (i in BufferSize) FreePage(i)`
loop that ran **after** `base.Reset()` returned — also without epoch
protection. **This second loop is the actual point of failure**: even
if `OnPagesClosed` were deferred, the leftover (tail) page is freed by
the override loop while a reader could still be reading it.
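For contrast, the pre-fix shape of those two places, reconstructed from the description above (not the verbatim old source):
```csharp
// (1) Old AllocatorBase.Reset(): pages closed synchronously, no epoch deferral.
if (headShifted)
    OnPagesClosed(newBeginAddress);      // frees head-shifted pages immediately

// (2) Old per-allocator override, running AFTER base.Reset() returned:
for (int index = 0; index < BufferSize; index++)
    if (IsAllocated(index))
        FreePage(index);                 // ★ the actual point of failure
```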
### The fix (Tsavorite layer)
`AllocatorBase.Reset()` defers ALL page-close + page-free work through
`BumpCurrentEpoch` and waits on a `ManualResetEventSlim` signalled by the
deferred action — no polling:
```csharp
using var resetComplete = new ManualResetEventSlim(initialState: false);
// If caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
epoch.Resume(); // BumpCurrentEpoch requires a protected caller
try
{
epoch.BumpCurrentEpoch(() =>
{
try
{
if (headShifted) OnPagesClosed(newBeginAddress);
FreeAllAllocatedPages();
}
finally { resetComplete.Set(); } // never deadlock if action throws
});
}
finally { epoch.Suspend(); } // unconditionally so the action can fire
resetComplete.Wait();
if (wasProtected) epoch.Resume();
```
Each per-allocator override (`SpanByte` / `Object` / `TsavoriteLog`) moves
its `FreePage(i)` loop into a new `FreeAllAllocatedPages()` virtual so the
loop runs inside the deferred action:
```csharp
public override void Reset() { base.Reset(); Initialize(); }
protected override void FreeAllAllocatedPages()
{
for (int index = 0; index < BufferSize; index++)
if (IsAllocated(index)) FreePage(index);
}
```
### Why this is safe
* The deferred action runs only after `SafeToReclaimEpoch ≥ priorEpoch`,
i.e. after every iterator that was inside `GetNext` at the moment
`Reset()` was called has either suspended or advanced. By the time
`pagePointers[i] = 0` executes, no thread is reading `pagePointers[i]`.
* Iterators that re-enter `GetNext` after `HeadAddress` was shifted see
`currentAddress < headAddress` and route through the buffered disk frame
instead of `pagePointers` — so they don't touch the cleared array.
* `Reset()` blocks until the deferred work has actually run, preserving
its synchronous contract (the override's `Initialize()` after `Reset()`
observes a fully freed page set).
### Test vs. product
Strictly, `Reset()`'s docstring put the burden on callers. The cluster
re-attach paths violate that — they call `Reset()` before draining the
`VectorManager` cleanup task, and `SuspendPrimaryOnlyTasksAsync()` doesn't
cover it. The alternative would be to drain every background reader at
every `Reset()` callsite, but we chose to make `Reset()` itself epoch-safe
because the contract was implicit, callsites are scattered, and Tsavorite
already has the right primitive (`epoch.BumpCurrentEpoch`) — the normal
eviction path uses it. This makes the safety property **enforced** rather
than **assumed**, and protects any future caller / background reader.
### Repro
`test/Garnet.test/VectorCleanupVsResetRaceTests.cs` — adds 4,000 vectors,
drops the set (queues a full-keyspace cleanup scan), then spams
`storeWrapper.Reset()` for 5 s; a condensed sketch follows the results below.
* **Without the fix:** crashes the host on every iteration with the exact
production stack (`LogRecord.get_Info` → `ObjectScanIterator.GetNext` →
`VectorManager.RunCleanupTaskAsync`).
* **With the fix:** all 5 `[Repeat]` iterations pass (~2,700 resets per
  iteration concurrent with the cleanup iterator), no AVE.
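Condensed shape of the reproducer (helper names such as `AddVector` and `DropVectorSet` are illustrative; `storeWrapper.Reset()` is the call named above):
```csharp
[Test, Repeat(5)]
public void VectorCleanupVsResetRace()
{
    for (var i = 0; i < 4_000; i++)
        AddVector(i);                    // populate the vector set

    DropVectorSet();                     // queues the full-keyspace cleanup scan

    var deadline = DateTime.UtcNow.AddSeconds(5);
    while (DateTime.UtcNow < deadline)
        storeWrapper.Reset();            // race the in-flight cleanup iterator
}
```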
## 2. `RespListTests.ListPushPopStressTest` — host crash on rare `RedisTimeoutException`
### Symptom
```
Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed
```
### Root cause (two compounding issues)
1. **Worker threads created via `new Thread(() => ...)` had no try/catch.**
In modern .NET an unhandled exception in a manually-created `Thread`
terminates the process, so a single transient `RedisTimeoutException`
aborted the entire test run.
2. **All 20 sync workers shared a single `ConnectionMultiplexer`.** Every
command went through one socket and one background writer. Under CI
load + lowMemory eviction overhead the writer falls behind and
accumulates queued messages until SyncTimeout (30s) trips. The failure
diagnostics confirmed this: `mc: 1/1, qs: 20, bw: SpinningDown`.
### Fix
* Pre-create one `ConnectionMultiplexer` per worker on the main thread.
Each thread now owns its own socket, eliminating the single-writer
bottleneck. Pre-creating also avoids a 20-way connect storm racing
`ConnectTimeout`.
* Wrap each worker body in try/catch; capture exceptions into a
`ConcurrentBag`, signal stop, exit cleanly. No more host crash.
* Throw the aggregate **before** the post-checks so a real timeout isn't
masked by secondary "list not empty" assertion noise.
* Route the deadline-exceeded path through the failure bag too.
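Condensed sketch of the resulting worker pattern (variable names such as `exceptions`, `cts`, and `muxes` are illustrative, not the exact test code):
```csharp
var exceptions = new ConcurrentBag<Exception>();
using var cts = new CancellationTokenSource();

// One multiplexer per worker, created up front on the main thread:
// no shared socket, and no 20-way connect storm inside the workers.
var muxes = Enumerable.Range(0, workerCount)
                      .Select(_ => ConnectionMultiplexer.Connect(config))
                      .ToArray();

var threads = muxes.Select(mux => new Thread(() =>
{
    try
    {
        var db = mux.GetDatabase();
        while (!cts.IsCancellationRequested)
            db.ListLeftPush(key, value);
    }
    catch (Exception ex)
    {
        exceptions.Add(ex);   // capture instead of crashing the test host
        cts.Cancel();         // tell the other workers to wind down
    }
})).ToArray();

foreach (var t in threads) t.Start();
foreach (var t in threads) t.Join();
foreach (var mux in muxes) mux.Dispose();

// Surface worker failures before any post-checks so a real timeout isn't
// masked by secondary assertion noise.
if (!exceptions.IsEmpty)
    throw new AggregateException(exceptions);
```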
## Files
```
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | 7 ++-
test/Garnet.test/RespListTests.cs | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs | new
```
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>