Commit 8849a5f
Fix two rare CI failures: ListPushPopStressTest host crash and VectorManager cleanup vs Reset() AVE
This commit fixes two independent, rare CI failures, both of which surface as
`Test host process crashed` and abort the whole test run.
## 1. `ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync` — fatal `AccessViolationException` in `VectorManager` cleanup task
### Symptom
```
Passed Garnet.test.cluster.ClusterVectorSetTests.MigrateVectorSetWhileModifyingAsync [12 s]
Fatal error. System.AccessViolationException: Attempted to read or write protected memory.
at Tsavorite.core.LogRecord.get_Info()
at Tsavorite.core.LogRecord.get_AllocatedSize()
at Tsavorite.core.ObjectScanIterator`2[...].GetPhysicalAddressAndAllocatedSize(...)
at Tsavorite.core.ObjectScanIterator`2[...].GetNext()
at Tsavorite.core.TsavoriteKVIterator`6[...].PushNext[...](...)
at Tsavorite.core.TsavoriteKV`2[...].Iterate[...](MainSessionFunctions, ...)
at Garnet.server.VectorManager+<RunCleanupTaskAsync>d__24.MoveNext()
The active test run was aborted. Reason: Test host process crashed
```
The AVE is a Corrupted-State Exception — `catch (Exception)` in
`RunCleanupTaskAsync` cannot suppress it; the runtime fails fast and the
test host crashes.
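For context, here is a minimal stand-alone sketch (hypothetical code, not from this commit) of why a catch-all cannot help: in .NET Core / modern .NET, corrupted-state exceptions bypass managed handlers entirely.

```csharp
using System;

// Hypothetical demo (requires <AllowUnsafeBlocks>): dereferencing a near-null,
// unmapped address raises AccessViolationException. In modern .NET this is a
// corrupted-state exception: no managed catch block runs, the runtime fails
// fast, and the process (here, the test host) dies.
unsafe class CseDemo
{
    static void Main()
    {
        try
        {
            byte* p = (byte*)0x10;  // near-null address, same class of fault as pagePointers[i] = 0
            byte b = *p;            // ☠ AVE — the process terminates here
            Console.WriteLine(b);
        }
        catch (Exception)           // never reached for a corrupted-state exception
        {
            Console.WriteLine("suppressed");  // does not print
        }
    }
}
```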
### Root cause
`Recovery.Reset()` → `hlogBase.Reset()` (in `AllocatorBase` and the
per-allocator overrides `SpanByte` / `Object` / `TsavoriteLog`) frees pages
by synchronously invoking `OnPagesClosed(...)` and a
`for (i in BufferSize) FreePage(i)` loop. Both paths ultimately call
`ReturnPage(index)`, which sets:
```csharp
pageArrays[index] = default;
pagePointers[index] = default; // ★ becomes 0
```
`Reset()`'s docstring promised *"WARNING: assumes that threads have drained
out at this point."* But Garnet's cluster re-attach paths invoke it on a
running store:
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDisklessSync.cs:100`
* `libs/cluster/Server/Replication/ReplicaOps/ReplicaDiskbasedSync.cs:136`
In both files `storeWrapper.Reset()` is called **before**
`SuspendPrimaryOnlyTasksAsync()`, and even that suspend only drains
`TaskManager` tasks — `VectorManager.cleanupTask` is independent and never
drained.
Once `pagePointers[i] = 0`, the iterator's `GetPhysicalAddress` returns
`0 + offset` — a near-null address in the unmapped low pages — and
dereferencing it via `*(RecordInfo*)physicalAddress` raises a fatal AVE.
### The exact interleaving
Production scenario in `MigrateVectorSetWhileModifyingAsync`:
1. Source primary migrates a slot containing a vector set → drops the index → `CleanupDroppedIndex` queues a cleanup-task scan on the source primary.
2. The drop AOF entry replicates to the source's replica, which replays it and **also** queues a cleanup-task scan on the replica.
3. Cluster topology change (post-migration, gossip, or any reason) triggers a replica re-attach → `ReplicaDisklessSync.ReplicateAttachAsync` / `ReplicaDiskbasedSync.ReplicateAttachAsync` calls `storeWrapper.Reset()`.
4. The replica's cleanup task is still mid-iterate over the main store → AVE.
Thread-level interleaving:
```
Thread A: VectorManager cleanup task           Thread B: storeWrapper.Reset()
─────────────────────────────────────────      ─────────────────────────────────
loop session.Iterate(callbacks)
  PushNext → ObjectScanIterator.GetNext()
    epoch.Resume()          ◄── enter at epoch E
    headAddress = HeadAddress (still old value)
    LoadPageIfNeeded(...)   (cur >= head → in-mem)
    physicalAddress =
      pagePointers[pageIdx] + offset
                                               Recovery.Reset()
                                                 hlogBase.Reset()
                                                   HeadAddress ← TailAddress
                                                   OnPagesClosed(...)
                                                     FreePage(p)
                                                       ReturnPage(p)
                                                         pagePointers[p] = 0  ◄── ★
                                                   // override loop:
                                                   for i in BufferSize:
                                                     FreePage(i)
                                                       ReturnPage(i)
                                                         pagePointers[i] = 0
    *(RecordInfo*)physicalAddress  ◄── ☠ AVE
      (LogRecord.GetInfo /
       LogRecord.AllocatedSize)
```
### Why epoch protection didn't catch this
Tsavorite's normal eviction path defers page-freeing through:
```csharp
epoch.BumpCurrentEpoch(() => OnPagesClosed(newAddr));
```
`BumpCurrentEpoch` queues the action and only fires it after
`SafeToReclaimEpoch` has advanced past the prior epoch — i.e. after every
thread that was holding the prior epoch has either suspended or moved on.
That's why scan iterators are safe against normal eviction.
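As a loose analogy (toy code, not Tsavorite's implementation — the real `BumpCurrentEpoch` queues the action and fires it from the epoch machinery rather than spinning), the deferral contract amounts to: readers publish the epoch they entered at, and reclamation waits until no reader still holds an epoch at or below the one current when the free was requested.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;

// Toy model of epoch-deferred reclamation (hypothetical names; Tsavorite's
// real mechanism is callback-based and non-polling).
class ToyEpoch
{
    long globalEpoch;
    readonly ConcurrentDictionary<int, long> readerEpochs = new();

    // A reader announces the epoch it entered at; while registered, anything
    // it could have observed must not be freed.
    public void Enter(int tid) => readerEpochs[tid] = Interlocked.Read(ref globalEpoch);
    public void Exit(int tid) => readerEpochs.TryRemove(tid, out _);

    // Bump the epoch, then run `reclaim` only after every reader that entered
    // at or before the prior epoch has exited.
    public void BumpAndRun(Action reclaim)
    {
        long prior = Interlocked.Increment(ref globalEpoch) - 1;
        while (readerEpochs.Values.Any(e => e <= prior))
            Thread.Yield();   // drain pre-bump readers (toy: spin; real: deferred callback)
        reclaim();            // safe: no reader can still see the old pages
    }
}
```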
`Reset()` skipped that mechanism in two places:
1. `AllocatorBase.Reset()` invoked `OnPagesClosed(newBeginAddress)` directly.
2. The per-allocator overrides had a `for (i in BufferSize) FreePage(i)`
loop that ran **after** `base.Reset()` returned — also without epoch
protection. **This second loop is the actual point of failure**: even
if `OnPagesClosed` were deferred, the leftover (tail) page is freed by
the override loop while a reader could still be reading it.
### The fix (Tsavorite layer)
`AllocatorBase.Reset()` defers ALL page-close + page-free work through
`BumpCurrentEpoch` and waits on a `ManualResetEventSlim` signalled by the
deferred action — no polling:
```csharp
using var resetComplete = new ManualResetEventSlim(initialState: false);

// If the caller was already epoch-protected, our prior epoch is what the action
// will be waiting on — release it before waiting and re-acquire after.
var wasProtected = epoch.ThisInstanceProtected();
if (!wasProtected)
    epoch.Resume(); // BumpCurrentEpoch requires a protected caller
try
{
    epoch.BumpCurrentEpoch(() =>
    {
        try
        {
            if (headShifted) OnPagesClosed(newBeginAddress);
            FreeAllAllocatedPages();
        }
        finally { resetComplete.Set(); } // never deadlock if the action throws
    });
}
finally { epoch.Suspend(); } // unconditionally, so the action can fire

resetComplete.Wait();
if (wasProtected) epoch.Resume();
```
Each per-allocator override (`SpanByte` / `Object` / `TsavoriteLog`) moves
its `FreePage(i)` loop into a new `FreeAllAllocatedPages()` virtual so the
loop runs inside the deferred action:
```csharp
public override void Reset() { base.Reset(); Initialize(); }

protected override void FreeAllAllocatedPages()
{
    for (int index = 0; index < BufferSize; index++)
        if (IsAllocated(index)) FreePage(index);
}
```
### Why this is safe
* The deferred action runs only after `SafeToReclaimEpoch ≥ priorEpoch`,
i.e. after every iterator that was inside `GetNext` at the moment
`Reset()` was called has either suspended or advanced. By the time
`pagePointers[i] = 0` executes, no thread is reading `pagePointers[i]`.
* Iterators that re-enter `GetNext` after `HeadAddress` was shifted see
`currentAddress < headAddress` and route through the buffered disk frame
instead of `pagePointers` — so they don't touch the cleared array.
* `Reset()` blocks until the deferred work has actually run, preserving
its synchronous contract (the override's `Initialize()` after `Reset()`
observes a fully freed page set).
### Test vs. product
Strictly, `Reset()`'s docstring put the burden on callers. The cluster
re-attach paths violate that — they call `Reset()` before draining the
`VectorManager` cleanup task, and `SuspendPrimaryOnlyTasksAsync()` doesn't
cover it. The alternative would be to drain every background reader at
every `Reset()` callsite, but we chose to make `Reset()` itself epoch-safe
because the contract was implicit, callsites are scattered, and Tsavorite
already has the right primitive (`epoch.BumpCurrentEpoch`) — the normal
eviction path uses it. This makes the safety property **enforced** rather
than **assumed**, and protects any future caller / background reader.
### Repro
`test/Garnet.test/VectorCleanupVsResetRaceTests.cs` — adds 4,000 vectors,
drops the set (queues a full-keyspace cleanup scan), then spams
`storeWrapper.Reset()` for 5 s.
* **Without the fix:** crashes the host on every iteration with the exact
production stack (`LogRecord.get_Info` → `ObjectScanIterator.GetNext` →
`VectorManager.RunCleanupTaskAsync`).
* **With the fix:** all 5 `[Repeat]` iterations pass (~2 700 resets per
iteration concurrent with the cleanup iterator), no AVE.
## 2. `RespListTests.ListPushPopStressTest` — host crash on rare `RedisTimeoutException`
### Symptom
```
Unhandled exception. StackExchange.Redis.RedisTimeoutException: Timeout performing LPUSH (30000ms)
at StackExchange.Redis.ConnectionMultiplexer.ExecuteSyncImpl[T](...)
at StackExchange.Redis.RedisDatabase.ListLeftPush(...)
at Garnet.test.RespListTests.<>c__DisplayClass39_1.<ListPushPopStressTest>b__0()
The active test run was aborted. Reason: Test host process crashed
```
### Root cause (two compounding issues)
1. **Worker threads created via `new Thread(() => ...)` had no try/catch.**
In modern .NET an unhandled exception in a manually-created `Thread`
terminates the process, so a single transient `RedisTimeoutException`
aborted the entire test run.
2. **All 20 sync workers shared a single `ConnectionMultiplexer`.** Every
command went through one socket and one background writer. Under CI
load + lowMemory eviction overhead the writer falls behind and
accumulates queued messages until SyncTimeout (30s) trips. The failure
diagnostics confirmed this: `mc: 1/1, qs: 20, bw: SpinningDown`.
### Fix
* Pre-create one `ConnectionMultiplexer` per worker on the main thread.
Each thread now owns its own socket, eliminating the single-writer
bottleneck. Pre-creating also avoids a 20-way connect storm racing
`ConnectTimeout`.
* Wrap each worker body in try/catch; capture exceptions into a
`ConcurrentBag`, signal stop, exit cleanly. No more host crash.
* Throw the aggregate **before** the post-checks so a real timeout isn't
masked by secondary "list not empty" assertion noise.
* Route the deadline-exceeded path through the failure bag too.
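
A condensed sketch of the resulting worker shape (hypothetical names and connection string; the real test has separate push/pop bodies and a deadline): each worker owns its own pre-created multiplexer, catches everything, records the failure, and signals the others to stop instead of crashing the host.

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Threading;
using StackExchange.Redis;

var failures = new ConcurrentBag<Exception>();
using var stop = new CancellationTokenSource();

// One multiplexer per worker, created up front on the main thread: each worker
// gets its own socket and background writer, and there is no 20-way connect
// storm racing ConnectTimeout mid-test.
var connections = Enumerable.Range(0, 20)
    .Select(_ => ConnectionMultiplexer.Connect("localhost:6379"))
    .ToArray();

var workers = connections.Select(conn => new Thread(() =>
{
    try
    {
        var db = conn.GetDatabase();
        while (!stop.IsCancellationRequested)
            db.ListLeftPush("stress-list", "value");
    }
    catch (Exception ex)    // a transient RedisTimeoutException no longer kills the host
    {
        failures.Add(ex);
        stop.Cancel();      // tell the other workers to wind down cleanly
    }
})).ToArray();

foreach (var t in workers) t.Start();
foreach (var t in workers) t.Join();

// Surface worker failures before any post-condition asserts, so a real timeout
// isn't masked by secondary "list not empty" assertion noise.
if (!failures.IsEmpty)
    throw new AggregateException(failures);
```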
## Files
```
libs/storage/Tsavorite/cs/src/core/Allocator/AllocatorBase.cs | 76 +++++++++++++++++++++++--
libs/storage/Tsavorite/cs/src/core/Allocator/ObjectAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/SpanByteAllocatorImpl.cs | 7 ++-
libs/storage/Tsavorite/cs/src/core/Allocator/TsavoriteLogAllocatorImpl.cs | 7 ++-
test/Garnet.test/RespListTests.cs | 124 +++++++++++++++++++++++++--------------
test/Garnet.test/VectorCleanupVsResetRaceTests.cs | new
```
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
6 files changed: 310 additions, 53 deletions.