Skip to content

fix: eliminate Akka.Cluster.Hosting.Tests flakiness (SynchronizationContext leak + TestActor startup race)#744

Merged
Aaronontheweb merged 2 commits into
akkadotnet:devfrom
Aaronontheweb:fix/testkit-synccontext-leak
May 18, 2026
Merged

fix: eliminate Akka.Cluster.Hosting.Tests flakiness (SynchronizationContext leak + TestActor startup race)#744
Aaronontheweb merged 2 commits into
akkadotnet:devfrom
Aaronontheweb:fix/testkit-synccontext-leak

Conversation

@Aaronontheweb
Copy link
Copy Markdown
Member

Summary

Fixes two distinct, compounding sources of flakiness in Akka.Cluster.Hosting.Tests that, post-merge of #735, caused a 5-test cascade failure and an ~8-minute suite runtime on dev.

1. SynchronizationContext leak across sequential xUnit v3 tests (137e9e5)

Akka.TestKit.TestKitBase.InitializeTest unconditionally installs an ActorCellKeepingSynchronizationContext on the current thread. Akka.Hosting.TestKit calls base.InitializeTest from inside a StartActors callback that runs during async host startup. SetSynchronizationContext is per-thread and is not unwound by await, and nothing scrubbed it — so that context leaked out of InitializeAsyncCore and was captured by xUnit v3's CreateTestClassInstance[AkkaCleanAmbientContext].Before.

In a sequentially-run xUnit v3 suite (parallelizeTestCollections: false) every test then inherited the prior (disposed) test's SynchronizationContext, pinning continuations onto a dead ActorCell. For Akka.Cluster.Hosting.Tests this cascaded into cluster-startup timeouts (ClusterJoinFailedException: "Cluster has already been terminated") and blew the suite runtime from ~30s to ~8min.

Fix: bracket the base.InitializeTest call with a synchronous save/restore of SynchronizationContext.Current so the context it installs cannot escape the StartActors delegate; plus a finally around InitializeAsyncCore as defense-in-depth.

2. TestActor terminated by the host-startup actor-creation race (7c88e4e)

Akka.Hosting.TestKit creates its TestActor (an InternalTestActor under /system on the CallingThreadDispatcher) via base.InitializeTest, inside a StartActors callback running during _host.StartAsync() — concurrently with remoting, clustering and other extensions creating their own /system actors. ActorCell.MakeChild runs synchronously on that foreign startup thread, non-atomically with SystemGuardian's own dispatcher thread. The freshly-created TestActor is intermittently, cleanly terminated a few milliseconds later (confirmed via a death-watch probe + actor-tree dump: the TestActor dies while the rest of the system stays healthy). Once dead, every message to it dead-letters and ExpectMsg times out.

This is Akka.Hosting.TestKit-specific — mainline Akka.TestKit creates the TestActor in a quiet constructor, not amid a host-startup storm.

Fix:

  • EnsureTestActorAliveAsync — after host startup completes (system quiet), verify the TestActor survived and re-create it via base.InitializeTest if it did not. Re-creation in the now-quiet system is race-free.
  • ClusterShardingDistributedDataSpecs — join the cluster during host startup via WithActors (matching every other cluster spec) instead of in the test body, so the cluster-formation storm completes within the window EnsureTestActorAliveAsync covers.

Verification

  • Akka.Cluster.Hosting.Tests: 34/34 green across 25 consecutive runs (was ~10-30% flaky, plus the 5-test cascade)
  • Akka.Hosting.TestKit.Tests: 305/305 green
  • Akka.Hosting.Tests: 140/141 green (1 pre-existing skip)
  • Akka.Hosting.API.Tests: 5/5 green (recovery methods are private — no public API change)

Notes / follow-ups

  • The TestActor-termination fix is a deterministic recovery (re-create in the quiet post-startup window), not a prevention. The underlying race — non-atomic ActorCell.MakeChild invoked on a foreign thread racing SystemGuardian lifecycle processing during concurrent /system actor startup — is an upstream Akka.NET concern worth a dedicated akkadotnet/akka.net issue.

Akka.TestKit's TestKitBase.InitializeTest unconditionally installs an
ActorCellKeepingSynchronizationContext on the current thread. Akka.Hosting.TestKit
calls base.InitializeTest from inside a StartActors delegate that runs during
async host startup. SetSynchronizationContext is per-thread and is not unwound by
await, and nothing scrubbed it, so that context leaked out of InitializeAsyncCore
and was captured by xUnit v3's CreateTestClassInstance -> [AkkaCleanAmbientContext].

In a sequentially-run xUnit v3 suite (parallelizeTestCollections: false) this
caused each test to inherit the previous (disposed) test's SynchronizationContext,
pinning continuations onto a dead ActorCell. For Akka.Cluster.Hosting.Tests this
cascaded into cluster-startup timeouts (ClusterJoinFailedException: "Cluster has
already been terminated") and blew the suite runtime from ~30s to ~8min.

Fix:
- Bracket the base.InitializeTest call with a synchronous save/restore of
  SynchronizationContext.Current so the context it installs cannot escape the
  StartActors delegate.
- Restore the entry SynchronizationContext in a finally around InitializeAsyncCore
  as defense-in-depth, guaranteeing a clean context is handed back to xUnit
  regardless of which continuation thread the startup chain returns on.

Verified: Akka.Cluster.Hosting.Tests 34/34 green across 20+ sequential runs
(24-35s), zero cluster-startup cascades; Akka.Hosting.TestKit.Tests 305/305 green.
…race

Akka.Hosting.TestKit creates the TestActor (an InternalTestActor under /system on
the CallingThreadDispatcher) via base.InitializeTest, inside a StartActors callback
that runs during _host.StartAsync() — concurrently with remoting, clustering and
other extensions creating their own /system actors. That concurrent startup storm
intermittently terminates the freshly-created TestActor: it is created successfully,
then cleanly terminated a few milliseconds later (confirmed by death-watch probe and
actor-tree dump — the rest of the system stays healthy, only the TestActor dies).
Once dead, every message sent to it dead-letters and ExpectMsg calls time out.

This is Akka.Hosting.TestKit-specific: mainline Akka.TestKit creates the TestActor in
a quiet constructor, not amid a host-startup storm. The exact Akka-core trigger could
not be pinned without a debugger attached to core internals, so this is a recovery,
not a prevention.

Fix:
- Akka.Hosting.TestKit: after host startup completes (system quiet), EnsureTestActorAliveAsync
  verifies the TestActor survived and re-creates it via base.InitializeTest if it did not.
  Re-creation in the now-quiet system is race-free.
- ClusterShardingDistributedDataSpecs: join the cluster during host startup via WithActors
  (matching every other cluster spec) instead of in the test body, so the cluster-formation
  storm completes within the window EnsureTestActorAliveAsync covers.

Verified: Akka.Cluster.Hosting.Tests 34/34 green across 25 consecutive runs (was
~10-30% flaky, plus a 5-test cascade); Akka.Hosting.TestKit.Tests 305/305 green.
Copy link
Copy Markdown
Member Author

@Aaronontheweb Aaronontheweb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@Aaronontheweb Aaronontheweb merged commit 50a666a into akkadotnet:dev May 18, 2026
2 checks passed
@Aaronontheweb Aaronontheweb deleted the fix/testkit-synccontext-leak branch May 18, 2026 21:28
@Aaronontheweb Aaronontheweb mentioned this pull request May 18, 2026
This was referenced May 19, 2026
This was referenced May 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant