fix: eliminate Akka.Cluster.Hosting.Tests flakiness (SynchronizationContext leak + TestActor startup race) #744
Merged
Aaronontheweb merged 2 commits on May 18, 2026
Akka.TestKit's TestKitBase.InitializeTest unconditionally installs an ActorCellKeepingSynchronizationContext on the current thread. Akka.Hosting.TestKit calls base.InitializeTest from inside a StartActors delegate that runs during async host startup. SetSynchronizationContext is per-thread and is not unwound by await, and nothing scrubbed it, so that context leaked out of InitializeAsyncCore and was captured by xUnit v3's CreateTestClassInstance -> [AkkaCleanAmbientContext].

In a sequentially-run xUnit v3 suite (parallelizeTestCollections: false) this caused each test to inherit the previous (disposed) test's SynchronizationContext, pinning continuations onto a dead ActorCell. For Akka.Cluster.Hosting.Tests this cascaded into cluster-startup timeouts (ClusterJoinFailedException: "Cluster has already been terminated") and blew the suite runtime from ~30s to ~8min.

Fix:
- Bracket the base.InitializeTest call with a synchronous save/restore of SynchronizationContext.Current so the context it installs cannot escape the StartActors delegate.
- Restore the entry SynchronizationContext in a finally around InitializeAsyncCore as defense-in-depth, guaranteeing a clean context is handed back to xUnit regardless of which continuation thread the startup chain returns on.

Verified: Akka.Cluster.Hosting.Tests 34/34 green across 20+ sequential runs (24-35s), zero cluster-startup cascades; Akka.Hosting.TestKit.Tests 305/305 green.
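The save/restore bracket described in this commit can be sketched without any Akka dependency. The snippet below is a minimal illustration, not the code from the PR: `SyncContextBracket`, `LeakyContext`, `InitializeThatInstallsContext` and `RunWithContextRestored` are hypothetical names, with `InitializeThatInstallsContext` standing in for `base.InitializeTest`. Only `SynchronizationContext.Current` and `SynchronizationContext.SetSynchronizationContext` are real .NET APIs.

```csharp
using System;
using System.Threading;

public static class SyncContextBracket
{
    // Hypothetical stand-in for the context that the initializer installs.
    private sealed class LeakyContext : SynchronizationContext { }

    // Hypothetical stand-in for base.InitializeTest: it installs a context
    // on the calling thread and never puts the previous one back.
    public static void InitializeThatInstallsContext()
        => SynchronizationContext.SetSynchronizationContext(new LeakyContext());

    // Runs 'action' synchronously and restores whatever context was current
    // on entry, even if 'action' throws. The save and the restore happen on
    // the same thread because there is no await between them.
    public static void RunWithContextRestored(Action action)
    {
        var entry = SynchronizationContext.Current;
        try
        {
            action();
        }
        finally
        {
            SynchronizationContext.SetSynchronizationContext(entry);
        }
    }
}
```

Calling `SyncContextBracket.RunWithContextRestored(SyncContextBracket.InitializeThatInstallsContext)` leaves `SynchronizationContext.Current` as it was on entry, whereas calling the initializer directly leaves the installed context on the thread. The bracket is kept synchronous on purpose: since the context is per-thread, a restore that ran after an await could execute on a different thread than the one that was modified.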
…race

Akka.Hosting.TestKit creates the TestActor (an InternalTestActor under /system on the CallingThreadDispatcher) via base.InitializeTest, inside a StartActors callback that runs during _host.StartAsync(), concurrently with remoting, clustering and other extensions creating their own /system actors. That concurrent startup storm intermittently terminates the freshly-created TestActor: it is created successfully, then cleanly terminated a few milliseconds later (confirmed by death-watch probe and actor-tree dump: the rest of the system stays healthy, only the TestActor dies). Once dead, every message sent to it dead-letters and ExpectMsg calls time out.

This is Akka.Hosting.TestKit-specific: mainline Akka.TestKit creates the TestActor in a quiet constructor, not amid a host-startup storm. The exact Akka-core trigger could not be pinned without a debugger attached to core internals, so this is a recovery, not a prevention.

Fix:
- Akka.Hosting.TestKit: after host startup completes (system quiet), EnsureTestActorAliveAsync verifies the TestActor survived and re-creates it via base.InitializeTest if it did not. Re-creation in the now-quiet system is race-free.
- ClusterShardingDistributedDataSpecs: join the cluster during host startup via WithActors (matching every other cluster spec) instead of in the test body, so the cluster-formation storm completes within the window EnsureTestActorAliveAsync covers.

Verified: Akka.Cluster.Hosting.Tests 34/34 green across 25 consecutive runs (was ~10-30% flaky, plus a 5-test cascade); Akka.Hosting.TestKit.Tests 305/305 green.
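The recovery step has a simple "check, then re-create" shape, which the sketch below illustrates under stated assumptions. It is not the PR's implementation of EnsureTestActorAliveAsync, whose liveness check is not shown in this description. `TestActorRecovery`, `EnsureAliveAsync`, `isAlive` and `recreate` are hypothetical names: `isAlive` stands in for whatever probe reports that the TestActor survived startup, and `recreate` stands in for the `base.InitializeTest` call.

```csharp
using System;
using System.Threading.Tasks;

public static class TestActorRecovery
{
    // Meant to be called once host startup has completed and the system is quiet.
    // 'isAlive' is a hypothetical liveness probe; 'recreate' is a hypothetical
    // hook that builds a fresh test actor.
    // Returns true if a re-creation was needed, false if the actor had survived.
    public static async Task<bool> EnsureAliveAsync(Func<Task<bool>> isAlive, Action recreate)
    {
        if (await isAlive())
        {
            return false; // survived the startup storm, nothing to do
        }

        recreate(); // recovery path: replace the terminated actor
        return true;
    }
}
```

Because this is a recovery rather than a prevention, the check only helps if it runs after the concurrent startup work has finished, which matches the description's reason for moving the cluster join into host startup.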
Summary

Fixes two distinct, compounding sources of flakiness in Akka.Cluster.Hosting.Tests that, post-merge of #735, caused a 5-test cascade failure and an ~8-minute suite runtime on dev.

1. SynchronizationContext leak across sequential xUnit v3 tests (137e9e5)

Akka.TestKit.TestKitBase.InitializeTest unconditionally installs an ActorCellKeepingSynchronizationContext on the current thread. Akka.Hosting.TestKit calls base.InitializeTest from inside a StartActors callback that runs during async host startup. SetSynchronizationContext is per-thread and is not unwound by await, and nothing scrubbed it, so that context leaked out of InitializeAsyncCore and was captured by xUnit v3's CreateTestClassInstance → [AkkaCleanAmbientContext].Before.

In a sequentially-run xUnit v3 suite (parallelizeTestCollections: false) every test then inherited the prior (disposed) test's SynchronizationContext, pinning continuations onto a dead ActorCell. For Akka.Cluster.Hosting.Tests this cascaded into cluster-startup timeouts (ClusterJoinFailedException: "Cluster has already been terminated") and blew the suite runtime from ~30s to ~8min.

Fix: bracket the base.InitializeTest call with a synchronous save/restore of SynchronizationContext.Current so the context it installs cannot escape the StartActors delegate; plus a finally around InitializeAsyncCore as defense-in-depth.

2. TestActor terminated by the host-startup actor-creation race (7c88e4e)

Akka.Hosting.TestKit creates its TestActor (an InternalTestActor under /system on the CallingThreadDispatcher) via base.InitializeTest, inside a StartActors callback running during _host.StartAsync(), concurrently with remoting, clustering and other extensions creating their own /system actors. ActorCell.MakeChild runs synchronously on that foreign startup thread, non-atomically with SystemGuardian's own dispatcher thread. The freshly-created TestActor is intermittently, cleanly terminated a few milliseconds later (confirmed via a death-watch probe + actor-tree dump: the TestActor dies while the rest of the system stays healthy). Once dead, every message to it dead-letters and ExpectMsg times out.

This is Akka.Hosting.TestKit-specific: mainline Akka.TestKit creates the TestActor in a quiet constructor, not amid a host-startup storm.

Fix:
- EnsureTestActorAliveAsync: after host startup completes (system quiet), verify the TestActor survived and re-create it via base.InitializeTest if it did not. Re-creation in the now-quiet system is race-free.
- ClusterShardingDistributedDataSpecs: join the cluster during host startup via WithActors (matching every other cluster spec) instead of in the test body, so the cluster-formation storm completes within the window EnsureTestActorAliveAsync covers.

Verification

- Akka.Cluster.Hosting.Tests: 34/34 green across 25 consecutive runs (was ~10-30% flaky, plus the 5-test cascade)
- Akka.Hosting.TestKit.Tests: 305/305 green
- Akka.Hosting.Tests: 140/141 green (1 pre-existing skip)
- Akka.Hosting.API.Tests: 5/5 green (recovery methods are private, so no public API change)

Notes / follow-ups

- ActorCell.MakeChild invoked on a foreign thread, racing SystemGuardian lifecycle processing during concurrent /system actor startup, is an upstream Akka.NET concern worth a dedicated akkadotnet/akka.net issue.