Summary
On GitHub Actions `macos-latest` hosted runners, Scalene occasionally hangs at startup when launched as a subprocess with memory profiling enabled. The subprocess produces no stdout/stderr before it is SIGKILL'd by the parent's timeout. The hang has not been reproducible locally on macOS with Python 3.12 (passes 10/10 in <3s each); it occurs only on hosted runners.
Symptom
A subprocess invocation like:
```
python -m scalene run --json --outfile out.json some_script.py
```
never returns. After the parent's timeout (we use 60s in tests), `subprocess.TimeoutExpired` fires and the runner reports `returncode: -9` (SIGKILL). `stdout_seq` is empty in the captured exception, indicating Scalene wrote nothing before hanging.
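The test-side failure mode can be reproduced in miniature with an ordinary child process that writes nothing and never exits (the 1s timeout here stands in for the 60s used in the real tests; the child script is illustrative, not the actual payload):

```python
import subprocess
import sys

# A child that produces no output and never returns, killed by the
# parent's timeout -- the same shape as the hung Scalene subprocess.
child = [sys.executable, "-c", "import time; time.sleep(30)"]
try:
    subprocess.run(child, capture_output=True, timeout=1)
    hung = False
    captured = None
except subprocess.TimeoutExpired as exc:
    hung = True
    # As in the CI hangs: nothing was written before the kill.
    captured = exc.stdout
```

`subprocess.run` kills the child itself when the timeout expires and re-raises `TimeoutExpired` with whatever output was captured, which is why the exception carries an empty stdout here.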
Where it has been observed
- `run-tests (macos-latest, 3.12)`, in `test/test_tracer.py::TestTracerModes::test_legacy_tracer` and `TestFunctionCallHandling::test_function_call_attribution`.
- `run-tests (macos-latest, 3.12)`, same file (`test/test_tracer.py::TestFunctionCallHandling::test_function_call_attribution`); confirms the hang is pre-existing, not introduced by "Propagate `--profile-all` and scope filters to child processes (#1022)" (#1029).
What we ruled out
- Pipe deadlock: `subprocess.run(..., capture_output=True, timeout=...)` drains pipes via `select` internally, so pipe-fill cannot be the cause. Also, the captured stdout was empty; Scalene wrote nothing before the hang.
- The profiled script itself: it only performs `[0] * 10_000_000` allocations and a `print()`, and runs in milliseconds without Scalene.
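The pipe-fill hypothesis in particular is easy to disprove directly: with `capture_output=True`, `subprocess.run` drains stdout and stderr concurrently, so output far larger than the ~64 KiB pipe buffer cannot block the child. A minimal sketch (the child script is illustrative):

```python
import subprocess
import sys

# A child emitting 1 MB on stdout -- far beyond the kernel pipe buffer.
# If run() did not drain the pipe, this would deadlock; it does not.
noisy = [sys.executable, "-c", "import sys; sys.stdout.write('x' * 1_000_000)"]
result = subprocess.run(noisy, capture_output=True, timeout=30)
# result.stdout contains the full output; the child exited cleanly.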
Plausible root causes (none verified)
- A race during `DYLD_INSERT_LIBRARIES` injection of `libscalene.dylib`.
- A `sys.monitoring` setup race on Python 3.12 specifically.
- A native heap hook ↔ GIL deadlock under Python 3.12's adaptive specialization.
- macOS hosted runner kernel quirks around library injection / signal delivery.
Workaround currently in place
Commit 4e3b318 in PR #1029 wraps `subprocess.run` calls in `test/test_tracer.py` with a retry helper that catches `TimeoutExpired`, retries up to 3× at 60s each, and skips the test rather than failing when every attempt hangs. This keeps CI green on transient hangs, but it masks the underlying bug: a user actually running Scalene on macOS could hit the same hang.
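The helper's shape is roughly the following (a hypothetical reconstruction; the name, signature, and defaults are illustrative, not the actual code in `test/test_tracer.py`):

```python
import subprocess

def run_with_retry(cmd, attempts=3, timeout=60):
    """Retry a possibly-hanging subprocess up to `attempts` times.

    Returns the CompletedProcess on the first successful run, or None
    when every attempt times out, so the caller can skip instead of fail.
    """
    for _ in range(attempts):
        try:
            return subprocess.run(cmd, capture_output=True, timeout=timeout)
        except subprocess.TimeoutExpired:
            continue  # the hang is transient on hosted runners; try again
    return None
```

A caller would treat `None` as grounds for `pytest.skip(...)` rather than an assertion failure, which is exactly how this masks the bug.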
Suggested next steps
- Add an opt-in `SCALENE_DEBUG=1` env var that emits async-signal-safe stderr breadcrumbs at each major startup phase (preload init, pywhere import, `sys.monitoring` registration, signal handler install). Next time CI hangs, the logs will pinpoint the stuck phase.
- Or: set up a tmate-enabled CI job that pauses on failure so we can SSH into a hung runner and attach `lldb` / `sample` to the Scalene process.
- If reproducible, the fix likely lives in `src/source/libscalene.cpp` or `scalene/scalene_tracer.py`.
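On the Python side, the breadcrumb idea is a few lines; a minimal sketch, assuming the `SCALENE_DEBUG=1` gate proposed above (the helper name and phase strings are illustrative, nothing here exists in the codebase yet):

```python
import os

def breadcrumb(phase: str) -> None:
    """Emit an async-signal-safe startup breadcrumb on stderr.

    os.write to fd 2 is a single write(2) syscall, which is safe even
    inside signal handlers or early interpreter startup, unlike print()
    or the logging module.
    """
    if os.environ.get("SCALENE_DEBUG") == "1":
        os.write(2, f"[scalene] {phase}\n".encode())

# Sprinkled at each major startup phase:
breadcrumb("preload-init")
breadcrumb("sys.monitoring-registered")
```

The native preload (`libscalene.dylib`) would need an equivalent `write(2, ...)` call in C++, since the hang may occur before the interpreter can run any Python at all.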