Skip profile samples for threads without the GVL since the last sample #5751
eregon wants to merge 7 commits into
Conversation
…ample

Hook RUBY_INTERNAL_THREAD_EVENT_SUSPENDED to record, in a new gvl_suspended_at TLS slot, when a thread last released the GVL. In update_metrics_and_sample, return early without advancing wall_time_at_previous_sample_ns when the thread has been continuously suspended since its previous sample (currently suspended, with the same timestamp as last time).

The accumulated wall-time gets picked up by an extended after-resume sample: on_gvl_running_with_threshold now also returns ON_GVL_RUNNING_SAMPLE when gvl_suspended_at exceeds the threshold, so the existing postponed-job pipeline emits a sample with a near-idle stack covering the whole idle window.

For `bundle exec ruby benchmarks/profiling_sample_overhead.rb` (sleep 10), this skips ~3 per-thread samples per tick (the idle background threads), tracked via the new inactive_thread_samples_skipped counter surfaced in worker_stats.

Gated to Ruby 3.3+ (the existing GVL profiling stack works there without the 3.2 workarounds) and to gvl_profiling_enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
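The skip condition described above ("continuously suspended since its previous sample") can be sketched as a small predicate. This is a minimal illustration, not the actual dd-trace-rb code: the struct and field names are hypothetical stand-ins for the real per-thread context and TLS slot.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical per-thread bookkeeping; names are illustrative, not the
 * actual dd-trace-rb structs. */
typedef struct {
  uint64_t gvl_suspended_at;         /* TLS value read at this sample (0 = not suspended) */
  uint64_t suspended_at_last_sample; /* snapshot taken at the previous sample */
} thread_sample_state;

/* Returns 1 when the sample can be skipped: the thread is currently
 * suspended AND the suspension timestamp is unchanged since the previous
 * sample, i.e. it never re-acquired the GVL in between. */
static int should_skip_sample(const thread_sample_state *st) {
  return st->gvl_suspended_at != 0 &&
         st->gvl_suspended_at == st->suspended_at_last_sample;
}
```

The early return deliberately leaves wall_time_at_previous_sample_ns untouched, so the whole idle window is attributed in one later sample instead of many per-tick ones.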
👋 Hey @eregon, please fill the "Change log entry" section in the pull request description. If changes need to be present in CHANGELOG.md you can state it this way:

**Change log entry**
Yes. A brief summary to be placed into the CHANGELOG.md. (possible answers: Yes/Yep/Yeah)

Or you can opt out like that:

**Change log entry**
None. (possible answers: No/Nope/None)
The skip optimization relies on the after-resume sample to deposit the accumulated wall-time of a long idle period into the profile. If a thread stays suspended across the entire reporting period (e.g. a `sleep` longer than the export interval), RESUMED never fires within that period and the wall-time is missing from the profile, and from every subsequent profile until the thread eventually wakes up.

Fix by adding thread_context_collector_flush_inactive_threads, called from Exporter#flush before pprof_recorder.serialize. It walks the per-thread context map and force-samples any thread still in the would-be-skipped state (gvl_suspended_at > 0 and equal to the stored value), so the accumulated wall-time lands in the active slot before the slot flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
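The flush described above can be sketched as a walk over the per-thread contexts, force-sampling every thread still in the would-be-skipped state. Again a hypothetical sketch: the `ctx` struct, the array-based "map", and the `force_sampled` flag stand in for the real context map and sample emission.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative per-thread context; not the real dd-trace-rb layout. */
typedef struct {
  uint64_t gvl_suspended_at;         /* 0 = not currently suspended */
  uint64_t suspended_at_last_sample; /* snapshot from the previous sample */
  int force_sampled;                 /* stand-in for emitting a catch-up sample */
} ctx;

/* Sketch of flush_inactive_threads: before serialize, emit a sample for
 * each thread still suspended since its previous sample, so a full
 * reporting period of sleep is not lost. Returns how many were flushed. */
static size_t flush_inactive_threads(ctx *threads, size_t count) {
  size_t flushed = 0;
  for (size_t i = 0; i < count; i++) {
    if (threads[i].gvl_suspended_at != 0 &&
        threads[i].gvl_suspended_at == threads[i].suspended_at_last_sample) {
      threads[i].force_sampled = 1; /* real code samples with a near-idle stack */
      flushed++;
    }
  }
  return flushed;
}
```

Running this just before pprof_recorder.serialize means the accumulated wall-time lands in the active slot before the slot flip, matching the fix above.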
Benchmarks

Benchmark execution time: 2026-05-13 19:22:41. Comparing candidate commit fed2240 in PR branch. Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, 1 unstable metric.
On Rubies where the GVL profiling APIs aren't available (NO_GVL_INSTRUMENTATION / USE_GVL_PROFILING_3_2_WORKAROUNDS), the body of _native_flush_inactive_threads is preprocessed out and `instance` goes unused, which `-Werror=unused-parameter` rejects.

Mark it DDTRACE_UNUSED: the attribute is a "may be unused" hint, so it remains valid on Rubies where the parameter is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
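The pattern can be shown with a self-contained stand-in. `MAYBE_UNUSED` here approximates what a macro like DDTRACE_UNUSED typically expands to on GCC/Clang; the function body and the `GVL_PROFILING_AVAILABLE` guard are hypothetical.

```c
#include <assert.h>

/* Stand-in for DDTRACE_UNUSED: a "may be unused" hint, so the build stays
 * clean under -Werror=unused-parameter whether or not the body uses the
 * parameter. */
#if defined(__GNUC__)
  #define MAYBE_UNUSED __attribute__((unused))
#else
  #define MAYBE_UNUSED
#endif

/* On builds where GVL profiling is compiled out, `instance` goes unused;
 * the attribute keeps both variants warning-free. */
static int native_flush_inactive_threads(MAYBE_UNUSED void *instance) {
#ifdef GVL_PROFILING_AVAILABLE
  /* ...real body would dereference instance here... */
  return instance != 0;
#else
  return 0; /* no-op on unsupported Rubies */
#endif
}
```

Because the attribute only permits the parameter to be unused (rather than asserting it is), the same signature compiles on Rubies where the body does use `instance`.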
🎉 All green! ❄️ No new flaky tests detected. 🎯 Code Coverage passing. 🔗 Commit SHA: fed2240
The SUSPENDED internal-thread-event hook can fire very frequently (every
GVL release: sleep, IO, Thread.pass, lock contention, ...). Calling
monotonic_wall_time_now_ns() there — even though it's a vDSO call —
adds up. The skip optimization only needed two pieces of information
out of that timestamp ("did the thread acquire the GVL since the last
sample" and "is it currently suspended"), so a counter is enough.
The TLS slot now holds:
- low bit: current state (1 = suspended, 0 = running)
- bits 1+: monotonic event counter, incremented on every SUSPENDED
or RESUMED for a tracked thread
SUSPENDED sets the bit and bumps the counter; RESUMED clears the bit
and bumps the counter. The sampler reads the word once: the low bit is
the current state, equality with its per-thread snapshot means no
transitions since the last sample. Together that's exactly "did not
have the GVL last sample and did not acquire it since then" — no
wall-time read in the hot path.
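The packed word described above can be sketched with a few pure functions. This is an illustration of the encoding only (hypothetical names, no atomics): the real hooks would update the TLS slot in place.

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of the TLS word: low bit = current state (1 = suspended),
 * bits 1+ = monotonic event counter bumped on every transition. */
enum { STATE_SUSPENDED = 1u };

static uint32_t on_suspended(uint32_t word) {
  /* bump the counter and set the suspended bit */
  return (((word >> 1) + 1) << 1) | STATE_SUSPENDED;
}

static uint32_t on_resumed(uint32_t word) {
  /* bump the counter and clear the suspended bit */
  return ((word >> 1) + 1) << 1;
}

/* The sampler reads the word once: skip iff currently suspended AND no
 * transitions since its per-thread snapshot (word unchanged). */
static int can_skip(uint32_t word, uint32_t snapshot) {
  return (word & STATE_SUSPENDED) && word == snapshot;
}
```

Equality of the whole word covers both conditions at once: a changed counter means the thread acquired the GVL at some point, even if it is suspended again by sampling time.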
The previous duration-based after-resume trigger goes away with the
timestamp (we don't track when the suspension started anymore). To
keep the catch-up sample at end-of-period, per_thread_context gains a
`was_skipped_at_last_sample` flag that flush_inactive_threads now keys
off — it also handles the case where a long-suspended thread RESUMES
just before serialize, which the old timestamp equality check would
have missed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
SUSPENDED and RESUMED alternate for any given thread (the VM never RESUMES a thread without a prior SUSPENDED and vice-versa), so a single counter that bumps on each transition is enough: its low bit naturally encodes the current state (1 after SUSPENDED, 0 after RESUMED). No more explicit state-bit + counter packing.

The test helpers were the one place that didn't honor the invariant: the simulated `on_gvl_running` was called without a prior simulated `on_gvl_suspended`, which would have left the counter parity off by one. Add a matching `_native_on_gvl_suspended` test entry point and call it from the "thread is ready to run again" before-block so the test path mirrors the real lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
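The parity scheme, and the way an unpaired simulated event breaks it, can be shown in a few lines. A hypothetical sketch; the real hooks update a per-thread TLS word.

```c
#include <assert.h>
#include <stdint.h>

/* Pure-counter scheme: SUSPENDED and RESUMED strictly alternate, so one
 * counter bumped on each transition carries the state in its low bit
 * (odd = last event was SUSPENDED, even = last event was RESUMED). */
static uint32_t on_transition(uint32_t counter) { return counter + 1; }

static int is_suspended(uint32_t counter) { return counter & 1u; }
```

The scheme is only sound while the alternation invariant holds, which is exactly why the test helpers needed the matching simulated on_gvl_suspended call.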
…flake

The TLS slot persists across worker instances within the same process, so a spec that exercised the SUSPENDED hook would leave the counter in a "suspended" state for the test-runner thread, and a subsequent spec running without GVL profiling would see counter == per_thread_context's freshly-zeroed snapshot AND the low bit set, and skip almost every sample for that thread. cpu_and_wall_time_worker_spec's "sampling from signal handler" was the visible victim: it expects ~20 samples but only sees ~1.

Clear the slot in initialize_context next to the existing gvl_profiling_state reset, so each worker starts each thread from a known-zero state.

Restore the explicit-state-bit encoding (low bit = current state, bits 1+ = monotonic event counter) so the hooks can correct the state on the very next event after the clear, even for threads that happened to be mid-suspension when initialize_context cleared them; the pure-counter-parity scheme would have ended up off by one in that window. With explicit state bits the simulated test helpers no longer need a matching on_gvl_suspended call before on_gvl_running, so drop that.

Flaky spec passes 8/8 runs after the fix; full thread_context + worker specs green on Ruby 3.2 and 4.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
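The off-by-one window can be demonstrated by contrasting the two encodings after the word is zeroed while a thread is mid-suspension. Hypothetical names; each function models the RESUMED hook under one encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Pure parity: one counter, low bit = state (odd = suspended). A RESUMED
 * event just bumps the counter. */
static uint32_t parity_resumed(uint32_t counter) { return counter + 1; }

/* Explicit bit: low bit = state, bits 1+ = event counter. A RESUMED event
 * bumps the counter and forces the state bit to "running". */
static uint32_t explicit_resumed(uint32_t word) { return ((word >> 1) + 1) << 1; }
```

Starting from a cleared word (0) on a thread that was actually suspended, the next RESUMED leaves the parity scheme claiming "suspended" forever after by one, while the explicit bit is self-correcting on that very first event.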
What does this PR do?
Motivation:
Change log entry
Additional Notes:
How to test the change?