
Skip profile samples for threads without the GVL since the last sample#5751

Draft
eregon wants to merge 7 commits into master from benchmark-sampling-overhead

Conversation

@eregon
Member

@eregon eregon commented May 13, 2026

**What does this PR do?**

**Motivation:**

**Change log entry**

**Additional Notes:**

**How to test the change?**

eregon and others added 2 commits May 13, 2026 14:55
…ample

Hook RUBY_INTERNAL_THREAD_EVENT_SUSPENDED to record when a thread last
released the GVL into a new gvl_suspended_at TLS slot. In
update_metrics_and_sample, return early without advancing
wall_time_at_previous_sample_ns when the thread has been continuously
suspended since its previous sample (currently suspended with the same
timestamp as last time). The accumulated wall-time gets picked up by
an extended after-resume sample: on_gvl_running_with_threshold now
also returns ON_GVL_RUNNING_SAMPLE when gvl_suspended_at exceeds the
threshold, so the existing postponed-job pipeline emits a sample with
a near-idle stack covering the whole idle window.

For `bundle exec ruby benchmarks/profiling_sample_overhead.rb`
(sleep 10), this skips ~3 per-thread samples per tick (the idle
background threads), tracked via the new
inactive_thread_samples_skipped counter surfaced in worker_stats.

Gated to Ruby 3.3+ (the existing GVL profiling stack works there
without the 3.2 workarounds) and to gvl_profiling_enabled.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
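The early-return check described in this commit message can be sketched roughly as follows. This is a minimal illustration, not the actual extension code: the struct layout and the names `thread_sample_state` and `can_skip_sample` are ours; only the `gvl_suspended_at` slot and the "currently suspended with the same timestamp as last time" condition come from the commit message.

```c
#include <stdbool.h>
#include <stdint.h>

// Hypothetical per-thread bookkeeping; field names mirror the commit
// message, but the layout is illustrative.
typedef struct {
  int64_t gvl_suspended_at;             // TLS slot: when the thread last released the GVL (0 = running)
  int64_t suspended_at_previous_sample; // snapshot taken at the previous sample
} thread_sample_state;

// Returns true when the sample can be skipped: the thread is currently
// suspended AND the suspension timestamp is unchanged since the last
// sample, i.e. it never re-acquired the GVL in between.
static bool can_skip_sample(thread_sample_state *state) {
  bool continuously_suspended =
      state->gvl_suspended_at != 0 &&
      state->gvl_suspended_at == state->suspended_at_previous_sample;
  if (!continuously_suspended) {
    // Only advance the snapshot when we actually sample, so the
    // accumulated wall-time stays attributable to the full idle window.
    state->suspended_at_previous_sample = state->gvl_suspended_at;
  }
  return continuously_suspended;
}
```

Note the asymmetry: on a skip, nothing advances, which is what lets the eventual after-resume sample cover the whole idle window.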
@dd-octo-sts
Contributor

dd-octo-sts Bot commented May 13, 2026

👋 Hey @eregon, please fill "Change log entry" section in the pull request description.

If changes need to be present in CHANGELOG.md you can state it this way

**Change log entry**

Yes. A brief summary to be placed into the CHANGELOG.md

(possible answers Yes/Yep/Yeah)

Or you can opt out like that

**Change log entry**

None.

(possible answers No/Nope/None)

Visited at: 2026-05-13 13:00:44 UTC

@dd-octo-sts dd-octo-sts Bot added the profiling Involves Datadog profiling label May 13, 2026

The skip optimization relies on the after-resume sample to deposit the
accumulated wall-time of a long idle period into the profile. If a thread
stays suspended across the entire reporting period (e.g. `sleep` longer
than the export interval), RESUMED never fires within that period and
the wall-time is missing from the profile — and from every subsequent
profile until the thread eventually wakes up.

Fix by adding thread_context_collector_flush_inactive_threads, called
from Exporter#flush before pprof_recorder.serialize. It walks the
per-thread context map and force-samples any thread still in the
would-be-skipped state (gvl_suspended_at > 0 and equal to the stored
value), so the accumulated wall-time lands in the active slot before
the slot flip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
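The flush pass described above can be sketched like this. It is illustrative only: the real collector walks a per-thread hash, not a flat array, and the names `thread_context`, `flush_inactive_threads`, and `force_sample` are ours (only the would-be-skipped condition comes from the commit message).

```c
#include <stdint.h>
#include <stddef.h>

// Illustrative per-thread context (the real one holds much more state).
typedef struct {
  int64_t gvl_suspended_at;             // current TLS value (0 = not suspended)
  int64_t suspended_at_previous_sample; // snapshot from the last sample
} thread_context;

// Tiny stand-in for the real sampling call, so the sketch is testable.
static size_t force_sample_calls = 0;
static void counting_force_sample(thread_context *ctx) {
  (void) ctx;
  force_sample_calls++;
}

// Force-sample every thread still in the would-be-skipped state
// (suspended, with an unchanged timestamp), so its accumulated idle
// wall-time lands in the profile before serialization. Returns how
// many threads were flushed.
static size_t flush_inactive_threads(
    thread_context *contexts, size_t count,
    void (*force_sample)(thread_context *)) {
  size_t flushed = 0;
  for (size_t i = 0; i < count; i++) {
    thread_context *ctx = &contexts[i];
    if (ctx->gvl_suspended_at > 0 &&
        ctx->gvl_suspended_at == ctx->suspended_at_previous_sample) {
      force_sample(ctx);
      flushed++;
    }
  }
  return flushed;
}
```

Running this just before serialize is what guarantees a thread sleeping across the whole reporting period still contributes its wall-time to that period's profile.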
@pr-commenter

pr-commenter Bot commented May 13, 2026

Benchmarks

Benchmark execution time: 2026-05-13 19:22:41

Comparing candidate commit fed2240 in PR branch benchmark-sampling-overhead with baseline commit 52edc02 in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics, with 1 unstable metric.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

On Rubies where the GVL profiling APIs aren't available
(NO_GVL_INSTRUMENTATION / USE_GVL_PROFILING_3_2_WORKAROUNDS), the body
of _native_flush_inactive_threads is preprocessed out and `instance`
goes unused, which `-Werror=unused-parameter` rejects. Mark it
DDTRACE_UNUSED — the attribute is a "may be unused" hint, so it
remains valid on Rubies where the parameter is used.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@datadog-prod-us1-3

datadog-prod-us1-3 Bot commented May 13, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 80.00%
Overall Coverage: 97.07% (-0.09%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: fed2240

eregon and others added 3 commits May 13, 2026 18:19
The SUSPENDED internal-thread-event hook can fire very frequently (every
GVL release: sleep, IO, Thread.pass, lock contention, ...). Calling
monotonic_wall_time_now_ns() there — even though it's a vDSO call —
adds up. The skip optimization only needed two pieces of information
out of that timestamp ("did the thread acquire the GVL since the last
sample" and "is it currently suspended"), so a counter is enough.

The TLS slot now holds:
  - low bit:  current state (1 = suspended, 0 = running)
  - bits 1+:  monotonic event counter, incremented on every SUSPENDED
              or RESUMED for a tracked thread
SUSPENDED sets the bit and bumps the counter; RESUMED clears the bit
and bumps the counter. The sampler reads the word once: the low bit is
the current state, equality with its per-thread snapshot means no
transitions since the last sample. Together that's exactly "did not
have the GVL last sample and did not acquire it since then" — no
wall-time read in the hot path.

The previous duration-based after-resume trigger goes away with the
timestamp (we don't track when the suspension started anymore). To
keep the catch-up sample at end-of-period, per_thread_context gains a
`was_skipped_at_last_sample` flag that flush_inactive_threads now keys
off — it also handles the case where a long-suspended thread RESUMES
just before serialize, which the old timestamp equality check would
have missed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
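The bit-packed TLS word described above can be sketched as follows. The encoding (low bit = state, bits 1+ = counter, both events bump the counter) is taken from the commit message; the function names are ours.

```c
#include <stdbool.h>
#include <stdint.h>

// SUSPENDED: bump the counter (bits 1+) and set the state bit.
static inline uintptr_t on_suspended(uintptr_t word) {
  return (((word >> 1) + 1) << 1) | 1;
}

// RESUMED: bump the counter and clear the state bit.
static inline uintptr_t on_resumed(uintptr_t word) {
  return ((word >> 1) + 1) << 1;
}

static inline bool is_suspended(uintptr_t word) {
  return word & 1;
}

// The sampler reads the word once: equality with its per-thread snapshot
// means no transitions happened since the last sample, and the low bit
// gives the current state. Together: "did not have the GVL at the last
// sample and did not acquire it since" — no wall-time read needed.
static inline bool skip_sample(uintptr_t word, uintptr_t snapshot) {
  return word == snapshot && is_suspended(word);
}
```

Because every transition changes the counter, a thread that suspended and resumed between two samples can never alias back to the old snapshot value.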
SUSPENDED and RESUMED alternate for any given thread (the VM never
RESUMES a thread without a prior SUSPENDED and vice-versa), so a single
counter that bumps on each transition is enough: its low bit naturally
encodes the current state (1 after SUSPENDED, 0 after RESUMED). No more
explicit state-bit + counter packing.

The test helpers were the one place that didn't honor the invariant —
the simulated `on_gvl_running` was called without a prior simulated
`on_gvl_suspended`, which would have left the counter parity off by
one. Add a matching `_native_on_gvl_suspended` test entry point and
call it from the "thread is ready to run again" before-block so the
test path mirrors the real lifecycle.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
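The parity-only simplification this commit describes reduces to a bare counter. A minimal sketch (names ours): since SUSPENDED and RESUMED strictly alternate and the counter starts at 0 in the running state, an odd count means "currently suspended".

```c
#include <stdbool.h>
#include <stdint.h>

// One counter, bumped on every SUSPENDED or RESUMED transition.
static inline uintptr_t on_transition(uintptr_t counter) {
  return counter + 1;
}

// Low bit encodes the state for free: 1 after SUSPENDED, 0 after RESUMED,
// relying on the VM's strict alternation of the two events.
static inline bool parity_is_suspended(uintptr_t counter) {
  return counter & 1;
}
```

As the commit notes, this only works if every consumer honors the alternation invariant — which is exactly why the test helpers needed a matching simulated `on_gvl_suspended` call.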
…flake

The TLS slot persists across worker instances within the same process,
so a spec that exercised the SUSPENDED hook would leave the counter in
a "suspended" state for the test-runner thread, and a subsequent spec
running without GVL profiling would see counter == per_thread_context's
freshly-zeroed snapshot AND the low bit set, and skip almost every
sample for that thread. cpu_and_wall_time_worker_spec's "sampling from
signal handler" was the visible victim: it expects ~20 samples but
only sees ~1.

Clear the slot in initialize_context next to the existing
gvl_profiling_state reset, so each worker starts each thread from a
known-zero state.

Restore the explicit-state-bit encoding (low bit = current state,
bits 1+ = monotonic event counter) so the hooks can correct the state
on the very next event after the clear, even for threads that happened
to be mid-suspension when initialize_context cleared them — the pure-
counter-parity scheme would have ended up off by one in that window.
With explicit state bits the simulated test helpers no longer need a
matching on_gvl_suspended call before on_gvl_running, so drop that.

Flaky spec passes 8/8 runs after the fix; full thread_context + worker
specs green on Ruby 3.2 and 4.0.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
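The off-by-one window this commit fixes can be made concrete with a small sketch (function names ours; the encoding is the explicit-state-bit one from the commit message). If the word is cleared to zero while a thread is mid-suspension, the next RESUMED event still writes the correct state under the explicit-bit encoding, whereas a pure parity counter would now read the state inverted.

```c
#include <stdbool.h>
#include <stdint.h>

// Explicit-state-bit encoding: low bit = current state, bits 1+ = counter.
static inline uintptr_t resumed(uintptr_t w) {
  return ((w >> 1) + 1) << 1;         // bump counter, clear state bit
}
static inline uintptr_t suspended(uintptr_t w) {
  return (((w >> 1) + 1) << 1) | 1;   // bump counter, set state bit
}
static inline bool is_susp(uintptr_t w) {
  return w & 1;
}
```

Walking through the clear-mid-suspension scenario: the word is reset to 0 by initialize_context while the thread is actually suspended; when the thread wakes, RESUMED writes a word whose low bit correctly says "running". A parity-only counter reset to 0 would instead flip to odd on that same event and report "suspended" from then on, one transition out of phase.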

Labels

profiling Involves Datadog profiling


1 participant