
ci: enable performance quality gates #5571

Draft
igoragoli wants to merge 15 commits into augusto/add-perf-quality-gate-dd-octo-sts-policy from augusto/enable-perf-quality-gates

Conversation

@igoragoli
Contributor

@igoragoli igoragoli commented Apr 9, 2026

What does this PR do?

Enables pre-release performance quality gates on dd-trace-rb.

  • Microbenchmarks: microbenchmarks-check-big-regressions job (20% threshold via fail_on_regression)
  • Macrobenchmarks: macrobenchmarks-check-slo-breaches + macrobenchmarks-notify-slo-breaches jobs with SLO thresholds via fail_on_breach
    • 36 scenarios, 66 thresholds
    • normal_operation: p50/p99 latency
    • high_load: throughput
    • utilization monitors: CPU% and RSS
    • baseline scenarios excluded (not actionable)
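
For orientation, the macrobenchmark gate sketched as a minimal `.gitlab-ci.yml` job (the job and stage names come from this PR; the `bp-runner` invocation is an assumption, since the real job is included from a benchmarking-platform-tools template):

```yaml
# Illustrative sketch only; the actual job is provided by the
# check-slo-breaches template in benchmarking-platform-tools.
macrobenchmarks-check-slo-breaches:
  stage: macrobenchmarks-gates
  allow_failure: true            # non-blocking until thresholds are validated
  script:
    - bp-runner fail_on_breach   # assumed CLI shape; the PR only names the step
```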

Motivation:

Catch performance regressions before release. Aligns dd-trace-rb with dd-trace-go and dd-trace-py.

Change log entry

None.

Additional Notes:

SLO generation:

  • Generated with benchmark_analyzer generate slos --strategy tight --significant-impact-threshold 0.10 (T=10%)
  • Source: single pipeline run of all 8 macrobenchmark configurations
  • One RSS threshold manually bumped (high_load--profiling-and-tracing-and-appsec--puma-utilization: 2.73 GB → 3.25 GB) due to cross-run variance
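
For illustration, the manually adjusted entry might look roughly like this (the file layout and key names are hypothetical; only the scenario name and the two values come from the PR):

```yaml
# Hypothetical SLO file shape; the real format is whatever
# `benchmark_analyzer generate slos` emits.
high_load--profiling-and-tracing-and-appsec--puma-utilization:
  rss:
    max_gb: 3.25   # bumped from the generated 2.73 GB to absorb cross-run variance
```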

Quality gates setup:

  • All gate jobs use allow_failure: true until thresholds are validated
  • Slack notifications go to apm-dcs-performance-alerts (TODO: switch to #guild-dd-ruby)
  • tracing-and-appsec macrobenchmark produced no k6 results, so it has no SLO thresholds yet
  • Depends on ci: add dd-octo-sts policy for GitLab SLO change tracking #5570
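
In `.gitlab-ci.yml` terms, the notification job might be wired up roughly like this (the variable name is an assumption; the channel and the `allow_failure` setting come from the PR):

```yaml
# Sketch; the real job comes from the notify-slo-breaches template.
macrobenchmarks-notify-slo-breaches:
  stage: macrobenchmarks-notify
  allow_failure: true
  variables:
    SLACK_CHANNEL: apm-dcs-performance-alerts  # TODO: switch to #guild-dd-ruby
```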

How to test the change?

The CI pipeline validates that the gate jobs run correctly after the benchmarks complete.

Add macrobenchmarks-gates and macrobenchmarks-notify stages. Include
check-slo-breaches and notify-slo-breaches templates from
benchmarking-platform-tools. Add placeholder check-slo-breaches job
that depends on all 8 macrobenchmark jobs.

Temporarily set macrobenchmarks to auto-trigger on all branches to
collect baseline artifacts for SLO threshold generation.
Adds a quality gate that fails on microbenchmark regressions exceeding
20%. Uses bp-runner fail_on_regression step from benchmarking-platform.
Runs after microbenchmarks with when: always to catch failures too.
Set to allow_failure: true until thresholds are validated.
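As a sketch, the microbenchmark gate described in this commit might look like the following (everything beyond the job name is an assumption):

```yaml
microbenchmarks-check-big-regressions:
  needs: [microbenchmarks]           # assumed upstream job name
  when: always                       # run even if the microbenchmarks job failed
  allow_failure: true                # non-blocking until thresholds are validated
  script:
    - bp-runner fail_on_regression   # assumed CLI shape; 20% threshold lives in bp-runner config
```

Note that a later commit in this PR moves `when: always` into `rules:`.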
@github-actions

github-actions bot commented Apr 9, 2026

Thank you for updating Change log entry section 👏

Visited at: 2026-04-09 14:50:40 UTC

@igoragoli igoragoli changed the title from "ci: scaffold macrobenchmark quality gates and auto-trigger benchmarks" to "ci: enable performance quality gates" on Apr 9, 2026
Contributor Author

Warning

This pull request is not mergeable via GitHub because a downstack PR is open. Once all requirements are satisfied, merge this PR as a stack on Graphite.

This stack of pull requests is managed by Graphite. Learn more about stacking.

@igoragoli igoragoli added the "AI Generated" label (largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on Apr 9, 2026
@pr-commenter

pr-commenter bot commented Apr 9, 2026

Benchmarks

Benchmark execution time: 2026-04-10 11:41:04

Comparing candidate commit d1e7605 in PR branch augusto/enable-perf-quality-gates with baseline commit 1595023 in branch augusto/add-perf-quality-gate-dd-octo-sts-policy.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 45 metrics; 1 metric was unstable.

Explanation

This is an A/B test comparing a candidate commit's performance against that of a baseline commit. Performance changes are noted in the tables below as:

  • 🟩 = significantly better candidate vs. baseline
  • 🟥 = significantly worse candidate vs. baseline

We compute a confidence interval (CI) over the relative difference of means between metrics from the candidate and baseline commits, considering the baseline as the reference.

If the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD), the change is considered significant.
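
In symbols: if the confidence interval over the relative difference of means is $[\ell, u]$ and the configured threshold is $T$, the change is flagged as significant exactly when the whole interval clears the threshold on one side:

```math
\ell > T \quad \text{or} \quad u < -T
```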

Feel free to reach out to #apm-benchmarking-platform on Slack if you have any questions.

More details about the CI and significant changes

You can imagine this CI as a range of values that is likely to contain the true difference of means between the candidate and baseline commits.

CIs of the difference of means are often centered around 0%, because often changes are not that big:

---------------------------------(------|---^--------)-------------------------------->
                              -0.6%    0%  0.3%     +1.2%
                                 |          |        |
         lower bound of the CI --'          |        |
sample mean (center of the CI) -------------'        |
         upper bound of the CI ----------------------'

As described above, a change is considered significant if the CI is entirely outside the configured SIGNIFICANT_IMPACT_THRESHOLD (or the deprecated UNCONFIDENCE_THRESHOLD).

For instance, for an execution time metric, this confidence interval indicates a significantly worse performance:

----------------------------------------|---------|---(---------^---------)---------->
                                       0%        1%  1.3%      2.2%      3.1%
                                                  |   |         |         |
       significant impact threshold --------------'   |         |         |
                      lower bound of CI --------------'         |         |
       sample mean (center of the CI) --------------------------'         |
                      upper bound of CI ----------------------------------'

Replace check-slo-breaches placeholder with real fail_on_breach
implementation. Add notify-slo-breaches job to alert on
apm-dcs-performance-alerts. Generate 209 SLO thresholds across
42 scenarios using tight strategy (T=5%).

Revert macrobenchmarks to manual trigger on non-master branches.
Move microbenchmarks before macrobenchmarks so macro gates and notify
stages are adjacent. Restrict check-slo-breaches and notify-slo-breaches
to master only since non-master branches use manual macrobenchmarks.
@igoragoli igoragoli force-pushed the augusto/enable-perf-quality-gates branch from 2e12e39 to efb574d on April 9, 2026 at 17:38
Drop rules: block from check-slo-breaches and notify-slo-breaches.
GitLab ignores top-level when: when rules: is present. Follow
dd-trace-py pattern: use when: always with no rules.
Use rules: with when: always on master, default on_success on branches.
Remove conflicting top-level when: always which GitLab ignores when
rules: is present.
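The resulting pattern, sketched (the branch condition is an assumed rendering of the master/non-master split the commit describes):

```yaml
check-slo-breaches:
  rules:
    - if: '$CI_COMMIT_BRANCH == "master"'
      when: always        # always gate on master
    - when: on_success    # default behavior on other branches
```

Per the commit message, a job-level `when:` alongside `rules:` is ignored by GitLab, so the `when:` lives inside each rule instead.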
@igoragoli igoragoli force-pushed the augusto/enable-perf-quality-gates branch from 0568f25 to c866a4b on April 10, 2026 at 08:29
Remove baseline scenarios (not actionable). Keep only:
- normal_operation: agg_http_req_duration p50/p99
- high_load: throughput
- utilization monitors: cpu_usage_percentage, rss

Drop data_received, data_sent, dropped_iterations, http_req_duration.
Reduces from 209 to 66 thresholds across 36 scenarios.
@igoragoli igoragoli force-pushed the augusto/enable-perf-quality-gates branch from c866a4b to c3caecc on April 10, 2026 at 08:31
Fix macrobenchmarks-notify-slo-breaches referencing wrong job name.
Move when: always into rules for microbenchmarks-check-big-regressions
since GitLab ignores top-level when: when rules: is present.
Single-run SLO generation produced a tight RSS threshold (2.73 GB)
that doesn't account for cross-run variance. Bump to 3.25 GB based
on observed values across multiple runs.
