Skip to content

Fix flaky: shutdown 'executes only once'#5585

Draft
p-datadog wants to merge 1 commit intomasterfrom
fix-flaky-shutdown-executes-only-once
Draft

Fix flaky: shutdown 'executes only once'#5585
p-datadog wants to merge 1 commit intomasterfrom
fix-flaky-shutdown-executes-only-once

Conversation

@p-datadog
Copy link
Copy Markdown
Member

@p-datadog p-datadog commented Apr 11, 2026

What does this PR do?

Fixes flaky test spec/datadog/tracing/integration_spec.rb:592 — the shutdown "executes only once" integration test.

Motivation:

Flaky test reported in PR #5581. The test intermittently fails with traces_flushed: 0 under CI load.

Failure: The assertion expect(stats).to include(traces_flushed: 1) fails with traces_flushed: 0.

Root cause: The test asserts traces_flushed: 1 immediately after concurrent shutdown! calls join. shutdown! calls Writer#stop, which calls AsyncTransport#stop, which joins the worker thread with a 1-second timeout (DEFAULT_SHUTDOWN_TIMEOUT). When the HTTP round-trip to the agent takes longer than 1 second under CI load, the join times out. The worker thread is still alive completing the HTTP call, but @traces_flushed hasn't been incremented yet. The assertion fires and sees 0.

Fix: Add try_wait_until { tracer.writer.stats[:traces_flushed] >= 1 } AFTER the shutdown threads join, not before them. This preserves the test's coverage: the concurrent shutdowns still race with an in-flight flush (the buffer has data when shutdowns begin), so the shutdown guard and worker stop logic are exercised under real contention. The wait just gives the orphaned worker thread time to complete its HTTP call before the assertion checks final stats.

This is deliberately different from PR #5581's approach (which waits BEFORE shutdowns). Waiting before drains the buffer, so all shutdown flushes are no-ops — making traces_flushed: 1 tautological regardless of whether the shutdown guard works. Waiting after preserves the race between shutdowns and the flush.

Change log entry

None.

How to test the change?

Integration test validated via the three-PR reproducer/fix/validation pattern:

Root cause: the test asserts traces_flushed: 1 immediately after
concurrent shutdown! calls join. When the HTTP round-trip to the agent
takes longer than DEFAULT_SHUTDOWN_TIMEOUT (1s), the worker thread's
join times out and the flush is still in-progress. The assertion sees
traces_flushed: 0 and fails.

Wait for traces_flushed >= 1 AFTER the shutdown threads join, not
before. This preserves the test's coverage: the concurrent shutdowns
still race with an in-flight flush (the buffer has data when shutdowns
begin), so the shutdown guard and worker stop logic are exercised under
contention. The wait just gives the orphaned worker thread time to
complete its HTTP call before the assertion checks final stats.

Co-Authored-By: Claude <noreply@anthropic.com>
@p-datadog p-datadog added the AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos label Apr 11, 2026
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Apr 11, 2026
@datadog-official
Copy link
Copy Markdown

datadog-official bot commented Apr 11, 2026

✅ Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 95.35% (-0.02%)

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 0e5fc33 | Docs | Datadog PR Page | Was this helpful? React with 👍/👎 or give us feedback!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

AI Generated Largely based on code generated by an AI or LLM. This label is the same across all dd-trace-* repos dev/testing Involves testing processes (e.g. RSpec)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants