Fix flaky: DSM Kafka consumer backlogs serialization #5760
Open
p-datadog wants to merge 1 commit into
Root cause: the auto-spawned background worker thread races with the test thread. Processor.new schedules a worker that runs perform_loop's first iteration immediately (loop_wait_before_first_iteration? is false). If the OS schedules the worker after the first track_kafka_consume but before the second, the worker drains @event_buffer (consuming event1 into @consumer_stats) and flush_stats then clears @consumer_stats, so the test's later process_events only sees event2 and serialize_consumer_backlogs returns 1 instead of 2 (the issue's exact symptom: expected 2, got 1).

Stop the worker at the start of the test so the test owns the buffer and @consumer_stats lifecycle synchronously. Mirrors the fix applied in PR #5715 (7668375) for the adjacent #flush_stats test, which has the same race class.

Verified locally: reproducer (sleep 0.5 between pushes) fails with expected 2, got 1; with this fix the same reproducer passes.

Fixes ruby-guild#281.

Co-Authored-By: Claude <noreply@anthropic.com>
This was referenced May 13, 2026
🎉 All green! ❄️ No new flaky tests detected. 🔗 Commit SHA: 5733f1b
What does this PR do?
Fixes flaky test spec/datadog/data_streams/processor_spec.rb:294 (Datadog::DataStreams::Processor Kafka tracking methods #track_kafka_consume serializes consumer backlogs with type:kafka_commit tag) by stopping the auto-spawned background worker thread at the start of the test.
Motivation:
Flaky test reported in ruby-guild#281.
First-attempt failure on Ruby 3.2 / build & test (standard) [1];
attempt 2 (rerun) passed.
Failure:
Root cause: the auto-spawned background worker thread races with the test thread. Processor.new schedules a worker that runs perform_loop's first iteration immediately (loop_wait_before_first_iteration? is false). If the OS schedules the worker after the first track_kafka_consume but before the second:

1. Test: the first track_kafka_consume pushes event1 into @event_buffer.
2. Worker: process_events → drains [event1] → @consumer_stats = [event1].
3. Worker: flush_stats → executes @consumer_stats.clear at processor.rb:313 → @consumer_stats = [].
4. Test: the second track_kafka_consume pushes event2; process_events → drains [event2] → @consumer_stats = [event2].
5. serialize_consumer_backlogs returns one entry → expected 2, got 1.

Fix: Stop the worker at the start of the test so the test owns the buffer and @consumer_stats lifecycle synchronously. Mirrors the fix applied in PR #5715 (commit 76683752f7) for the adjacent #flush_stats test, which has the same race class.

Change log entry
None.
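
For reviewers who want to see the interleaving from the Root cause section in isolation, here is a minimal single-threaded sketch. ToyProcessor, worker_iteration, and backlog_entries are hypothetical names invented for illustration; this is not the real Datadog::DataStreams::Processor, and it assumes only what the description states: the worker's iteration drains the buffer into @consumer_stats and then clears it.

```ruby
# Toy stand-in for the processor (hypothetical; NOT the real dd-trace-rb class).
class ToyProcessor
  attr_reader :consumer_stats

  def initialize
    @event_buffer = []
    @consumer_stats = []
  end

  # The test thread pushes events into the buffer.
  def track_kafka_consume(event)
    @event_buffer << event
  end

  # Drain everything currently buffered into @consumer_stats.
  def process_events
    @consumer_stats.concat(@event_buffer.shift(@event_buffer.size))
  end

  # Clearing the stats is the only part of a flush modelled here.
  def flush_stats
    @consumer_stats.clear
  end

  # Assumed shape of one background-worker loop iteration.
  def worker_iteration
    process_events
    flush_stats
  end
end

# Replays the test body, optionally letting the worker run between the pushes.
def backlog_entries(worker_interleaves:)
  processor = ToyProcessor.new
  processor.track_kafka_consume(:event1)
  processor.worker_iteration if worker_interleaves # the unlucky scheduling
  processor.track_kafka_consume(:event2)
  processor.process_events
  processor.consumer_stats.size
end

puts backlog_entries(worker_interleaves: true)  # one entry: the flaky outcome
puts backlog_entries(worker_interleaves: false) # two entries: what the test expects
```

With the worker out of the picture, only the test calls process_events, so both events are still present when the backlogs are serialized.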
How to test the change?
The reproducer is in a companion PR; that PR's CI shows the race-forcing variant (sleep 0.5 between pushes) failing deterministically. The validation companion PR merges this fix with the reproducer and CI passes, proving the fix neutralizes the forced failure.
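
The race-forcing idea can also be tried outside the repository with a threaded toy. The sketch below is an assumption-laden illustration, not the companion PR's code: ThreadedToyProcessor, its stop method, the 0.01 s loop interval, and the 0.2 s pause are all hypothetical, and the real processor's worker API may differ.

```ruby
# Toy processor that auto-spawns a background worker (hypothetical model only).
class ThreadedToyProcessor
  attr_reader :consumer_stats

  def initialize(interval: 0.01)
    @lock = Mutex.new
    @event_buffer = []
    @consumer_stats = []
    @running = true
    # The worker starts immediately, without waiting before its first iteration.
    @worker = Thread.new do
      while @running
        process_events
        flush_stats
        sleep interval
      end
    end
  end

  # Stop the worker and wait until it has really exited.
  def stop
    @running = false
    @worker.join
  end

  def track_kafka_consume(event)
    @lock.synchronize { @event_buffer << event }
  end

  def process_events
    @lock.synchronize { @consumer_stats.concat(@event_buffer.shift(@event_buffer.size)) }
  end

  def flush_stats
    @lock.synchronize { @consumer_stats.clear }
  end
end

# Replays the test body with a pause between pushes to widen the race window.
def forced_race_count(stop_worker_first:)
  processor = ThreadedToyProcessor.new
  processor.stop if stop_worker_first # the fix: the test owns the lifecycle
  processor.track_kafka_consume(:event1)
  sleep 0.2 # gives a running worker ample time to drain and flush event1
  processor.track_kafka_consume(:event2)
  processor.process_events
  count = processor.consumer_stats.size
  processor.stop
  count
end
```

When the worker is stopped first, forced_race_count returns 2. When it is left running, the worker flushes event1 during the pause, so the count comes back below 2 (1, or 0 if the worker also wins the race for event2).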
Companion PRs:
Validation results:
expected 2, got 1) on this exact test. Closed.