Fix flaky: DSM Kafka consumer backlogs serialization #5760

Open
p-datadog wants to merge 1 commit into master from fix-flaky-dsm-kafka-281

@p-datadog p-datadog commented May 13, 2026

What does this PR do?

Fixes flaky test spec/datadog/data_streams/processor_spec.rb:294
(Datadog::DataStreams::Processor Kafka tracking methods #track_kafka_consume serializes consumer backlogs with type:kafka_commit tag)
by stopping the auto-spawned background worker thread at the start of the
test.

Motivation:

Flaky test reported in ruby-guild#281.
First-attempt failure on Ruby 3.2 / build & test (standard) [1];
attempt 2 (rerun) passed.

Failure:

expect(backlogs.length).to eq(2)
  expected: 2
       got: 1

Root cause: the auto-spawned background worker thread races with the
test thread. Processor.new schedules a worker that runs perform_loop's
first iteration immediately (loop_wait_before_first_iteration? is false).
If the OS schedules the worker after the first track_kafka_consume but
before the second:

  1. Test pushes event1 to @event_buffer.
  2. Worker wakes up, calls process_events → drains [event1] →
     @consumer_stats = [event1].
  3. Worker continues to flush_stats → executes @consumer_stats.clear at
     processor.rb:313 → @consumer_stats = [].
  4. Test pushes event2.
  5. Test calls process_events → drains [event2] →
     @consumer_stats = [event2].
  6. serialize_consumer_backlogs returns one entry → expected 2, got 1.
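
The interleaving above can be replayed without threads by calling the worker's steps by hand at the point where the OS would have scheduled it. The sketch below is only an illustration: ToyProcessor is a hypothetical stand-in that borrows the method names from this description, and its bodies, the topic name, and the offsets are assumptions, not the dd-trace-rb implementation.

```ruby
# Hypothetical stand-in for the processor. Method names follow the PR
# description; the bodies are simplified assumptions for illustration.
class ToyProcessor
  def initialize
    @event_buffer = []
    @consumer_stats = []
  end

  # Buffer one consume event (steps 1 and 4).
  def track_kafka_consume(topic, partition, offset)
    @event_buffer << { topic: topic, partition: partition, offset: offset }
  end

  # Drain every buffered event into @consumer_stats (steps 2 and 5).
  def process_events
    @consumer_stats.concat(@event_buffer.shift(@event_buffer.length))
  end

  # Flushing clears the accumulated stats (step 3).
  def flush_stats
    @consumer_stats.clear
  end

  # One serialized entry per accumulated stat (step 6).
  def serialize_consumer_backlogs
    @consumer_stats.map do |stat|
      { tags: ["type:kafka_commit", "topic:#{stat[:topic]}"], value: stat[:offset] }
    end
  end
end

# Replays the test body; optionally runs the worker's iteration between
# the two pushes, which is the schedule that makes the test fail.
def run_interleaving(worker_runs_between_pushes:)
  processor = ToyProcessor.new
  processor.track_kafka_consume("orders", 0, 100)
  if worker_runs_between_pushes
    processor.process_events # worker drains event1
    processor.flush_stats    # worker clears @consumer_stats
  end
  processor.track_kafka_consume("orders", 1, 200)
  processor.process_events   # test thread drains what is left
  processor.serialize_consumer_backlogs.length
end

puts run_interleaving(worker_runs_between_pushes: false) # prints 2
puts run_interleaving(worker_runs_between_pushes: true)  # prints 1
```

With the worker's iteration placed between the two pushes, only event2 survives to serialization, which matches the reported "expected: 2, got: 1".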

Fix: Stop the worker at the start of the test so the test owns the
buffer and @consumer_stats lifecycle synchronously. Mirrors the fix
applied in PR #5715
(commit 76683752f7)
for the adjacent #flush_stats test, which has the same race class.
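
A minimal sketch of the fix pattern, assuming a toy processor with a background thread: once the worker is stopped and joined, only the test thread touches the buffer, so both events reach serialization. ToyWorkerProcessor, its stop method, backlog_count, and the 0.01 s loop interval are hypothetical; the exact call the spec uses to stop the real worker is not shown in this description.

```ruby
# Hypothetical processor with an auto-spawned worker, loosely modeled on the
# description above. Not the dd-trace-rb implementation.
class ToyWorkerProcessor
  def initialize
    @mutex = Mutex.new
    @event_buffer = []
    @consumer_stats = []
    @stopped = false
    # The worker starts right away, like the auto-spawned thread in the PR.
    @worker = Thread.new { perform_loop }
  end

  # Ask the worker to exit and wait until it has finished.
  def stop
    @mutex.synchronize { @stopped = true }
    @worker.join
  end

  def track_kafka_consume(topic, partition, offset)
    @mutex.synchronize do
      @event_buffer << { topic: topic, partition: partition, offset: offset }
    end
  end

  def process_events
    @mutex.synchronize do
      @consumer_stats.concat(@event_buffer.shift(@event_buffer.length))
    end
  end

  def flush_stats
    @mutex.synchronize { @consumer_stats.clear }
  end

  def backlog_count
    @mutex.synchronize { @consumer_stats.length }
  end

  private

  # Drain and flush repeatedly until stop is requested.
  def perform_loop
    until @mutex.synchronize { @stopped }
      process_events
      flush_stats
      sleep 0.01
    end
  end
end

# Test body following the fix: stop the worker first, then drive the
# processor synchronously from the test thread.
def count_after_stopping_worker
  processor = ToyWorkerProcessor.new
  processor.stop
  processor.track_kafka_consume("orders", 0, 100)
  processor.track_kafka_consume("orders", 1, 200)
  processor.process_events
  processor.backlog_count
end

puts count_after_stopping_worker # prints 2
```

Joining the thread inside stop is what makes the result deterministic here: the worker can no longer drain or clear anything between the two pushes.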

Change log entry

None.

How to test the change?

The reproducer is in a companion PR, whose CI shows the race-forcing
variant (sleep 0.5 between pushes) failing deterministically. The
validation companion PR merges this fix with the reproducer, and its CI
passes, showing that the fix neutralizes the forced failure.

Companion PRs:

Validation results:

Root cause: the auto-spawned background worker thread races with the test
thread. Processor.new schedules a worker that runs perform_loop's first
iteration immediately (loop_wait_before_first_iteration? is false). If the
OS schedules the worker after the first track_kafka_consume but before the
second, the worker drains @event_buffer (consuming event1 into
@consumer_stats) and flush_stats then clears @consumer_stats, so the
test's later process_events only sees event2 and serialize_consumer_backlogs
returns 1 entry instead of 2 (the issue's exact symptom: expected 2, got 1).

Stop the worker at the start of the test so the test owns the buffer and
@consumer_stats lifecycle synchronously. Mirrors the fix applied in PR
#5715 (7668375) for the adjacent #flush_stats test, which has the same
race class.

Verified locally: reproducer (sleep 0.5 between pushes) fails with
expected 2, got 1; with this fix the same reproducer passes.

Fixes ruby-guild#281.

Co-Authored-By: Claude <noreply@anthropic.com>
@p-datadog added the "AI Generated" label (largely based on code generated by an AI or LLM; this label is the same across all dd-trace-* repos) on May 13, 2026
@dd-octo-sts bot added the "dev/testing" label (involves testing processes, e.g. RSpec) on May 13, 2026
datadog-datadog-prod-us1-2 Bot commented May 13, 2026

Tests

🎉 All green!

❄️ No new flaky tests detected
🧪 All tests passed

🎯 Code Coverage (details)
Patch Coverage: 100.00%
Overall Coverage: 97.15% (+0.00%)

🔗 Commit SHA: 5733f1b

@p-datadog p-datadog marked this pull request as ready for review May 13, 2026 23:08
@p-datadog p-datadog requested review from a team as code owners May 13, 2026 23:08