tbs: Add test confirming the potential data race #19948

ericywl · 2025-12-19T09:49:02Z

Summary

Fix potential data race between WriteTraceEvent in ProcessBatch and ReadTraceEvent in the sampling goroutine.

Performance

Baseline

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2718606               415.6 ns/op
BenchmarkProcess-14      2839388               396.6 ns/op
BenchmarkProcess-14      2951276               385.8 ns/op
BenchmarkProcess-14      2897508               390.0 ns/op
BenchmarkProcess-14      3140571               370.0 ns/op

Single Mutex

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2405176               428.1 ns/op
BenchmarkProcess-14      3260113               344.5 ns/op
BenchmarkProcess-14      3042301               363.3 ns/op
BenchmarkProcess-14      3153118               359.9 ns/op
BenchmarkProcess-14      3190714               362.7 ns/op

ShardLockReadWriter

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2899808               408.1 ns/op
BenchmarkProcess-14      3228548               352.0 ns/op
BenchmarkProcess-14      3240921               350.5 ns/op
BenchmarkProcess-14      3181731               371.2 ns/op
BenchmarkProcess-14      3088440               360.9 ns/op
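
The implementation of ShardLockReadWriter is not shown in this description. As a rough illustration of the general lock-sharding technique only, here is a minimal sketch that hashes a trace ID to one of N mutexes so that unrelated traces rarely contend. The names `shardedLocks`, `shardIndex`, `withLock` and `countWithShards` are hypothetical, and the FNV hash is an assumption, not necessarily what the PR uses. The panic on a non-positive shard count mirrors the "Panic if numShards <= 0" commit.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// shardedLocks is a hypothetical helper, not the PR's ShardLockReadWriter.
type shardedLocks struct {
	shards []sync.Mutex
}

func newShardedLocks(numShards int) *shardedLocks {
	if numShards <= 0 {
		panic("numShards must be greater than zero")
	}
	return &shardedLocks{shards: make([]sync.Mutex, numShards)}
}

// shardIndex hashes the trace ID so that one trace always maps to one shard.
func (s *shardedLocks) shardIndex(traceID string) int {
	h := fnv.New32a()
	h.Write([]byte(traceID)) // hash.Hash.Write never returns an error
	return int(h.Sum32() % uint32(len(s.shards)))
}

// withLock runs fn while holding the lock of the shard that owns traceID.
func (s *shardedLocks) withLock(traceID string, fn func()) {
	mu := &s.shards[s.shardIndex(traceID)]
	mu.Lock()
	defer mu.Unlock()
	fn()
}

// countWithShards increments one counter per trace from many goroutines.
// Each counter is only touched under the lock of its trace's shard.
func countWithShards(numShards, perTrace int) []int {
	locks := newShardedLocks(numShards)
	traces := []string{"trace1", "trace2"}
	counts := make([]int, len(traces))
	var wg sync.WaitGroup
	for i := 0; i < perTrace; i++ {
		for j, id := range traces {
			wg.Add(1)
			go func(j int, id string) {
				defer wg.Done()
				locks.withLock(id, func() { counts[j]++ })
			}(j, id)
		}
	}
	wg.Wait()
	return counts
}

func main() {
	fmt.Println(countWithShards(16, 100)) // prints "[100 100]"
}
```

Two traces that hash to the same shard still serialize against each other, so the shard count trades memory for contention.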

github-actions · 2025-12-19T09:49:14Z

🤖 GitHub comments

Just comment with:

run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

mergify · 2025-12-19T09:49:39Z

This pull request does not have a backport label. Could you fix it @ericywl? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

backport-7.17 is the label to automatically backport to the 7.17 branch.
backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
backport-9./d is the label to automatically backport to the 9./d branch. /d is the digit.
backport-active-all is the label that automatically backports to all active branches.
backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.

carsonip

Thanks. The setup looks about right, but the specific case under test is not the one I had in mind.

In this PR, txn1 and txn2 are both part of trace1, which doesn't make sense to me: one trace should have only one root transaction. In your current setup, what should happen to txn2 is undefined.

To be clearer, what I'd like to test is similar: txn1 is the root transaction of trace1, and txn2 is a child of txn1.

t1: APM Server receives txn1
t2: background sampling goroutine: APM Server makes the sampling decision for txn1
t2': APM Server receives txn2
t3: background sampling goroutine: marks trace1 as sampled

^ The above is a race: APM Server receives txn2 between t2 and t3, and as a result txn2 is lost forever. If txn2 arrives either before t2 or after t3, it is exported correctly.

It gets a bit theoretical but I believe it is possible. Lmk if you have any questions.
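
The timeline above can be replayed deterministically with a toy model. This is only a sketch: `toyStore`, `ingest` and `simulate` are hypothetical names, and it assumes that the sampling goroutine reads the buffered events at t2 and only marks the trace as sampled at t3, which is my reading of the timeline rather than the actual processor code.

```go
package main

import "fmt"

// toyStore is a hypothetical, single-threaded stand-in for the TBS storage.
type toyStore struct {
	sampled  map[string]bool
	buffered map[string][]string
}

// ingest mimics the ingest path: export directly when the trace is already
// marked as sampled, otherwise buffer the event for the sampling goroutine.
func (s *toyStore) ingest(traceID, txnID string, exported *[]string) {
	if s.sampled[traceID] {
		*exported = append(*exported, txnID)
		return
	}
	s.buffered[traceID] = append(s.buffered[traceID], txnID)
}

// simulate replays the timeline with txn2 arriving at the given point
// ("before-t2", "between-t2-t3" or "after-t3") and returns what was exported.
func simulate(txn2Arrival string) []string {
	s := &toyStore{sampled: map[string]bool{}, buffered: map[string][]string{}}
	var exported []string

	s.ingest("trace1", "txn1", &exported) // t1
	if txn2Arrival == "before-t2" {
		s.ingest("trace1", "txn2", &exported)
	}
	// t2 (assumed): the sampler decides to sample and reads buffered events.
	toPublish := append([]string(nil), s.buffered["trace1"]...)
	if txn2Arrival == "between-t2-t3" {
		s.ingest("trace1", "txn2", &exported) // t2': buffered, but read too late
	}
	s.sampled["trace1"] = true // t3
	exported = append(exported, toPublish...)
	if txn2Arrival == "after-t3" {
		s.ingest("trace1", "txn2", &exported)
	}
	return exported
}

func main() {
	for _, when := range []string{"before-t2", "between-t2-t3", "after-t3"} {
		fmt.Println(when, simulate(when))
	}
}
```

Only the "between-t2-t3" case exports txn1 alone. In the other two orderings txn2 is either read together with txn1 or exported directly because the trace is already marked as sampled.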

x-pack/apm-server/sampling/processor_test.go

carsonip

qq: should the test pass or fail here? I assume when the test passes, it means a race happened, right? If so, I think the test is correctly validating the race in its current state

x-pack/apm-server/sampling/processor_test.go

x-pack/apm-server/sampling/processor.go

elasticmachine · 2025-12-24T12:04:57Z

💚 Build Succeeded

Buildkite Build
Commit: 0878f16

History

💚 Build #8173 succeeded 619a092
💚 Build #8133 succeeded a96c059
💚 Build #8123 succeeded 8a36e41
💚 Build #8110 succeeded 67496ee

cc @ericywl

carsonip

thanks, the approach looks good now

carsonip · 2025-12-24T12:50:06Z

x-pack/apm-server/sampling/eventstorage/rw.go

+func (rw *lockedReadWriter) IsTraceSampled(traceID string) (bool, error) {
+	rw.mu.Lock()
+	defer rw.mu.Unlock()
+	return rw.rw.IsTraceSampled(traceID)
+}


I acknowledge that 8.x doesn't have the optimization to use an RWMutex, but given that the ingest hot path is IsTraceSampled->WriteTraceEvent, I wonder if swapping it to an RWMutex would further reduce lock contention. However, it may or may not show up in the benchmark, depending on GOMAXPROCS as well as the trace sampled hit rate (which will be 0, because the benchmark always generates a new trace ID).

carsonip

In the benchmarks, do you mind also running at a higher GOMAXPROCS, e.g. -cpu=14,100, to see if it makes any difference?

Add test confirming the potential data race

0894640

ericywl self-assigned this Dec 19, 2025

ericywl changed the title from "Add test confirming the potential data race" to "tbs: Add test confirming the potential data race" Dec 19, 2025

ericywl mentioned this pull request Dec 19, 2025

TBS: potential data loss in race condition between event arrival and receiving decision #17772

Open

ericywl and others added 3 commits December 19, 2025 18:05

Remove unnecessary sleeps

67496ee

Add assertion for transaction ids at the end

fd8db35

Merge branch 'main' into tbs-potential-data-race

8a36e41

carsonip reviewed Dec 19, 2025

x-pack/apm-server/sampling/processor_test.go (resolved)

x-pack/apm-server/sampling/processor_test.go (outdated, resolved)

ericywl force-pushed the tbs-potential-data-race branch from 1e2035e to 8f27d8c Compare December 22, 2025 05:12

Add parent id to transaction2

a96c059

ericywl force-pushed the tbs-potential-data-race branch from 675d174 to a96c059 Compare December 22, 2025 05:20

ericywl requested a review from carsonip December 22, 2025 05:21

carsonip reviewed Dec 22, 2025

x-pack/apm-server/sampling/processor_test.go (outdated, resolved)

x-pack/apm-server/sampling/processor_test.go (outdated, resolved)

ericywl added 2 commits December 23, 2025 20:40

Update potential race condition test

406d642

Try fixing race condition

619a092

ericywl requested a review from carsonip December 23, 2025 12:44

carsonip reviewed Dec 23, 2025

x-pack/apm-server/sampling/processor.go (outdated, resolved)

ericywl and others added 5 commits December 24, 2025 14:45

Fix bug where multiple ongoing transactions can race to delete first

f187e90

Add ShardLockReadWriter

85b0444

Panic if numShards <= 0

01029b6

Remove unnecessary code

7575b6f

Merge branch 'main' into tbs-potential-data-race

0878f16

carsonip reviewed Dec 24, 2025
