@ericywl

@ericywl ericywl commented Dec 19, 2025

Summary

Fix potential data race between WriteTraceEvent in ProcessBatch and ReadTraceEvent in the sampling goroutine.

Performance

Baseline

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2718606               415.6 ns/op
BenchmarkProcess-14      2839388               396.6 ns/op
BenchmarkProcess-14      2951276               385.8 ns/op
BenchmarkProcess-14      2897508               390.0 ns/op
BenchmarkProcess-14      3140571               370.0 ns/op

Single Mutex

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2405176               428.1 ns/op
BenchmarkProcess-14      3260113               344.5 ns/op
BenchmarkProcess-14      3042301               363.3 ns/op
BenchmarkProcess-14      3153118               359.9 ns/op
BenchmarkProcess-14      3190714               362.7 ns/op

ShardLockReadWriter

goos: darwin
goarch: arm64
pkg: github.com/elastic/apm-server/x-pack/apm-server/sampling
cpu: Apple M4 Pro
BenchmarkProcess-14      2899808               408.1 ns/op
BenchmarkProcess-14      3228548               352.0 ns/op
BenchmarkProcess-14      3240921               350.5 ns/op
BenchmarkProcess-14      3181731               371.2 ns/op
BenchmarkProcess-14      3088440               360.9 ns/op
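
The ShardLockReadWriter variant is not shown in this description. As a rough sketch only, assuming it shards its locks by trace ID so that unrelated traces rarely contend: the `shardLock` type, `numShards`, and the FNV hash below are illustrative assumptions, not the code in this PR.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sync"
)

// numShards is a hypothetical shard count chosen for illustration.
const numShards = 16

// shardLock holds one mutex per shard; a trace ID always maps to the same shard.
type shardLock struct {
	shards [numShards]sync.Mutex
}

// shardFor hashes the trace ID with FNV-1a and maps it to a shard index.
func (l *shardLock) shardFor(traceID string) int {
	h := fnv.New32a()
	h.Write([]byte(traceID))
	return int(h.Sum32() % numShards)
}

// withLock runs fn while holding the lock of the shard that owns traceID.
func (l *shardLock) withLock(traceID string, fn func()) {
	mu := &l.shards[l.shardFor(traceID)]
	mu.Lock()
	defer mu.Unlock()
	fn()
}

// countWithShardLock increments one counter per trace from n goroutines each,
// guarding every increment with the lock of the trace's shard.
func countWithShardLock(traceIDs []string, n int) []int {
	var l shardLock
	counts := make([]int, len(traceIDs))
	var wg sync.WaitGroup
	for i, id := range traceIDs {
		for j := 0; j < n; j++ {
			wg.Add(1)
			go func(i int, id string) {
				defer wg.Done()
				l.withLock(id, func() { counts[i]++ })
			}(i, id)
		}
	}
	wg.Wait()
	return counts
}

func main() {
	// Each trace should be counted exactly 100 times, with no lost updates.
	fmt.Println(countWithShardLock([]string{"trace1", "trace2", "trace3"}, 100))
}
```

Operations on the same trace ID still serialize, which is what the race fix needs, while operations on different trace IDs usually take different mutexes.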

@ericywl ericywl self-assigned this Dec 19, 2025
@ericywl ericywl changed the title from "Add test confirming the potential data race" to "tbs: Add test confirming the potential data race" Dec 19, 2025
@github-actions

🤖 GitHub comments

Just comment with:

  • run docs-build : Re-trigger the docs validation. (use unformatted text in the comment!)

@mergify

mergify bot commented Dec 19, 2025

This pull request does not have a backport label. Could you fix it @ericywl? 🙏
To fixup this pull request, you need to add the backport labels for the needed
branches, such as:

  • backport-7.17 is the label to automatically backport to the 7.17 branch.
  • backport-8./d is the label to automatically backport to the 8./d branch. /d is the digit.
  • backport-9./d is the label to automatically backport to the 9./d branch. /d is the digit.
  • backport-active-all is the label that automatically backports to all active branches.
  • backport-active-8 is the label that automatically backports to all active minor branches for the 8 major.
  • backport-active-9 is the label that automatically backports to all active minor branches for the 9 major.


@carsonip carsonip left a comment

Thanks. The setup looks about right, but it doesn't test the specific case I had in mind.

In this PR, txn1 and txn2 are both part of trace1, which doesn't make sense to me: one trace should have only one root transaction. In your current setup, what should happen to txn2 is undefined.

To be clearer, what I'd like to test is similar: txn1 is the root txn of trace1, and txn2 is a child of txn1.

t1: apm server receives txn1
t2: background sampling goroutine: apm server makes the sampling decision for txn1
t2': apm server receives txn2
t3: background sampling goroutine: marks trace1 as sampled

^ The above is a race: apm server receives txn2 between t2 and t3, and as a result txn2 is lost forever. If txn2 arrives either before t2 or after t3, it is exported correctly.

It gets a bit theoretical, but I believe it is possible. Lmk if you have any questions.
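
To make the interleaving concrete, here is a minimal sketch on a toy in-memory model. The `store` type and its methods are made up for illustration and are not apm-server's storage API. `racyInterleaving` replays t1, t2, t2', t3 in that order, and `lockedRun` shows that txn2 survives once the sampler's read-then-mark and the ingest's check-then-write each run under one lock.

```go
package main

import (
	"fmt"
	"sync"
)

// store is a toy model of the sampling storage, not apm-server's real type.
type store struct {
	mu       sync.Mutex
	buffered map[string][]string // traceID -> events awaiting a decision
	sampled  map[string]bool     // traceID -> marked as sampled
}

func newStore() *store {
	return &store{buffered: map[string][]string{}, sampled: map[string]bool{}}
}

// ingest returns true if the event can be exported right away because the
// trace is already marked as sampled; otherwise it buffers the event.
func (s *store) ingest(traceID, event string) bool {
	if s.sampled[traceID] {
		return true
	}
	s.buffered[traceID] = append(s.buffered[traceID], event)
	return false
}

// readBuffered returns a copy of the events buffered for the trace.
func (s *store) readBuffered(traceID string) []string {
	return append([]string(nil), s.buffered[traceID]...)
}

// racyInterleaving replays the timeline on one goroutine, with no locking.
func racyInterleaving() []string {
	s := newStore()
	s.ingest("trace1", "txn1")           // t1: txn1 is buffered
	exported := s.readBuffered("trace1") // t2: sampler reads only txn1
	if s.ingest("trace1", "txn2") {      // t2': not yet sampled, so txn2 is buffered
		exported = append(exported, "txn2")
	}
	s.sampled["trace1"] = true // t3: too late, txn2 was never read
	return exported
}

// lockedRun runs the sampler and the ingest of txn2 concurrently, each as one
// critical section, so txn2 lands either before t2 or after t3.
func lockedRun() []string {
	s := newStore()
	s.ingest("trace1", "txn1")
	var wg sync.WaitGroup
	direct := false
	wg.Add(1)
	go func() {
		defer wg.Done()
		s.mu.Lock()
		defer s.mu.Unlock()
		direct = s.ingest("trace1", "txn2")
	}()
	s.mu.Lock()
	exported := s.readBuffered("trace1")
	s.sampled["trace1"] = true
	s.mu.Unlock()
	wg.Wait()
	if direct {
		exported = append(exported, "txn2")
	}
	return exported
}

func main() {
	fmt.Println(racyInterleaving()) // txn2 is missing
	fmt.Println(lockedRun())        // both txn1 and txn2 are exported
}
```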

@ericywl ericywl force-pushed the tbs-potential-data-race branch from 1e2035e to 8f27d8c Compare December 22, 2025 05:12
@ericywl ericywl force-pushed the tbs-potential-data-race branch from 675d174 to a96c059 Compare December 22, 2025 05:20
@ericywl ericywl requested a review from carsonip December 22, 2025 05:21

@carsonip carsonip left a comment

qq: should the test pass or fail here? I assume that when the test passes, it means a race happened, right? If so, I think the test correctly validates the race in its current state.

@ericywl ericywl requested a review from carsonip December 23, 2025 12:44
@elasticmachine

💚 Build Succeeded

cc @ericywl


@carsonip carsonip left a comment

thanks, the approach looks good now

Comment on lines +223 to +227
func (rw *lockedReadWriter) IsTraceSampled(traceID string) (bool, error) {
	rw.mu.Lock()
	defer rw.mu.Unlock()
	return rw.rw.IsTraceSampled(traceID)
}

I acknowledge 8.x doesn't have the optimization to use an RWMutex, but given that the ingest hot path is IsTraceSampled -> WriteTraceEvent, I wonder if swapping it to an RWMutex would further reduce lock contention. However, it may or may not show up in the benchmark, depending on GOMAXPROCS as well as the sampled-trace hit rate (which will be 0, because the benchmark always generates a new trace ID).
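
For illustration, one way the swap could look, as a sketch rather than a proposed patch. The `sampledStore` interface, the `mapStore` backing type, and the `rwLockedReadWriter` name are assumptions made up for this example; only the `mu` and `rw` field names and the `IsTraceSampled` signature come from the diff above.

```go
package main

import (
	"fmt"
	"sync"
)

// sampledStore is a hypothetical, minimal storage interface for this sketch.
type sampledStore interface {
	IsTraceSampled(traceID string) (bool, error)
	WriteTraceSampled(traceID string, sampled bool) error
}

// mapStore is a toy backing store; it is not safe for concurrent use on its own.
type mapStore struct {
	m map[string]bool
}

func (s *mapStore) IsTraceSampled(traceID string) (bool, error) {
	return s.m[traceID], nil
}

func (s *mapStore) WriteTraceSampled(traceID string, sampled bool) error {
	s.m[traceID] = sampled
	return nil
}

// rwLockedReadWriter guards reads with RLock so concurrent IsTraceSampled
// calls do not block each other; writes still take the exclusive lock.
type rwLockedReadWriter struct {
	mu sync.RWMutex
	rw sampledStore
}

func (rw *rwLockedReadWriter) IsTraceSampled(traceID string) (bool, error) {
	rw.mu.RLock()
	defer rw.mu.RUnlock()
	return rw.rw.IsTraceSampled(traceID)
}

func (rw *rwLockedReadWriter) WriteTraceSampled(traceID string, sampled bool) error {
	rw.mu.Lock()
	defer rw.mu.Unlock()
	return rw.rw.WriteTraceSampled(traceID, sampled)
}

func main() {
	rw := &rwLockedReadWriter{rw: &mapStore{m: map[string]bool{}}}

	// Several readers may hold the read lock at the same time.
	var wg sync.WaitGroup
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			_, _ = rw.IsTraceSampled("trace1")
		}()
	}
	_ = rw.WriteTraceSampled("trace1", true)
	wg.Wait()

	sampled, _ := rw.IsTraceSampled("trace1")
	fmt.Println(sampled)
}
```

Any gain depends on how many goroutines call IsTraceSampled concurrently, so it is most likely to show up, if at all, at higher GOMAXPROCS.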


@carsonip carsonip left a comment

In the benchmarks, do you mind also running at a higher GOMAXPROCS, e.g. -cpu=14,100, to see if it makes any difference?


4 participants