tbs: Add test confirming the potential data race #19948
This pull request does not have a backport label. Could you fix it @ericywl? 🙏
Thanks. The setup looks about right, but the specific case that needs to be tested is not what I'm thinking about.
In this PR, txn1 and txn2 are both part of trace1, which doesn't make sense to me: one trace should only have one root transaction. In your current setup, what should happen to txn2 is undefined.
To be clearer, what I'd like to test is similar: txn1 is the root txn of trace1, and txn2 is a child of txn1.
At time t1: apm server receives txn1
At t2: background sampling goroutine makes the sampling decision for txn1
At t2': apm server receives txn2
At t3: background sampling goroutine marks trace1 as sampled
^ the above is a race: because apm server receives txn2 between t2 and t3, txn2 is lost forever. If txn2 arrives either before t2 or after t3, it is exported correctly.
It gets a bit theoretical but I believe it is possible. Lmk if you have any questions.
carsonip
left a comment
qq: should the test pass or fail here? I assume that when the test passes, it means a race happened, right? If so, I think the test correctly validates the race in its current state.
💚 Build Succeeded
cc @ericywl
carsonip
left a comment
thanks, the approach looks good now
```go
func (rw *lockedReadWriter) IsTraceSampled(traceID string) (bool, error) {
	rw.mu.Lock()
	defer rw.mu.Unlock()
	return rw.rw.IsTraceSampled(traceID)
}
```
I acknowledge 8.x doesn't have the optimization to use an RWMutex, but given that the ingest hot path is IsTraceSampled -> WriteTraceEvent, I wonder if swapping it to an RWMutex would further reduce lock contention. However, it may or may not show up in the benchmark, depending on GOMAXPROCS as well as the trace-sampled hit rate (which will be 0 here, because the benchmark always generates new trace IDs).
carsonip
left a comment
in the benchmarks, do you mind also running at a higher GOMAXPROCS, e.g. -cpu=14,100, to see if it makes any difference?
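For reference, a possible invocation (the package path and benchmark pattern are placeholders, not taken from this PR); go test's -cpu flag re-runs each benchmark once per listed GOMAXPROCS value:

```shell
# -run='^$' skips unit tests; -benchmem also reports allocations.
go test -run='^$' -bench=. -benchmem -cpu=14,100 ./path/to/sampling
```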
Summary
Fix a potential data race between `WriteTraceEvent` in `ProcessBatch` and `ReadTraceEvent` in the sampling goroutine.
Performance
Baseline
Single Mutex
ShardLockReadWriter