-
Notifications
You must be signed in to change notification settings - Fork 9.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix potential deadlock between Revoke and (Grant or Checkpoint) #14080
Conversation
Codecov Report
@@ Coverage Diff @@
## main #14080 +/- ##
==========================================
- Coverage 75.25% 75.05% -0.21%
==========================================
Files 452 452
Lines 36791 36767 -24
==========================================
- Hits 27688 27594 -94
- Misses 7369 7428 +59
- Partials 1734 1745 +11
Flags with carried forward coverage won't be shown. Click here to find out more.
Continue to review full report at Codecov.
|
|
||
// We shouldn't delete the lease inside the transaction lock, otherwise | ||
// it may lead to deadlock with Grant or Checkpoint operations, which | ||
// acquire the le.mu firstly and then the batchTx lock. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why Revoke doesn't follow Grant
and Checkpoint
in executing all changes under lock? If we look into code of similar caching structs (manages both in memory and bbolt representation of state) like RaftCluster
, it does everything including backend changes under lock.
Do you know why on line 337 we unlock le.mu
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you know why on line 337 we unlock le.mu?
My understanding is the following transaction (which deletes the lease and all keys attached to the lease) may be time consuming if there are lots of keys being attached to the lease. Actually there is no need to acquire the le.mu during the backend transaction.
But both Grant
and Checkpoint
follow the pattern lease operation --> backend db operation --> lease operation
, so for simplicity, they just hold thele.mu
for the whole process.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that there is a chance of deadlock here, however I'm worried about removing lease from leaseMap and releasing lock before db operation is executed. This might be totally unfounded worry as backend operation as flushed only once 5 seconds, however I need to think a little on it. Have you thought on it more?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It should be OK. The le.mu
is used to protect the data structures inside lessor
, and the backend transaction is protected by separate transaction lock. Just as I mentioned previously, there is no reason to hold the le.mu
during the transaction.
Please note that during Grant
and Checkpoint
, lessor may also persist the lease, but both of them will require the transaction lock before persisting the lease data. So it's safe.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my understanding,
before the change
Grant P1 (stands for Process 1)
1. acquires lessor.mu R1 (stands for Resource 1)
2. acquires backend tx R2
3. releases backend tx R2
4. releases lessor.mu R1
Revoke P2
1. acquires lessor.mu R1
2. releases lessor.mu R1
3. acquires backend tx R2
4. acquires lessor.mu R1
5. releases backend tx R2
6. releases lessor.mu R1
There is a potential deadlock for sure
- P1 is holding R1 and wait to acquire R2
- P2 is holding R2 and wait to acquire R1
Why not we move P2 step 4 to happen after step 5? Namely delete(le.leaseMap, id)
after txn.End()
which releases the backend transaction.
The benefit of this idea is the rangeDeleter will eveually call lessor.Detach() to look up leaseID in lessor.leaseMap
. So the order is preserved and won't cause the issue that #16035 is trying to fix.
Thanks @serathius , merging... |
The Grant and Checkpoint acquire the
lessor.mu
firstly, and acquirebatchTx lock
secondly when they are still holdinglessor.mu
.But the Revoke acquires the
transaction lock
firstly, and secondly acquires thelessor.mu
lock when it's still holding the transaction lock.Obviously there is a potential deadlock between Revoke and (Grant or Checkpoint).