Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix potential deadlock between Revoke and (Grant or Checkpoint) #14080

Merged
merged 1 commit into from
Jun 9, 2022

Conversation

ahrtr
Copy link
Member

@ahrtr ahrtr commented May 29, 2022

The Grant and Checkpoint acquire the lessor.mu firstly, and acquire batchTx lock secondly when they are still holding lessor.mu.

But the Revoke acquires the transaction lock firstly, and secondly acquires the lessor.mu lock when it's still holding the transaction lock.

Obviously there is a potential deadlock between Revoke and (Grant or Checkpoint).

@codecov-commenter
Copy link

codecov-commenter commented May 29, 2022

Codecov Report

Merging #14080 (dabf697) into main (ce77d83) will decrease coverage by 0.20%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main   #14080      +/-   ##
==========================================
- Coverage   75.25%   75.05%   -0.21%     
==========================================
  Files         452      452              
  Lines       36791    36767      -24     
==========================================
- Hits        27688    27594      -94     
- Misses       7369     7428      +59     
- Partials     1734     1745      +11     
Flag Coverage Δ
all 75.05% <100.00%> (-0.21%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
server/lease/lessor.go 88.80% <100.00%> (+0.47%) ⬆️
server/etcdserver/api/rafthttp/peer_status.go 87.87% <0.00%> (-12.13%) ⬇️
server/proxy/grpcproxy/register.go 69.76% <0.00%> (-9.31%) ⬇️
server/etcdserver/api/v3lock/lock.go 61.53% <0.00%> (-8.47%) ⬇️
server/etcdserver/api/rafthttp/peer.go 87.01% <0.00%> (-8.45%) ⬇️
server/etcdserver/api/v2stats/queue.go 20.83% <0.00%> (-7.38%) ⬇️
client/v3/namespace/watch.go 87.87% <0.00%> (-6.07%) ⬇️
client/v3/concurrency/session.go 88.63% <0.00%> (-4.55%) ⬇️
server/proxy/grpcproxy/watch.go 92.48% <0.00%> (-4.05%) ⬇️
client/v3/leasing/cache.go 87.77% <0.00%> (-3.89%) ⬇️
... and 19 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce77d83...dabf697. Read the comment docs.

@ahrtr ahrtr force-pushed the lease_revoke_lock branch from ac12581 to 3fb4683 Compare May 29, 2022 22:52
@ahrtr ahrtr changed the title Release le.mu right after removing the lease ID from the map Fix potential deadlock between Revoke and (Grant or Checkpoint) May 29, 2022
@ahrtr ahrtr force-pushed the lease_revoke_lock branch from 3fb4683 to 7a20221 Compare May 30, 2022 01:25
@ahrtr
Copy link
Member Author

ahrtr commented May 31, 2022

cc @serathius @ptabor @spzala


// We shouldn't delete the lease inside the transaction lock, otherwise
// it may lead to deadlock with Grant or Checkpoint operations, which
// acquire the le.mu firstly and then the batchTx lock.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why Revoke doesn't follow Grant and Checkpoint in executing all changes under lock? If we look into code of similar caching structs (manages both in memory and bbolt representation of state) like RaftCluster, it does everything including backend changes under lock.

Do you know why on line 337 we unlock le.mu?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you know why on line 337 we unlock le.mu?

My understanding is the following transaction (which deletes the lease and all keys attached to the lease) may be time consuming if there are lots of keys being attached to the lease. Actually there is no need to acquire the le.mu during the backend transaction.

But both Grant and Checkpoint follow the pattern lease operation --> backend db operation --> lease operation, so for simplicity, they just hold thele.mufor the whole process.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that there is a chance of deadlock here, however I'm worried about removing lease from leaseMap and releasing lock before db operation is executed. This might be totally unfounded worry as backend operation as flushed only once 5 seconds, however I need to think a little on it. Have you thought on it more?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be OK. The le.mu is used to protect the data structures inside lessor, and the backend transaction is protected by separate transaction lock. Just as I mentioned previously, there is no reason to hold the le.mu during the transaction.

Please note that during Grant and Checkpoint, lessor may also persist the lease, but both of them will require the transaction lock before persisting the lease data. So it's safe.

Copy link
Member

@chaochn47 chaochn47 Jun 16, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my understanding,

before the change

Grant P1 (stands for Process 1)
1. acquires lessor.mu R1 (stands for Resource 1)
2. acquires backend tx R2
3. releases backend tx R2
4. releases lessor.mu R1

Revoke P2
1. acquires lessor.mu R1
2. releases lessor.mu R1
3. acquires backend tx R2
4. acquires lessor.mu R1
5. releases backend tx R2
6. releases lessor.mu R1

There is a potential deadlock for sure

  • P1 is holding R1 and wait to acquire R2
  • P2 is holding R2 and wait to acquire R1

Why not we move P2 step 4 to happen after step 5? Namely delete(le.leaseMap, id) after txn.End() which releases the backend transaction.

The benefit of this idea is the rangeDeleter will eveually call lessor.Detach() to look up leaseID in lessor.leaseMap. So the order is preserved and won't cause the issue that #16035 is trying to fix.

@ahrtr ahrtr force-pushed the lease_revoke_lock branch from 7a20221 to dabf697 Compare June 7, 2022 10:46
@ahrtr
Copy link
Member Author

ahrtr commented Jun 9, 2022

Thanks @serathius , merging...

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants