[fix] Write stuck due to pending add callback by multiple threads #4557
Motivation
Background: the normal steps of adding an entry
PendingAddOp.writeComplete after receiving the response from BK servers.
Background: the steps of disconnection
PendingAddOp.writeComplete. You can reproduce this flow by the new test testAddEntriesCallbackWithBKClientThread.
Issue-1: write stuck due to pending add callback by multiple threads
Ensemble size: 3, write quorum: 3, ack quorum: 2
- client->BK1, client->BK2, client->BK3
- ack: 1/3
- ack: 2/3
- complete, since ack quorum is 2/3
- PendingAddOp.writeComplete
  - thread: bookkeeper workers
  - thread: client-server io
Since multiple threads can trigger all successful callbacks in the pending queue, the following race condition may occur [code-2]. thread-1 and thread-2 may be triggered by different PendingAddOps.
- thread-1: success
- thread-2: success
- thread-1: queue.pop
- thread-2: queue.pop
[1] code-link: https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/PendingAddOp.java#L307
[2] code-link: https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L2092-L2124
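The interleaving above can be replayed deterministically in a single thread. The sketch below is only an illustration under stated assumptions: `PendingQueueRaceSketch`, `Op`, and `replayInterleaving` are hypothetical names, and the queue is a plain `ArrayDeque`, not BookKeeper's actual classes or the code at [2]. It shows how two callers that both check the same head and then both pop can remove an op that was never checked.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Hypothetical, simplified model of a pending-add queue; not BookKeeper's real classes. */
final class PendingQueueRaceSketch {

    /** Stand-in for a pending add operation. */
    static final class Op {
        final long entryId;
        final boolean completed;

        Op(long entryId, boolean completed) {
            this.entryId = entryId;
            this.completed = completed;
        }
    }

    /** Replays the interleaving step by step and returns the entry ids that were popped. */
    static List<Long> replayInterleaving() {
        Deque<Op> queue = new ArrayDeque<>();
        queue.add(new Op(0, true));   // already reached the ack quorum
        queue.add(new Op(1, false));  // still waiting for acks

        List<Long> popped = new ArrayList<>();

        // Step 1: thread-1 peeks the head and sees success.
        Op seenByThread1 = queue.peek();
        // Step 2: thread-2 peeks the same head and also sees success.
        Op seenByThread2 = queue.peek();

        // Step 3: thread-1 pops and removes entry 0, as intended.
        if (seenByThread1.completed) {
            popped.add(queue.pop().entryId);
        }
        // Step 4: thread-2 pops based on its stale check and removes entry 1,
        // which has not completed.
        if (seenByThread2.completed) {
            popped.add(queue.pop().entryId);
        }
        return popped;
    }

    public static void main(String[] args) {
        System.out.println(replayInterleaving());
    }
}
```

In this model the second pop acts on entry 1 although only entry 0 was ever checked, which is the kind of check-then-act gap the race describes.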
Issue-2: ledger will be closed with an incorrect length
Since the task that triggers all successful callbacks in the pending queue may run in the IO thread, that task and closing the ledger may execute concurrently.
- io-thread: checks success
- worker-thread: adjusts ledger.length for the op which was popped out from the queue [code-3]
- io-thread: queue.pop, and pop nothing
- io-thread: updates ledger.LAC
The variables ledger.LAC and ledger.length do not match.
[3] code-link: https://github.com/apache/bookkeeper/blob/master/bookkeeper-server/src/main/java/org/apache/bookkeeper/client/LedgerHandle.java#L2076-L2084
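A single-threaded replay can make the mismatch concrete. This is a minimal sketch, assuming that closing the ledger drains the pending queue and subtracts each drained op's length, and that the callback task advances the LAC after its success check. `CloseRaceSketch`, `Op`, and `replayInterleaving` are hypothetical names, not BookKeeper's actual code.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical, simplified model of the close-versus-callback race. */
final class CloseRaceSketch {

    /** Stand-in for a pending add operation. */
    static final class Op {
        final long entryId;
        final long entryLength;
        final boolean completed;

        Op(long entryId, long entryLength, boolean completed) {
            this.entryId = entryId;
            this.entryLength = entryLength;
            this.completed = completed;
        }
    }

    /** Replays the interleaving and returns {lac, length}. */
    static long[] replayInterleaving() {
        Deque<Op> queue = new ArrayDeque<>();
        long length = 0;  // models ledger.length
        long lac = -1;    // models ledger.LAC, -1 means nothing confirmed

        // One 10-byte entry has been added and has reached the ack quorum.
        Op op = new Op(0, 10, true);
        queue.add(op);
        length += op.entryLength;

        // io-thread: checks the head of the queue and sees success.
        Op head = queue.peek();
        boolean success = head != null && head.completed;

        // worker-thread (closing the ledger): drains the queue and subtracts
        // the length of every op it popped.
        Op drained;
        while ((drained = queue.poll()) != null) {
            length -= drained.entryLength;
        }

        // io-thread: pops nothing because the queue is already empty,
        // yet still advances the LAC based on its earlier check.
        if (success) {
            queue.poll();
            lac = head.entryId;
        }
        return new long[] {lac, length};
    }

    public static void main(String[] args) {
        long[] result = replayInterleaving();
        System.out.println("LAC=" + result[0] + ", length=" + result[1]);
    }
}
```

In this model the LAC says entry 0 is confirmed while the length says the ledger holds zero bytes, so the two variables disagree.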
The issue we encountered
A Pulsar topic is stuck at the ClosingLedger state.
pulsar topic stats
logs
Changes
Switch the thread to bookkeeper workers if the connection is broken.
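The general idea of handing the completion over to one designated worker thread can be sketched with plain JDK executors. This is an illustration only: `SwitchThreadSketch`, `WORKER`, and `completeOnWorker` are hypothetical names, and a single-thread `ExecutorService` stands in for the bookkeeper workers; it is not the code changed by this PR.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

/** Hypothetical sketch: always run the completion on one designated worker thread. */
final class SwitchThreadSketch {

    static final String WORKER_NAME = "sketch-worker";

    // A single daemon thread standing in for the worker pool, so that all
    // completions are serialized on the same thread.
    static final ExecutorService WORKER = Executors.newSingleThreadExecutor(runnable -> {
        Thread thread = new Thread(runnable, WORKER_NAME);
        thread.setDaemon(true);
        return thread;
    });

    /**
     * Submits the completion to the worker instead of running it on the
     * calling thread, and returns the name of the thread that ran it.
     */
    static String completeOnWorker() {
        Callable<String> completion = () -> Thread.currentThread().getName();
        try {
            return WORKER.submit(completion).get();
        } catch (Exception e) {
            throw new IllegalStateException("completion failed", e);
        }
    }

    public static void main(String[] args) {
        // Called from the main thread here, standing in for an IO thread.
        System.out.println(completeOnWorker());
    }
}
```

Because every completion is funneled through the same thread, the check-then-pop sequence on the pending queue no longer interleaves between two threads in this model.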