Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

KAFKA-18575: Transaction Version 2 doesn't correctly handle race condition with completing and new transaction #18604

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

jolshan
Copy link
Member

@jolshan jolshan commented Jan 17, 2025

There is a subtle race condition with transactions V2 if a transaction is still completing when checking if we need to add a partition, but it completes when the request reaches the coordinator.

One approach was to remove the verification for TV2 and just check the epoch on write, but a simpler one is to simply return concurrent transactions from the partition leader (before attempting to add the partition). I've done this and added a test for this behavior.

Locally, I reproduced the race but adding a 1 second sleep when handling the WriteTxnMarkersRequest and a 2 second delay before adding the partition to the AddPartitionsToTxnManager. Without this change, the race happened on every second transaction as the first one completed. With this change, the error went away.

As a followup, we may want to clean up some of the code and comments with respect to verification as the code is used by both TV0 + verification and TV2. But that doesn't need to complete for 4.0. This does :)

@github-actions github-actions bot added core Kafka Broker small Small PRs labels Jan 17, 2025
@jolshan jolshan added the transactions Transactions and EOS label Jan 17, 2025
@@ -1052,7 +1057,7 @@ class UnifiedLog(@volatile var logStartOffset: Long,
}

private def batchMissingRequiredVerification(batch: MutableRecordBatch, requestVerificationGuard: VerificationGuard): Boolean = {
producerStateManager.producerStateManagerConfig().transactionVerificationEnabled() && !batch.isControlBatch &&
producerStateManager.producerStateManagerConfig().transactionVerificationEnabled() &&
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason to move the check for control batch out of this function?

Copy link
Member Author

@jolshan jolshan Jan 18, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the check to outside the if condition (short-circuiting the stament before hasOngoingTransaction otherwise we can't write commit markers)
I can put this back, but it will just be a duplicate check.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No, this is good. I'm just thinking if having this function makes the code better or worse -- it's really called exactly from one place and looks like the order of checks is important and can lead to subtle bugs.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Kafka Broker small Small PRs transactions Transactions and EOS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants