Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MINOR: Improve Producer error message for failed metadata update #18587

Open
wants to merge 1 commit into
base: trunk
Choose a base branch
from

Conversation

mjsax
Copy link
Member

@mjsax mjsax commented Jan 17, 2025

We should provide the same informative error message for both timeout cases.

We should provide the same informative error message for both timeout
cases.
@mjsax
Copy link
Member Author

mjsax commented Jan 17, 2025

Java 17

Found 6 test failures:
FAILED ❌ SaslOAuthBearerSslEndToEndAuthorizationTest > testNoConsumeWithoutDescribeAclViaSubscribe(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ MetricsDuringTopicCreationDeletionTest > "testMetricsDuringTopicCreateDelete(String).quorum=kraft"
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer

Java 23

Found 5 test failures:
FAILED ❌ SaslPlainPlaintextConsumerTest > testCoordinatorFailover(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer
Found 1 flaky test failures:
FLAKY ⚠️  AbstractCoordinatorTest > testWakeupAfterSyncGroupReceivedExternalCompletion()

Copy link
Collaborator

@kirktrue kirktrue left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR @mjsax! It seems like a straightforward improvement to the error logging. Just had one very minor question based on curiosity. Thanks!

Comment on lines +1132 to +1138
private String getErrorMessage(Integer partitionsCount, String topic, Integer partition, long maxWaitMs) {
return partitionsCount == null ?
String.format("Topic %s not present in metadata after %d ms.",
topic, maxWaitMs) :
String.format("Partition %d of topic %s with partition count %d is not present in metadata after %d ms.",
partition, topic, partitionsCount, maxWaitMs);
}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in including the partition value in the error message if it's non-null?

On line 1085 the method waitOnMetadata() exits when partitionsCount is non-null but partition is null. It seems like it's possible for partitionsCount to be null but partition non-null.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 3 cases too:

  1. partitionsCount == null meaning we don't know how many partitions there are
  2. partitionsCount != null && partition == null meaning we do know how many partitions there are but no partition was specified
  3. partitionsCount != null && partition != null meaning we do know how many partitions there are and a partition was specified

Seems to me that 3 messages would be best so that we never get a "null" as an ugly insert. wdyt?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any value in including the partition value in the error message if it's non-null?

Yes. partition is an input parameter to the method, and it's only null if the caller does not care about the metadata of a specific partitions (as pointed out by Andrew). So it's if not-null, the user cares and we should log it.

If partition != null, it helps to identify the case of a non-existing partition the user wants to write into. Assume a topic with 4 partitions, and user specific partition 5 as explicit argument when calling send(). The producer would try to learn about the leader broker for partition 5, but it does not exist, and it would loop until max.block.ms is exhausted and finally fail with: Partition 5 of topic <FOO> with partition count 4 is not present in metadata after XXX ms. -- Then we can suspect that the topic may really only have 4 partitions... W/o Partition 5 it's totally unclear why the metadata could not be found and why we hit max.block.ms on send(); the generic error Topic %s not present in metadata after %d ms. does not help us to identify this case.

Or did you mean, if partition == null? For this case, I agree it's not really helpful to log partition (as the user does not care anyway), but it think we don't log it anyway given the current code... (cf below)

Seems to me that 3 messages would be best so that we never get a "null" as an ugly insert. wdyt?

I believe the code is correct as-is. There is only two cases:

  1. partition == null. If we get metadata successfully, it implies that partitionCount != null and we just move on and don't throw TimeoutException. If we cannot get any metadata for the topic, we stay with partitionCount == null and use Topic %s not present in metadata after %d ms. for the TimeoutException.

  2. partition != null. For this case, if we get metadata successfully, we have a partitionCount, but we might still fail, if we don't get the metadata for the specified partition. For this case, we would use Partition %d of topic %s with partition count %d is not present in metadata after %d ms as error message. The other failure case is, that we don't get any metadata for the topic at all, and thus partitionCount == null and we still log the right thing. The fact that partition != null does not matter if partitionCount == null.

Hence, case (3) is not an error case... If we did get metadata (partitionCount != null) and if the user does not care (partition == null), we will never throw TimeoutException.

@kirktrue
Copy link
Collaborator

BTW, I saw several other branches that are recently failing with similar test failures, including one of mine 🤷‍♂️

Comment on lines +1132 to +1138
private String getErrorMessage(Integer partitionsCount, String topic, Integer partition, long maxWaitMs) {
return partitionsCount == null ?
String.format("Topic %s not present in metadata after %d ms.",
topic, maxWaitMs) :
String.format("Partition %d of topic %s with partition count %d is not present in metadata after %d ms.",
partition, topic, partitionsCount, maxWaitMs);
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there are 3 cases too:

  1. partitionsCount == null meaning we don't know how many partitions there are
  2. partitionsCount != null && partition == null meaning we do know how many partitions there are but no partition was specified
  3. partitionsCount != null && partition != null meaning we do know how many partitions there are and a partition was specified

Seems to me that 3 messages would be best so that we never get a "null" as an ugly insert. wdyt?

@mjsax
Copy link
Member Author

mjsax commented Jan 18, 2025

BTW, I saw several other branches that are recently failing with similar test failures, including one of mine 🤷‍♂️

Thanks. There is already a ticket for ClientIdQuotaTest, and it did actually pass locally for me. So I think we can ignore this one. -- Just wanted to make sure for the other test, that's it's flakiness. I think we will need to merge this PR w/o a green build, if we want to get it into 4.0...

Java 17:

Found 5 test failures:
FAILED ❌ DeleteSegmentsByRetentionTimeTest > executeTieredStorageTest(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ UserClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=consumer
Found 2 flaky test failures:
FLAKY ⚠️  DynamicConnectionQuotaTest > "testDynamicConnectionQuota(String).quorum=kraft"
FLAKY ⚠️  AbstractCoordinatorTest > testWakeupAfterSyncGroupSentExternalCompletion()

Java 23:

Found 3 test failures:
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=classic
FAILED ❌ ClientIdQuotaTest > testThrottledProducerConsumer(String, String).quorum=kraft.groupProtocol=consumer
FAILED ❌ ClientIdQuotaTest > testQuotaOverrideDelete(String, String).quorum=kraft.groupProtocol=classic
Found 2 flaky test failures:
FLAKY ⚠️  AbstractCoordinatorTest > testWakeupAfterSyncGroupReceivedExternalCompletion()
FLAKY ⚠️  EagerConsumerCoordinatorTest > testOutdatedCoordinatorAssignment()

Copy link
Member

@AndrewJSchofield AndrewJSchofield left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm. Thanks for the patch.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants