[v26.1.x] kafka: make metadata return term 0 when missing info #30781
Merged
andrwng merged 8 commits on Jun 18, 2026
Brings in the franz-go fix that retries metadata when the first metadata load reports per-partition errors. The metadata handler change in this PR returns leader_not_available for a partition whose term is not yet known (e.g. just after topic creation); without this fix franz-go producers ignore that error on the first load and stall until the metadata max age. https://github.com/twmb/franz-go/blob/8268a5d078c01d29ca0daa1748fac264e0fc2f11/pkg/kgo/metadata.go#L1011 (cherry picked from commit aad6922)
This is a functional revert of redpanda-data@a441290, making the handler return term_id(0) instead of -1 when there is no term metadata for a given partition. The Java client, for instance, would treat -1 as a signal that this broker doesn't have reliable metadata, and would drop cached epochs as a result[1], which interferes with truncation detection from KIP-320.

Note that returning -1 was previously done as a fix for flakiness of PartitionBalancerTest.

This commit also updates the metadata handler to return an error:
- In the Java client, the term_id(0) is still used, but the error is treated as a signal to request another update immediately[2].
- franz-go skips processing the partition altogether if there is an error, regardless of whether there is a missing term, opting to retry later[3].

Note that the lack of error handling in addition to returning -1 is the cause of flakiness of WriteCachingFailureInjectionE2ETest.

With this commit, both PartitionBalancerTest and WriteCachingFailureInjectionE2ETest pass reliably.

Relevant Redpanda commits in reverse chronological order:
- redpanda-data@a441290
- redpanda-data@7a60b75
- redpanda-data@86d583b

[1] https://github.com/apache/kafka/blob/9529003fffd93a9d7e3f6ff7ab081ed84942fd13/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L596-L601
[2] https://github.com/apache/kafka/blob/9529003fffd93a9d7e3f6ff7ab081ed84942fd13/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L519-L528
[3] https://github.com/twmb/franz-go/blob/8268a5d078c01d29ca0daa1748fac264e0fc2f11/pkg/kgo/metadata.go#L1011

(cherry picked from commit 1e1cdfe)
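To illustrate why -1 and term 0 plus an error behave differently on the client side, here is a simplified model of the Java-client behavior described in this commit message. It is not code from either client: `epochCache`, `partitionUpdate`, and `apply` are hypothetical names, and the rule that a cached epoch only moves forward is an assumption about the client, not something this PR states.

```go
package main

// noEpoch stands for the -1 "no leader epoch" value discussed above.
const noEpoch int32 = -1

// partitionUpdate is a hypothetical, trimmed-down view of one partition
// entry in a metadata response.
type partitionUpdate struct {
	Partition   int32
	LeaderEpoch int32
	HasError    bool // e.g. leader_not_available
}

// epochCache is a hypothetical client-side map of partition -> last seen epoch.
type epochCache map[int32]int32

// apply folds one partition entry into the cache and reports whether the
// client should ask for fresh metadata immediately.
func (c epochCache) apply(u partitionUpdate) (refresh bool) {
	cur, ok := c[u.Partition]
	if u.LeaderEpoch == noEpoch {
		// -1: metadata treated as unreliable, so the cached epoch is dropped.
		// Losing it is what gets in the way of KIP-320 truncation detection.
		delete(c, u.Partition)
	} else if !ok || u.LeaderEpoch >= cur {
		// Assumption: the cached epoch never moves backwards, so a
		// placeholder epoch of 0 cannot overwrite a newer cached value.
		c[u.Partition] = u.LeaderEpoch
	}
	// A per-partition error is the signal to refresh right away.
	return u.HasError
}
```

Under this model, a response carrying epoch 0 with an error leaves a cached epoch of 5 untouched and triggers a refresh, while a response carrying -1 removes the cached entry.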
A broker can briefly return a retriable error such as LEADER_NOT_AVAILABLE for a partition whose leader it has not yet learned -- for example right after topic creation or while leadership moves between brokers. rpk commands that resolve offsets via kadm then hard-fail on that transient condition, because kadm cannot route the request to a leader. Add RetryListOffsets, which runs a kadm offset-listing call and retries for a bounded time while the failure is composed only of retriable errors, whether surfaced as a top-level shard error or as a per-partition error in the response. (cherry picked from commit a456f4c)
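The retry policy described above can be sketched roughly as follows. This is not the actual RetryListOffsets (its signature is not shown in this PR): `retryListOffsets`, `partitionOffset`, `listFunc`, `isRetriable`, and the placeholder error are hypothetical stand-ins that use only the standard library rather than kadm types.

```go
package main

import (
	"context"
	"errors"
	"time"
)

// errLeaderNotAvailable is a placeholder for a retriable Kafka error.
var errLeaderNotAvailable = errors.New("LEADER_NOT_AVAILABLE")

// partitionOffset is a hypothetical per-partition listing result.
type partitionOffset struct {
	Partition int32
	Offset    int64
	Err       error
}

// listFunc is a hypothetical offset-listing call.
type listFunc func(ctx context.Context) ([]partitionOffset, error)

// isRetriable is a hypothetical classifier; a real one would cover every
// retriable Kafka error code.
func isRetriable(err error) bool { return errors.Is(err, errLeaderNotAvailable) }

// onlyRetriable reports whether the call failed and every failure, top-level
// or per-partition, is retriable.
func onlyRetriable(res []partitionOffset, err error) bool {
	if err != nil {
		return isRetriable(err)
	}
	failed := false
	for _, p := range res {
		if p.Err == nil {
			continue
		}
		if !isRetriable(p.Err) {
			return false // a non-retriable error ends the retry loop
		}
		failed = true
	}
	return failed
}

// retryListOffsets calls list until it succeeds, hits a non-retriable
// error, or the timeout elapses; it then returns the last result as-is.
func retryListOffsets(ctx context.Context, timeout, backoff time.Duration, list listFunc) ([]partitionOffset, error) {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()
	for {
		res, err := list(ctx)
		if !onlyRetriable(res, err) {
			return res, err
		}
		select {
		case <-ctx.Done():
			return res, err // budget exhausted: surface the transient error
		case <-time.After(backoff):
		}
	}
}

// exampleResolve runs the loop against a fake lister that reports
// LEADER_NOT_AVAILABLE twice before returning an offset.
func exampleResolve() (calls int, res []partitionOffset, err error) {
	list := func(context.Context) ([]partitionOffset, error) {
		calls++
		if calls <= 2 {
			return []partitionOffset{{Partition: 0, Err: errLeaderNotAvailable}}, nil
		}
		return []partitionOffset{{Partition: 0, Offset: 42}}, nil
	}
	res, err = retryListOffsets(context.Background(), time.Second, time.Millisecond, list)
	return calls, res, err
}
```

Returning the last result when the budget runs out keeps the original error visible to the caller, so a persistent failure is reported the same way it was before the retry was added.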
`consume -o start:end` resolves the bounded range by listing start/end offsets via kadm, which hard-failed if a partition transiently reported a retriable error (e.g. LEADER_NOT_AVAILABLE right after topic creation). Route the offset listing through RetryListOffsets so range resolution rides through the transient condition; this also surfaces per-partition offset errors during resolution. (cherry picked from commit 57b0ed6)
`rpk topic analyze` lists offsets across all of a topic's partitions; if any partition transiently reported a retriable error (e.g. LEADER_NOT_AVAILABLE shortly after the topic was created), the whole command failed. Route the offset listing through RetryListOffsets. (cherry picked from commit 1316a6c)
`rpk group seek` resolves its target offsets by listing them via kadm. If a partition transiently reported a retriable error (e.g. LEADER_NOT_AVAILABLE while leadership was still settling after a restart or movement), the seek failed outright. Route the offset listing through RetryListOffsets. (cherry picked from commit 349e9a0)
When printing the partition section, `rpk topic describe` listed whatever metadata snapshot it first got. A broker can briefly return a retriable error (e.g. LEADER_NOT_AVAILABLE) for a partition whose leader it has not yet learned -- right after topic creation or while leadership moves between nodes -- which made the partition show a load error and dropped it from offset listing, yielding an incomplete describe. Refetch metadata for a bounded time while any partition reports a retriable error, so the listing is complete. Non-retriable errors (e.g. recovery mode's policy_violation) are returned immediately, and the retry only applies when the partition section is requested. (cherry picked from commit c177e95)
When querying offset-for-leader-epoch against a freshly-created read replica, a broker can briefly report the partition's leader as unavailable while leadership propagates, so kcl returns no result. The check indexed [0] before its retry handling, raising IndexError instead of retrying. Treat an empty result as a transient condition and retry, matching the existing handling for a negative end offset. (cherry picked from commit 42ab146)
andrwng approved these changes on Jun 18, 2026
Backport of PR #30538
Conflict details
The following files were cherry-picked and may need regeneration:
These files were accepted as-is from the source branch. Before merging,
regenerate them on the target branch to ensure they're correct. For example:
go mod tidy