[v26.1.x] kafka: make metadata return term 0 when missing info #30781

Merged
andrwng merged 8 commits into redpanda-data:v26.1.x from vbotbuildovich:ai-backport-pr-30538-v26.1.x-1781249226
Jun 18, 2026

@vbotbuildovich

Backport of PR #30538

Conflict details

  • aad6922 (tests/go/kgo-verifier/go.mod): franz-go version bump conflicted with the target branch's existing pin; took the incoming versions (franz-go v1.21.2, kadm v1.17.2, kmsg v1.13.1) to apply the commit's intent of pulling in the metadata-retry fix.
  • aad6922 (tests/go/kgo-verifier/go.sum): accepted theirs (generated lockfile) to match the bumped go.mod.

⚠️ Generated files

The following files were cherry-picked and may need regeneration:

  • tests/go/kgo-verifier/go.sum

These files were accepted as-is from the source branch. Before merging,
regenerate them on the target branch to ensure they're correct. For example:

  • go.sum: run go mod tidy

andrwng added 8 commits June 12, 2026 07:28
Brings in the franz-go fix that retries metadata when the first
metadata load reports per-partition errors. The metadata handler
change in this PR returns leader_not_available for a partition whose
term is not yet known (e.g. just after topic creation); without this
fix franz-go producers ignore that error on the first load and stall
until the metadata max age.

https://github.com/twmb/franz-go/blob/8268a5d078c01d29ca0daa1748fac264e0fc2f11/pkg/kgo/metadata.go#L1011
(cherry picked from commit aad6922)

This is a functional revert of
redpanda-data@a441290,
making the handler return term_id(0) instead of -1 when there is no term
metadata for a given partition. The Java client, for instance, would
treat -1 as a signal that this broker doesn't have reliable metadata,
and would drop cached epochs as a result[1], which interferes with
truncation detection from KIP-320.

Note that returning -1 was previously done as a fix for flakiness of
PartitionBalancerTest.

This commit also updates the metadata handler to return an error:
- In the Java client, the term_id(0) is still used, but the error is
  treated as a signal to request another update immediately[2].
- franz-go skips processing the partition altogether if there is an
  error, regardless of whether the term is missing, opting to retry
  later[3].

Note that the combination of returning -1 and the lack of error handling
is the cause of flakiness of WriteCachingFailureInjectionE2ETest.

With this commit, both PartitionBalancerTest and
WriteCachingFailureInjectionE2ETest pass reliably.

Relevant Redpanda commits in reverse chronological order:
- redpanda-data@a441290
- redpanda-data@7a60b75
- redpanda-data@86d583b

1. https://github.com/apache/kafka/blob/9529003fffd93a9d7e3f6ff7ab081ed84942fd13/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L596-L601
2. https://github.com/apache/kafka/blob/9529003fffd93a9d7e3f6ff7ab081ed84942fd13/clients/src/main/java/org/apache/kafka/clients/Metadata.java#L519-L528
3. https://github.com/twmb/franz-go/blob/8268a5d078c01d29ca0daa1748fac264e0fc2f11/pkg/kgo/metadata.go#L1011

(cherry picked from commit 1e1cdfe)
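As a rough illustration of the behavior described in this commit message, the Go sketch below models the handler's decision for a single partition. The real handler lives in Redpanda's C++ code and is not reproduced here; the type `partitionMetadata` and the function `buildPartitionMetadata` are hypothetical names invented for this sketch. The only value taken from outside the commit message is the Kafka protocol error code 5 for LEADER_NOT_AVAILABLE.

```go
package main

import "fmt"

// Kafka protocol error codes used in this sketch.
const (
	errNone               int16 = 0
	errLeaderNotAvailable int16 = 5
)

// partitionMetadata is a hypothetical, trimmed-down view of the
// per-partition fields in a metadata response.
type partitionMetadata struct {
	ErrorCode   int16
	LeaderEpoch int32
}

// buildPartitionMetadata is a hypothetical helper. A nil term means the
// broker has no term information for the partition yet.
func buildPartitionMetadata(term *int32) partitionMetadata {
	if term == nil {
		// Report epoch 0 rather than -1 so clients keep cached epochs,
		// and attach an error so they know to refresh metadata.
		return partitionMetadata{ErrorCode: errLeaderNotAvailable, LeaderEpoch: 0}
	}
	return partitionMetadata{ErrorCode: errNone, LeaderEpoch: *term}
}

func main() {
	known := int32(7)
	fmt.Printf("%+v\n", buildPartitionMetadata(nil))
	fmt.Printf("%+v\n", buildPartitionMetadata(&known))
}
```

The point of pairing epoch 0 with an error is that a client never sees a "clean" partition whose epoch is a placeholder: either the term is real, or the error tells the client to ask again.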
A broker can briefly return a retriable error such as LEADER_NOT_AVAILABLE
for a partition whose leader it has not yet learned -- for example right
after topic creation or while leadership moves between brokers. rpk
commands that resolve offsets via kadm then hard-fail on that transient
condition, because kadm cannot route the request to a leader.

Add RetryListOffsets, which runs a kadm offset-listing call and retries for
a bounded time while the failure is composed only of retriable errors,
whether surfaced as a top-level shard error or as a per-partition error in
the response.

(cherry picked from commit a456f4c)
`consume -o start:end` resolves the bounded range by listing start/end
offsets via kadm, which hard-failed if a partition transiently reported a
retriable error (e.g. LEADER_NOT_AVAILABLE right after topic creation).
Route the offset listing through RetryListOffsets so range resolution
rides through the transient condition; this also surfaces per-partition
offset errors during resolution.

(cherry picked from commit 57b0ed6)

`rpk topic analyze` lists offsets across all of a topic's partitions; if
any partition transiently reported a retriable error (e.g.
LEADER_NOT_AVAILABLE shortly after the topic was created), the whole
command failed. Route the offset listing through RetryListOffsets.

(cherry picked from commit 1316a6c)

`rpk group seek` resolves its target offsets by listing them via kadm. If
a partition transiently reported a retriable error (e.g.
LEADER_NOT_AVAILABLE while leadership was still settling after a restart or
movement), the seek failed outright. Route the offset listing through
RetryListOffsets.

(cherry picked from commit 349e9a0)

When printing the partition section, `rpk topic describe` listed whatever
metadata snapshot it first got. A broker can briefly return a retriable
error (e.g. LEADER_NOT_AVAILABLE) for a partition whose leader it has not
yet learned -- right after topic creation or while leadership moves
between nodes -- which made the partition show a load error and dropped it
from offset listing, yielding an incomplete describe.

Refetch metadata for a bounded time while any partition reports a
retriable error, so the listing is complete. Non-retriable errors (e.g.
recovery mode's policy_violation) are returned immediately, and the retry
only applies when the partition section is requested.

(cherry picked from commit c177e95)
When querying offset-for-leader-epoch against a freshly-created read
replica, a broker can briefly report the partition's leader as
unavailable while leadership propagates, so kcl returns no result. The
check indexed [0] before its retry handling, raising IndexError instead
of retrying. Treat an empty result as a transient condition and retry,
matching the existing handling for a negative end offset.

(cherry picked from commit 42ab146)

vbotbuildovich added this to the v26.1.x-next milestone Jun 12, 2026
vbotbuildovich added the kind/backport label (PRs targeting a stable branch) Jun 12, 2026
vbotbuildovich requested review from a team and kbatuigas as code owners June 12, 2026 07:28
vbotbuildovich requested a review from andrwng June 12, 2026 07:28
andrwng merged commit 30e3f8f into redpanda-data:v26.1.x Jun 18, 2026
29 checks passed