What happened:
The disaggregatedset e2e test "Rolling Update with Coordinated Drain / should complete rolling update with both roles scaling together" (disaggregatedset/test/e2e/e2e_test.go:213) is flaky and intermittently fails with:
[FAILED] Revision <newRev> has prefill replicas but no decode (orphaned)
Example: https://prow.k8s.io/view/gs/kubernetes-ci-logs/pr-logs/pull/kubernetes-sigs_lws/752/pull-lws-test-e2e-main-1-34/2049099489146310656
Root cause:
kubectl.ForSingleActiveRevision (disaggregatedset/test/utils/kubectl/waiters.go:68) only checks that exactly one revision has total spec.replicas > 0. It can return during an intermediate state where the new revision has only one role scaled up:
- old revision: prefill=0, decode=0 (drained)
- new revision: prefill=1, decode=0 (decode scale-up not yet applied)
The next assertion then sees prefill>0, decode=0 and reports the new revision as orphaned.
Suggested fix:
In ForSingleActiveRevision, also require that every role of the active revision has replicas > 0 before returning.
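A minimal sketch of the difference between the two conditions, assuming revision state can be modeled as a map from revision name to per-role replica counts. The `roleReplicas` type and both function names are hypothetical and are not taken from waiters.go; the real waiter reads this state from the cluster.

```go
package main

import "fmt"

// roleReplicas maps a role name (e.g. "prefill", "decode") to its spec.replicas.
// Hypothetical type for illustration; not the repository's actual data model.
type roleReplicas map[string]int

// singleActiveByTotal models the check described under "Root cause":
// exactly one revision has a total replica count above zero.
func singleActiveByTotal(revisions map[string]roleReplicas) bool {
	active := 0
	for _, roles := range revisions {
		total := 0
		for _, n := range roles {
			total += n
		}
		if total > 0 {
			active++
		}
	}
	return active == 1
}

// singleActiveAllRoles models the suggested stricter check: exactly one
// revision is active, and every role of that revision has replicas > 0.
func singleActiveAllRoles(revisions map[string]roleReplicas) bool {
	active := 0
	complete := false
	for _, roles := range revisions {
		total := 0
		allUp := len(roles) > 0
		for _, n := range roles {
			total += n
			if n <= 0 {
				allUp = false
			}
		}
		if total > 0 {
			active++
			complete = allUp
		}
	}
	return active == 1 && complete
}

func main() {
	// Intermediate state from the failing run: old revision drained,
	// new revision has prefill scaled up but decode still at zero.
	intermediate := map[string]roleReplicas{
		"old": {"prefill": 0, "decode": 0},
		"new": {"prefill": 1, "decode": 0},
	}
	fmt.Println(singleActiveByTotal(intermediate))  // true: waiter returns too early
	fmt.Println(singleActiveAllRoles(intermediate)) // false: waiter keeps polling
}
```

With the stricter condition the waiter keeps polling through the intermediate state and only returns once decode is also scaled up, so the orphan assertion that follows no longer races the controller.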