Fix flaky node addition race condition when applying labels #22971
medyagh wants to merge 1 commit into
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: medyagh
/ok-to-test
Pull request overview
This PR aims to reduce a reported ~10% flake rate on KVM Linux by addressing a race condition during node addition, where kubectl label / kubectl taint can fail transiently while the node is still registering with the API server.
Changes:
- Wrap node labeling in an exponential-backoff retry loop with a per-attempt timeout.
- Wrap secondary control-plane untainting in an exponential-backoff retry loop with a per-attempt timeout.
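The retry behaviour described in the change list can be sketched with the standard library alone. This is a minimal illustration, not the PR's code: `retryExpo` is a hypothetical stand-in for a helper such as `retry.Expo`, `simulateLabel` fakes a node that only becomes visible after a few attempts, and the error text and durations are assumptions chosen to keep the example fast.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// retryExpo is a hypothetical helper (not minikube's retry.Expo): it calls fn
// until it succeeds, doubling the wait between attempts, and gives up when the
// next wait would run past maxElapsed.
func retryExpo(fn func() error, initial, maxElapsed time.Duration) error {
	deadline := time.Now().Add(maxElapsed)
	delay := initial
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().Add(delay).After(deadline) {
			return fmt.Errorf("giving up: %w", err)
		}
		time.Sleep(delay)
		delay *= 2
	}
}

// simulateLabel fakes a labeling call that fails until the readyOn-th attempt,
// mimicking a node that is still registering. It returns the attempts made.
func simulateLabel(readyOn int) (int, error) {
	attempts := 0
	labelNode := func() error {
		attempts++
		if attempts < readyOn {
			return errors.New("node not registered yet") // assumed error text
		}
		return nil
	}
	err := retryExpo(labelNode, 10*time.Millisecond, 200*time.Millisecond)
	return attempts, err
}

func main() {
	attempts, err := simulateLabel(3)
	fmt.Println(attempts, err) // prints "3 <nil>"
}
```

A transient failure costs only a few short sleeps here, while a node that never appears still produces an error once the overall budget is spent.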
// example:
// sudo /var/lib/minikube/binaries/<version>/kubectl --kubeconfig=/var/lib/minikube/kubeconfig taint nodes test-357 node-role.kubernetes.io/control-plane:NoSchedule-
cmd := exec.CommandContext(ctx, "sudo", kubectlPath(cfg), fmt.Sprintf("--kubeconfig=%s", path.Join(vmpath.GuestPersistentDir, "kubeconfig")),
	"taint", "nodes", config.MachineName(cfg, n), "node-role.kubernetes.io/control-plane:NoSchedule-")
if err := retry.Expo(labelNode, 100*time.Millisecond, 1*time.Minute); err != nil {
	return err

if err := retry.Expo(untaintNode, 100*time.Millisecond, 1*time.Minute); err != nil {
	return err
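The per-attempt timeout mentioned in the review overview is not visible in these hunks, so the following is only a sketch of one way such a closure could be built. `withAttemptTimeout`, `newFlakyOp`, and `runDemo` are hypothetical names, and the fake operation stands in for a command created with `exec.CommandContext(ctx, ...)`; how the PR actually derives `ctx` is an assumption.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// withAttemptTimeout wraps op so that every call gets a fresh deadline.
// The returned func() error has the shape a retry helper would accept.
func withAttemptTimeout(parent context.Context, perAttempt time.Duration, op func(context.Context) error) func() error {
	return func() error {
		ctx, cancel := context.WithTimeout(parent, perAttempt)
		defer cancel()
		return op(ctx)
	}
}

// newFlakyOp returns a fake operation that hangs on its first call until the
// context is cancelled, then succeeds on later calls. The counter records
// how many times it ran.
func newFlakyOp() (func(context.Context) error, *int) {
	calls := 0
	op := func(ctx context.Context) error {
		calls++
		if calls == 1 {
			<-ctx.Done() // simulate a hung command; only the deadline frees it
			return ctx.Err()
		}
		return nil
	}
	return op, &calls
}

// runDemo makes two attempts and reports the call count, whether the first
// attempt ended with a deadline error, and the second attempt's error.
func runDemo() (calls int, firstTimedOut bool, secondErr error) {
	op, n := newFlakyOp()
	attempt := withAttemptTimeout(context.Background(), 20*time.Millisecond, op)
	first := attempt()
	second := attempt()
	return *n, errors.Is(first, context.DeadlineExceeded), second
}

func main() {
	fmt.Println(runDemo()) // prints "2 true <nil>"
}
```

Giving each attempt its own deadline means a single hung call cannot consume the whole retry budget; the next attempt starts with a clean context.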
kvm2 driver with docker runtime
Times for minikube start: 41.5s 37.2s 37.5s 37.8s 38.2s
Times for minikube ingress: 18.8s 19.3s 19.3s 19.8s 19.4s

docker driver with docker runtime
Times for minikube start: 22.0s 21.8s 22.1s 22.5s 22.8s
Times for minikube (PR 22971) ingress: 9.7s 12.2s 12.2s 12.7s 12.7s

docker driver with containerd runtime
Times for minikube start: 20.1s 21.0s 17.4s 20.9s 20.4s
Times for minikube ingress: 24.7s 24.7s 25.2s 22.7s 25.2s
@medyagh: The following tests failed, say
Node addition currently has a 10% flake rate on KVM Linux.