Fix flaky node addition race condition when applying labels #22971

Open
medyagh wants to merge 1 commit into kubernetes:master from medyagh:addnode_flake
Conversation

@medyagh
Member

@medyagh medyagh commented May 11, 2026

Currently there is a 10% flake rate on KVM Linux.

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. labels May 11, 2026
@medyagh medyagh requested a review from Copilot May 11, 2026 22:11
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: medyagh

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label May 11, 2026
@medyagh
Member Author

medyagh commented May 11, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added the ok-to-test Indicates a non-member PR verified by an org member that is safe to test. label May 11, 2026
Contributor

Copilot AI left a comment

Pull request overview

This PR aims to reduce a reported ~10% flake rate on KVM Linux by addressing a race condition during node addition, where kubectl label / kubectl taint can fail transiently while the node is still registering with the API server.

Changes:

  • Wrap node labeling in an exponential-backoff retry loop with a per-attempt timeout.
  • Wrap secondary control-plane untainting in an exponential-backoff retry loop with a per-attempt timeout.

// example:
// sudo /var/lib/minikube/binaries/<version>/kubectl --kubeconfig=/var/lib/minikube/kubeconfig taint nodes test-357 node-role.kubernetes.io/control-plane:NoSchedule-
cmd := exec.CommandContext(ctx, "sudo", kubectlPath(cfg), fmt.Sprintf("--kubeconfig=%s", path.Join(vmpath.GuestPersistentDir, "kubeconfig")),
	"taint", "nodes", config.MachineName(cfg, n), "node-role.kubernetes.io/control-plane:NoSchedule-")

Comment on lines +1094 to +1095
if err := retry.Expo(labelNode, 100*time.Millisecond, 1*time.Minute); err != nil {
	return err

Comment on lines +1115 to +1116
if err := retry.Expo(untaintNode, 100*time.Millisecond, 1*time.Minute); err != nil {
	return err

@minikube-pr-bot
kvm2 driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22971 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 38.4s    │ 38.9s                  │
│ enable ingress │ 19.3s    │ 18.5s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 41.5s 37.2s 37.5s 37.8s 38.2s
Times for minikube (PR 22971) start: 40.9s 37.8s 38.0s 40.1s 37.8s

Times for minikube ingress: 18.8s 19.3s 19.3s 19.8s 19.4s
Times for minikube (PR 22971) ingress: 18.8s 19.3s 15.8s 18.8s 19.8s

docker driver with docker runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22971 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 22.2s    │ 22.7s                  │
│ enable ingress │ 11.9s    │ 11.9s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 22.0s 21.8s 22.1s 22.5s 22.8s
Times for minikube (PR 22971) start: 23.2s 22.6s 22.8s 22.7s 22.2s

Times for minikube ingress: 12.7s 12.2s 12.2s 12.7s 9.7s
Times for minikube (PR 22971) ingress: 9.7s 12.2s 12.2s 12.7s 12.7s

docker driver with containerd runtime

┌────────────────┬──────────┬────────────────────────┐
│    COMMAND     │ MINIKUBE │ MINIKUBE  ( PR 22971 ) │
├────────────────┼──────────┼────────────────────────┤
│ minikube start │ 20.0s    │ 20.0s                  │
│ enable ingress │ 24.5s    │ 24.6s                  │
└────────────────┴──────────┴────────────────────────┘
Details

Times for minikube start: 20.1s 21.0s 17.4s 20.9s 20.4s
Times for minikube (PR 22971) start: 18.3s 20.6s 19.7s 21.0s 20.6s

Times for minikube ingress: 24.7s 24.7s 25.2s 22.7s 25.2s
Times for minikube (PR 22971) ingress: 24.7s 24.7s 24.7s 24.7s 24.2s

@k8s-ci-robot
Contributor

@medyagh: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

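┌───────────────────────────────────────────┬─────────┬─────────┬──────────┬─────────────────────────────────────────────────┐
│ Test name                                 │ Commit  │ Details │ Required │ Rerun command                                   │
├───────────────────────────────────────────┼─────────┼─────────┼──────────┼─────────────────────────────────────────────────┤
│ pull-minikube-none-docker-linux-x86       │ f972c98 │ link    │ true     │ /test pull-minikube-none-docker-linux-x86       │
│ pull-minikube-docker-containerd-linux-arm │ f972c98 │ link    │ false    │ /test pull-minikube-docker-containerd-linux-arm │
│ pull-minikube-docker-crio-linux-x86       │ f972c98 │ link    │ false    │ /test pull-minikube-docker-crio-linux-x86       │
│ pull-minikube-docker-docker-linux-arm     │ f972c98 │ link    │ true     │ /test pull-minikube-docker-docker-linux-arm     │
│ pull-minikube-docker-docker-linux-x86     │ f972c98 │ link    │ true     │ /test pull-minikube-docker-docker-linux-x86     │
│ pull-minikube-kvm-docker-linux-x86        │ f972c98 │ link    │ true     │ /test pull-minikube-kvm-docker-linux-x86        │
│ pull-minikube-kvm-crio-linux-x86          │ f972c98 │ link    │ false    │ /test pull-minikube-kvm-crio-linux-x86          │
└───────────────────────────────────────────┴─────────┴─────────┴──────────┴─────────────────────────────────────────────────┘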

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
