@marcusramberg
Contributor

We're running local-provisioner to provide local storage for CI runners, where nodes come and go pretty frequently.
We observe that the provisioner tries to run cleanup on nodes that are already gone, which leaves helper pods
stuck in the Pending state because they cannot be scheduled.

This PR adds a check to see if the node still exists before trying to clean up the node.

@marcusramberg marcusramberg changed the title from "fix: don't try to clean up pvcs on nodes that are gone" to "fix: don't try to clean up pvs on nodes that are gone" Feb 27, 2025
@marcusramberg marcusramberg force-pushed the marcus/ephemeral_fix branch 4 times, most recently from 69c6989 to 1e1388b Compare March 6, 2025 09:07
@marcusramberg
Contributor Author

@derekbit Thoughts about this PR? We're running it in production from a fork now, and it has resolved our issue of stuck PVs from old nodes and stuck helper pods trying to schedule on non-existent nodes. I guess it would also address the issues you're seeing in #416 with stuck PVs from previous runs?

@github-actions

This PR is stale because it has been open 45 days with no activity. Remove stale label or comment or this will be closed in 10 days.

@github-actions github-actions bot added the stale label Apr 25, 2025
@derekbit derekbit removed the stale label Apr 25, 2025
@github-actions github-actions bot added the stale label Jun 10, 2025
@derekbit derekbit removed the stale label Jun 10, 2025
@github-actions github-actions bot added the stale label Jul 26, 2025
@derekbit derekbit removed the stale label Jul 27, 2025
provisioner.go Outdated
Comment on lines 487 to 493
    if _, err := p.kubeClient.CoreV1().Nodes().Get(context.TODO(), node, metav1.GetOptions{}); err != nil {
        logrus.Infof("Node %v does not exist, skipping cleanup of volume %v", node, pv.Name)
        return nil
    }
Member

@marcusramberg WDYT?

Suggested change
-    if _, err := p.kubeClient.CoreV1().Nodes().Get(context.TODO(), node, metav1.GetOptions{}); err != nil {
-        logrus.Infof("Node %v does not exist, skipping cleanup of volume %v", node, pv.Name)
-        return nil
-    }
+    if _, err := p.kubeClient.CoreV1().Nodes().Get(context.TODO(), node, metav1.GetOptions{}); err != nil && apierrors.IsNotFound(err) {
+        logrus.Infof("Node %v does not exist, skipping cleanup of volume %v", node, pv.Name)
+        return nil
+    }

Contributor Author

@marcusramberg marcusramberg Jul 31, 2025

That seems reasonable to me; I'll update the PR. Note that the package is imported in there as k8serror, though.

@marcusramberg marcusramberg force-pushed the marcus/ephemeral_fix branch from 1e1388b to bdf05c2 Compare July 31, 2025 07:22
@marcusramberg marcusramberg requested a review from derekbit July 31, 2025 14:34
Member

@derekbit derekbit left a comment

LGTM @marcusramberg. Thanks for your contribution. The improvement will be included in v0.0.33, which is scheduled for October.

@derekbit derekbit merged commit e703098 into rancher:master Jul 31, 2025
3 checks passed
2 participants