Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: drain and volume detachment status conditions #1876

Open
wants to merge 3 commits into
base: main
Choose a base branch
from

Conversation

jmdeal
Copy link
Member

@jmdeal jmdeal commented Dec 11, 2024

Fixes #N/A

Description
Adds status conditions for node drain and volume detachment to improve observability for the individual termination stages. This is a scoped down version of #1837, which takes these changes along with splitting each termination stage into a separate controller. I will continue to work on that refactor, but I'm decoupling to work on higher priority work.

Status Conditions:

Condition Unknown False True
Drained Karpenter hasn't attempted to drain the Node yet or the node is currently draining (marked with reason) N/A Karpenter has successfully drained the node and will proceed with the termination flow.
VolumesDetached Karpenter hasn't checked for volume attachments yet or there are volume attachments which are currently blocking termination. This won't transition out of unknown until Drained transitions to true. There are blocking volume attachments, but the nodeclaim's TerminationGracePeriod has elapsed. This state is only possible if TGP is configured. All blocking volume attachment objects have been deleted.

How was this change tested?
make presubmit

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Dec 11, 2024
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Dec 11, 2024
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 4bb4d97 to fb3ac47 Compare December 11, 2024 20:31
@coveralls
Copy link

coveralls commented Dec 11, 2024

Pull Request Test Coverage Report for Build 13248558350

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 82 of 136 (60.29%) changed or added relevant lines in 4 files are covered.
  • 14 unchanged lines in 3 files lost coverage.
  • Overall coverage decreased (-0.3%) to 81.029%

Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/utils/node/node.go 0 5 0.0%
pkg/controllers/node/termination/controller.go 71 120 59.17%
Files with Coverage Reduction New Missed Lines %
pkg/controllers/node/termination/controller.go 3 64.06%
pkg/controllers/disruption/consolidation.go 4 88.55%
pkg/controllers/provisioning/scheduling/preferences.go 7 86.52%
Totals Coverage Status
Change from base Build 13211152753: -0.3%
Covered Lines: 9179
Relevant Lines: 11328

💛 - Coveralls

@engedaam
Copy link
Contributor

/assign @engedaam

Copy link

github-actions bot commented Jan 2, 2025

This PR has been inactive for 14 days. StaleBot will close this stale PR after 14 more days of inactivity.

@github-actions github-actions bot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 2, 2025
@jmdeal
Copy link
Member Author

jmdeal commented Jan 11, 2025

/remove-lifecycle stale

@k8s-ci-robot k8s-ci-robot removed the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jan 11, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from b527992 to 21176e1 Compare January 15, 2025 20:02
@jmdeal
Copy link
Member Author

jmdeal commented Jan 15, 2025

/hold

@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 15, 2025
@engedaam
Copy link
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
Copy link
Contributor

@engedaam engedaam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: engedaam, jmdeal
Once this PR has been reviewed and has the lgtm label, please assign jonathan-innis for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 15, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from d536a96 to 43949ef Compare January 16, 2025 17:21
Copy link
Contributor

@engedaam engedaam left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 16, 2025
})
if !nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained).IsTrue() {
stored := nodeClaim.DeepCopy()
_ = nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we use the modified check here as well? Would probably work just as well as the IsTrue check

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't because we knew it always would be and this limited the scope of stored, I can go either way though. The alternative would be:

stored := nodeClaim.DeepCopy()
if nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained) {
  // remaining logic
}

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't really think it tangibly matters -- good either way, just thought it was odd that we didn't keep consistent style here

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe throw a comment over this to explain why it's coded this way? I found myself guessing about it again as I was reading through it -- had to re-read the context to understand that this is because of the patch

// 404 = the nodeClaim no longer exists
if errors.IsNotFound(err) {
continue
if volumesDetached {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a way to cleanup some of this logic so that it's not so nested -- comments might also help too -- the fact that we fall-through when volumes are detached and are able to get to the bottom of the function (we don't requeue) is a bit confusing IMO

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've messed around with a few different ways of structuring this, but this ends up being the most readable IMO. I will add some comments to call out the fall-through behavior. FWIW, this does clean up in my larger refactor PR.

@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 43949ef to 4ed938c Compare January 27, 2025 05:33
@k8s-ci-robot
Copy link
Contributor

New changes are detected. LGTM label has been removed.

@k8s-ci-robot k8s-ci-robot removed the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Jan 27, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 4ed938c to 015140d Compare February 10, 2025 19:05
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Feb 10, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch 2 times, most recently from f4ba729 to 233f43a Compare February 10, 2025 19:15
@jmdeal
Copy link
Member Author

jmdeal commented Feb 10, 2025

/hold cancel

@k8s-ci-robot k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Feb 10, 2025
@jmdeal jmdeal force-pushed the feat/termination-conditions branch from 233f43a to 1ddb121 Compare February 11, 2025 01:39
Copy link
Member

@jonathan-innis jonathan-innis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close to me! Just a few small things! Nice work!

@@ -64,6 +64,7 @@ func IgnoreNodeClaimNotFoundError(err error) error {
// DuplicateNodeClaimError is an error returned when multiple v1.NodeClaims are found matching the passed providerID
type DuplicateNodeClaimError struct {
ProviderID string
NodeClaims []string
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
NodeClaims []string
NodeClaimNames []string

@@ -60,6 +60,7 @@ func TestAPIs(t *testing.T) {
}

var _ = BeforeSuite(func() {
fakeClock = clock.NewFakeClock(time.Now())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why did you add the fakeClock here -- doesn't look like it is being used anywhere in this testing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's passed down to the terminator and was causing nil panics. We could probably scope it down and just pass it there, but I don't mind keeping it test global so it's obvious to use the fake clock passed to the terminator rather than a new one in any future tests that use it.

})
if !nodeClaim.StatusConditions().Get(v1.ConditionTypeDrained).IsTrue() {
stored := nodeClaim.DeepCopy()
_ = nodeClaim.StatusConditions().SetTrue(v1.ConditionTypeDrained)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe throw a comment over this to explain why it's coded this way? I found myself guessing about it again as I was reading through it -- had to re-read the context to understand that this is because of the patch

// 404 = the nodeClaim no longer exists
if errors.IsNotFound(err) {
continue
if len(pendingVolumeAttachments) == 0 {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given how large this volume attachment code is now, what do you think about moving this into a separate function?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been resistant to that because I find the requeue logic to be significantly harder to follow as soon as you move it into a separate function - I had a similar issue when I tried to structure this as subreconcilers. I had been punting those architectural changes for my larger refactor PR. If you don't feel strongly about it I'd rather punt for that, since I think the longer reconcile function is a better than obfuscating the control flow, but let me know what you think.

if errors.IsConflict(err) {
return reconcile.Result{Requeue: true}, nil
}
return reconcile.Result{}, fmt.Errorf("ensuring instance termination, %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't seem to handle the NotFound error that we get back from this EnsureTerminated call in the same way as we do elsewhere here

@@ -908,3 +957,13 @@ func ExpectNodeWithNodeClaimDraining(c client.Client, nodeName string) *corev1.N
Expect(node.DeletionTimestamp).ToNot(BeNil())
return node
}

func ExpectNotRequeued(result reconcile.Result) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These are two functions that I think could be generally useful and I would consider putting in operatorpkg

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call, I've opened a PR: awslabs/operatorpkg#130

@k8s-ci-robot
Copy link
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Feb 21, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants