Introduce RetryOnFailure lifecycle management strategy #1281
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Conversation
Force-pushed from ab07aa4 to 3496a8c
I suggest we apply jitter for the retries to spread out the reconciliations
Force-pushed from c8eea39 to fa2498f
RetryOnFailure lifecycle management strategy
Force-pushed from fa2498f to d7540ec
After studying the state machine in depth, it turns out to be simpler to introduce both action strategies in the same PR.
```go
remediation := req.Object.GetActiveRemediation()
if remediation == nil || !remediation.RetriesExhausted(req.Object) {
	conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, "%s", conditions.GetMessage(req.Object, meta.ReadyCondition))
	return ErrMustRequeue
}
// Check if retries have exhausted after remediation for early
// stall condition detection.
if remediation != nil && remediation.RetriesExhausted(req.Object) {
```
This check is a tautology; it's simply the negation of the check right above, so it always evaluates to true at that point. I removed it in this separate commit: 079ae1b
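The redundancy follows from De Morgan's law: code past the first guard only runs when `remediation == nil || !remediation.RetriesExhausted(...)` is false, which is exactly `remediation != nil && remediation.RetriesExhausted(...)`. A minimal sketch, using a hypothetical stub type rather than the controller's API:

```go
package main

import "fmt"

// stubRemediation is a hypothetical stand-in for the controller's
// remediation type, used only to illustrate the control flow.
type stubRemediation struct{ exhausted bool }

func (r *stubRemediation) RetriesExhausted() bool { return r.exhausted }

// afterGuard reports whether execution gets past the first guard, and
// whether the second (redundant) check would pass at that point.
func afterGuard(remediation *stubRemediation) (reached, secondCheck bool) {
	if remediation == nil || !remediation.RetriesExhausted() {
		return false, false // guard returns early (ErrMustRequeue)
	}
	// By De Morgan's law, remediation != nil && remediation.RetriesExhausted()
	// is guaranteed here, so the second check can never be false.
	return true, remediation != nil && remediation.RetriesExhausted()
}

func main() {
	for _, r := range []*stubRemediation{nil, {exhausted: false}, {exhausted: true}} {
		reached, second := afterGuard(r)
		fmt.Println(reached, second)
	}
}
```

Whenever `reached` is true, `secondCheck` is also true, which is why the second `if` can be dropped.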
Signed-off-by: Matheus Pimenta <[email protected]>
Force-pushed from b6d1e3c to 2bb4699
```go
@@ -266,6 +278,14 @@ func (r *AtomicRelease) Reconcile(ctx context.Context, req *Request) error {
		// Run the action sub-reconciler.
		log.Info(fmt.Sprintf("running '%s' action with timeout of %s", next.Name(), timeoutForAction(next, req.Object).String()))
		if err = next.Reconcile(ctx, req); err != nil {
			if retry := req.Object.GetActiveRetry(); retry != nil {
				log.Error(err, fmt.Sprintf("failed to run '%s' action", next.Name()))
```
Here, this error was already recorded in a condition and sent in Kubernetes and notification-controller events, but no log line shows up because we handle the error here and return a different one, so I'm logging it at this point.
I tested this PR in the following scenarios:
In all the cases above, after fixing the issue in scenario 1, the controller behavior matches the expectations:
Force-pushed from 2bb4699 to 0fd558a
Signed-off-by: Matheus Pimenta <[email protected]>
Force-pushed from 0fd558a to c053726
Closes: #1278
TODO:
This PR implements the API enhancements described in #1278.
I tested this thoroughly, including the scenario of pods in `ImagePullBackOff` due to nonexistent image tags. When the tags are pushed, the HelmRelease object eventually converges without manual intervention.

The behavior of the CLI commands `flux reconcile hr` and `flux reconcile hr --force` is the same for a HelmRelease in the `failed` state: a release action is attempted. If the HelmRelease is Ready, i.e. in the `deployed` state, the behaviors of these two commands differ, but they match the behaviors for a HelmRelease object that is not using this feature: `flux reconcile hr` is a no-op, while `flux reconcile hr --force` performs a release action.

Because the retry strategies cause a time-based requeue (and do not end in a terminal state), running the `flux reconcile hr` command when the release is `failed`, with or without `--force`, results in an immediate release action and does not remove any in-flight `RequeueAfter` operations. In particular, if an in-flight `RequeueAfter` completes during a reconciliation, another reconciliation will run immediately after. I observed this many times because I used a very short `.retryInterval` to speed up my tests.

Another common behavior I noticed while testing the install strategy is that, when the source object (e.g. OCIRepository) is created together with the HelmRelease, two reconciliations are processed: the one produced by "requeue dependency" and the one triggered by the watch event from the source object becoming Ready. This results in the first retry happening immediately after the first attempt, i.e. in two consecutive failed installs.

It's also important to note that using the install strategy causes `.status.lastAttemptedReleaseAction` to transition from `install` to `upgrade`. So if this upgrade fails, the upgrade strategy configuration will be the one driving the behavior from then on; the install strategy configuration will no longer be at play. Therefore, to successfully achieve a HelmRelease that is automatically retried without any remediations, the spec must look like this: