Introduce RetryOnFailure lifecycle management strategy #1281

Open · matheuscscp wants to merge 2 commits into main from upgrade-retry-on-failure

Conversation

@matheuscscp (Member) commented Aug 8, 2025

Closes: #1278

TODO:

  • Automated tests.

This PR implements the API enhancements described in #1278.

I tested this thoroughly, including the scenario of pods stuck in ImagePullBackOff due to nonexistent image tags. When the tags are pushed, the HelmRelease object eventually converges without manual intervention.

For a HelmRelease in the failed state, the CLI commands flux reconcile hr and flux reconcile hr --force behave the same: a release action is attempted. If the HelmRelease is Ready, i.e. in the deployed state, the two commands behave differently, but they match the behavior of a HelmRelease object that is not using this feature: flux reconcile hr is a no-op, while flux reconcile hr --force performs a release action.

Because the retry strategies cause a time-based requeue (and never end in a terminal state), running flux reconcile hr when the release is failed, with or without --force, results in an immediate release action and does not cancel any in-flight RequeueAfter operations. In particular, if an in-flight RequeueAfter completes during a reconciliation, another reconciliation runs immediately afterwards. I observed this many times because I used a very short .retryInterval to speed up my tests.
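
To picture the mechanics, here is a minimal controller-runtime sketch (the reconciler and helper names are hypothetical, not helm-controller code): returning a RequeueAfter with a nil error schedules a fixed-interval retry, whereas returning the error would put the object on the exponential backoff queue.

package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler illustrates the time-based requeue described above.
type exampleReconciler struct {
	retryInterval time.Duration // corresponds to .retryInterval in the spec
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := runReleaseAction(ctx); err != nil {
		// Requeue after the configured interval instead of returning the
		// error, so retries happen at a fixed cadence rather than with
		// controller-runtime's exponential backoff.
		return ctrl.Result{RequeueAfter: r.retryInterval}, nil
	}
	return ctrl.Result{}, nil
}

// runReleaseAction stands in for the Helm install/upgrade action.
func runReleaseAction(ctx context.Context) error { return nil }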

Another behavior I noticed while testing the install strategy is that, when the source object (e.g. an OCIRepository) is created together with the HelmRelease, two reconciliations are processed: one produced by the "requeue dependency" mechanism, and one triggered by the watch event of the source object becoming Ready. As a result, the first retry happens immediately after the first attempt, i.e. there are two consecutive failed installs.

It's also important to note that using the install strategy causes .status.lastAttemptedReleaseAction to transition from install to upgrade. So if that upgrade fails, the upgrade strategy configuration takes over; the install strategy configuration is no longer at play. Therefore, to achieve a HelmRelease that is automatically retried without any remediations, the spec must look like this:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  install:
    strategy: # the install strategy will cause the first retry, which will be an upgrade (the first one)
      name: RetryOnFailure
  upgrade:
    strategy: # the upgrade strategy will cause the subsequent retries, which will all be upgrades
      name: RetryOnFailure

@matheuscscp added the enhancement and area/ux labels on Aug 8, 2025
@matheuscscp force-pushed the upgrade-retry-on-failure branch 4 times, most recently from ab07aa4 to 3496a8c on August 9, 2025 01:41
@stefanprodan (Member) left a comment

I suggest we apply jitter to the retries to spread out the reconciliations.
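
As a sketch of what that could look like (a hypothetical helper, not actual controller code), the retry interval could be perturbed by a bounded random fraction:

package example

import (
	"math/rand"
	"time"
)

// jitteredInterval perturbs the base retry interval by up to +/-10% so that
// many objects failing at the same time do not all retry in lockstep.
func jitteredInterval(base time.Duration) time.Duration {
	frac := (rand.Float64()*2 - 1) * 0.1 // uniform in [-0.1, +0.1]
	return base + time.Duration(float64(base)*frac)
}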

@matheuscscp force-pushed the upgrade-retry-on-failure branch 2 times, most recently from c8eea39 to fa2498f on August 9, 2025 14:29
@matheuscscp changed the title from "upgrade: Add support for retrying automatically without remediations" to "Introduce RetryOnFailure lifecycle management strategy" on Aug 9, 2025
@matheuscscp force-pushed the upgrade-retry-on-failure branch from fa2498f to d7540ec on August 9, 2025 21:33
@matheuscscp (Member, Author) left a comment

After digging deeply into the state machine, it turned out to be simpler to introduce both action strategies in the same PR.

	remediation := req.Object.GetActiveRemediation()
	if remediation == nil || !remediation.RetriesExhausted(req.Object) {
		conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, "%s", conditions.GetMessage(req.Object, meta.ReadyCondition))
		return ErrMustRequeue
	}
	// Check if retries have exhausted after remediation for early
	// stall condition detection.
	if remediation != nil && remediation.RetriesExhausted(req.Object) {
@matheuscscp (Member, Author) commented:

This check is a tautology: it's essentially the negation of the check right above. I removed it in this separate commit: 079ae1b
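
Reconstructed from the snippet above (a sketch of the resulting flow, not the literal diff in 079ae1b), the simplified code reads:

	remediation := req.Object.GetActiveRemediation()
	if remediation == nil || !remediation.RetriesExhausted(req.Object) {
		conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, "%s", conditions.GetMessage(req.Object, meta.ReadyCondition))
		return ErrMustRequeue
	}
	// Past this point remediation is non-nil and its retries are exhausted,
	// so the early stall-detection body can run without re-checking.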

Signed-off-by: Matheus Pimenta <[email protected]>
@matheuscscp force-pushed the upgrade-retry-on-failure branch 6 times, most recently from b6d1e3c to 2bb4699 on August 20, 2025 15:24
@@ -266,6 +278,14 @@ func (r *AtomicRelease) Reconcile(ctx context.Context, req *Request) error {
	// Run the action sub-reconciler.
	log.Info(fmt.Sprintf("running '%s' action with timeout of %s", next.Name(), timeoutForAction(next, req.Object).String()))
	if err = next.Reconcile(ctx, req); err != nil {
		if retry := req.Object.GetActiveRetry(); retry != nil {
			log.Error(err, fmt.Sprintf("failed to run '%s' action", next.Name()))
@matheuscscp (Member, Author) commented:

Here, the error was already recorded in a status condition and sent in Kubernetes and notification-controller events, but no log shows up because we handle the error here and return a different one. So I'm logging it at this point.

@matheuscscp (Member, Author) commented Aug 20, 2025

I tested this PR in the following scenarios:

  1. Chart build fails due to a templating error. In this case, the HelmReleaseReconciler was returning the error to controller-runtime, resulting in retries with exponential backoff. To fix this I applied this diff.
  2. Build works, but YAML apply fails. I tested this by making the chart apply a resource in a namespace it didn't have RBAC for.
  3. Apply succeeds but Deployment fails due to image pull error.
  4. Apply succeeds and Deployment becomes Ready, but Helm tests fail.
  5. Install succeeds, then I delete the Helm storage secret. This works as expected: a Helm install is performed in the next reconciliation and succeeds.

In all the cases above, after fixing the issue in scenario 1, the controller behavior matched expectations:

  • Except for scenario 5, the release is retried with an upgrade after the respective .retryInterval.
  • A status condition is updated with the error.
  • A log is emitted with the error (after applying this diff).
  • Kubernetes and notification-controller events are sent with the error.

@matheuscscp force-pushed the upgrade-retry-on-failure branch from 2bb4699 to 0fd558a on August 20, 2025 16:55
@matheuscscp force-pushed the upgrade-retry-on-failure branch from 0fd558a to c053726 on August 21, 2025 12:31