Introduce RetryOnFailure lifecycle management strategy #1281

Open · matheuscscp wants to merge 2 commits into main from upgrade-retry-on-failure

Conversation

@matheuscscp (Member) commented Aug 8, 2025

Closes: #1278

TODO:

  • Automated tests.

This PR implements the API enhancements described in #1278.

I tested this thoroughly, including the scenario of pods stuck in ImagePullBackOff due to nonexistent image tags. When the tags are pushed, the HelmRelease object eventually converges without manual intervention.

For a HelmRelease in the failed state, the CLI commands flux reconcile hr and flux reconcile hr --force behave the same: a release action is attempted. If the HelmRelease is Ready, i.e. in the deployed state, the two commands behave differently, but they match the behavior of a HelmRelease object that is not using this feature: flux reconcile hr is a no-op, while flux reconcile hr --force performs a release action.

Because the retry strategies cause a time-based requeue (and never end in a terminal state), running flux reconcile hr when the release is failed, with or without --force, results in an immediate release action and does not cancel any in-flight RequeueAfter operations. In particular, if an in-flight RequeueAfter completes during a reconciliation, another reconciliation runs immediately afterwards. I observed this many times because I used a very short .retryInterval to speed up my tests.
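
To picture the mechanics, here is a minimal controller-runtime sketch (the reconciler and helper names are hypothetical, not helm-controller code): returning a RequeueAfter with a nil error schedules a fixed-interval retry, whereas returning the error would put the object on the exponential backoff queue.

package example

import (
	"context"
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
)

// exampleReconciler illustrates the time-based requeue described above.
type exampleReconciler struct {
	retryInterval time.Duration // corresponds to .retryInterval in the spec
}

func (r *exampleReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	if err := runReleaseAction(ctx); err != nil {
		// Requeue after the configured interval instead of returning the
		// error, so retries happen at a fixed cadence rather than with
		// controller-runtime's exponential backoff.
		return ctrl.Result{RequeueAfter: r.retryInterval}, nil
	}
	return ctrl.Result{}, nil
}

// runReleaseAction stands in for the Helm install/upgrade action.
func runReleaseAction(ctx context.Context) error { return nil }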

Another behavior I noticed while testing the install strategy is that, when the source object (e.g. an OCIRepository) is created together with the HelmRelease, two reconciliations are processed: one produced by the "requeue dependency" mechanism, and one triggered by the watch event of the source object becoming Ready. As a result, the first retry happens immediately after the first attempt, i.e. there are two consecutive failed installs.

It's also important to note that using the install strategy causes .status.lastAttemptedReleaseAction to transition from install to upgrade. So if that upgrade fails, the upgrade strategy configuration takes over; the install strategy configuration is no longer at play. Therefore, to achieve a HelmRelease that is automatically retried without any remediations, the spec must look like this:

apiVersion: helm.toolkit.fluxcd.io/v2
kind: HelmRelease
spec:
  install:
    strategy: # the install strategy will cause the first retry, which will be an upgrade (the first one)
      name: RetryOnFailure
  upgrade:
    strategy: # the upgrade strategy will cause the subsequent retries, which will all be upgrades
      name: RetryOnFailure

@matheuscscp added the enhancement and area/ux labels on Aug 8, 2025
@matheuscscp force-pushed the upgrade-retry-on-failure branch 4 times, most recently from ab07aa4 to 3496a8c on August 9, 2025 01:41
@stefanprodan (Member) left a comment

I suggest we apply jitter to the retries to spread out the reconciliations.
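
As a sketch of what that could look like (a hypothetical helper, not actual controller code), the retry interval could be perturbed by a bounded random fraction:

package example

import (
	"math/rand"
	"time"
)

// jitteredInterval perturbs the base retry interval by up to +/-10% so that
// many objects failing at the same time do not all retry in lockstep.
func jitteredInterval(base time.Duration) time.Duration {
	frac := (rand.Float64()*2 - 1) * 0.1 // uniform in [-0.1, +0.1]
	return base + time.Duration(float64(base)*frac)
}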

@matheuscscp force-pushed the upgrade-retry-on-failure branch 2 times, most recently from c8eea39 to fa2498f on August 9, 2025 14:29
@matheuscscp changed the title from "upgrade: Add support for retrying automatically without remediations" to "Introduce RetryOnFailure lifecycle management strategy" on Aug 9, 2025
@matheuscscp force-pushed the upgrade-retry-on-failure branch from fa2498f to d7540ec on August 9, 2025 21:33
@matheuscscp (Member, Author) left a comment

After digging deeply into the state machine, it turned out to be simpler to introduce both action strategies in the same PR.

	remediation := req.Object.GetActiveRemediation()
	if remediation == nil || !remediation.RetriesExhausted(req.Object) {
		conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, "%s", conditions.GetMessage(req.Object, meta.ReadyCondition))
		return ErrMustRequeue
	}
	// Check if retries have exhausted after remediation for early
	// stall condition detection.
	if remediation != nil && remediation.RetriesExhausted(req.Object) {
@matheuscscp (Member, Author) commented:

This check is a tautology: it's essentially the negation of the check right above. I removed it in this separate commit: 079ae1b
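
Reconstructed from the snippet above (a sketch of the resulting flow, not the literal diff in 079ae1b), the simplified code reads:

	remediation := req.Object.GetActiveRemediation()
	if remediation == nil || !remediation.RetriesExhausted(req.Object) {
		conditions.MarkReconciling(req.Object, meta.ProgressingWithRetryReason, "%s", conditions.GetMessage(req.Object, meta.ReadyCondition))
		return ErrMustRequeue
	}
	// Past this point remediation is non-nil and its retries are exhausted,
	// so the early stall-detection body can run without re-checking.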

Signed-off-by: Matheus Pimenta <[email protected]>
@matheuscscp force-pushed the upgrade-retry-on-failure branch 6 times, most recently from b6d1e3c to 2bb4699 on August 20, 2025 15:24
@@ -266,6 +278,14 @@ func (r *AtomicRelease) Reconcile(ctx context.Context, req *Request) error {
	// Run the action sub-reconciler.
	log.Info(fmt.Sprintf("running '%s' action with timeout of %s", next.Name(), timeoutForAction(next, req.Object).String()))
	if err = next.Reconcile(ctx, req); err != nil {
		if retry := req.Object.GetActiveRetry(); retry != nil {
			log.Error(err, fmt.Sprintf("failed to run '%s' action", next.Name()))
@matheuscscp (Member, Author) commented:

Here, the error was already recorded in a status condition and sent in Kubernetes and notification-controller events, but no log shows up because we handle the error here and return a different one. So I'm logging it at this point.

@matheuscscp (Member, Author) commented Aug 20, 2025

I tested this PR in the following scenarios:

  1. Chart build fails due to a templating error. In this case, the HelmReleaseReconciler was returning the error to controller-runtime, resulting in retries with exponential backoff. To fix this I applied this diff.
  2. Build works, but YAML apply fails. I tested this by making the chart apply a resource in a namespace it didn't have RBAC for.
  3. Apply succeeds but Deployment fails due to image pull error.
  4. Apply succeeds and Deployment becomes Ready, but Helm tests fail.
  5. Install succeeds, then I delete the Helm storage secret. This works as expected: a Helm install is performed in the next reconciliation and succeeds.

In all the cases above, after fixing the issue in scenario 1, the controller behavior matched expectations:

  • Except for scenario 5, the release is retried with an upgrade after the respective .retryInterval.
  • A status condition is updated with the error.
  • A log is emitted with the error (after applying this diff).
  • Kubernetes and notification-controller events are sent with the error.

@matheuscscp force-pushed the upgrade-retry-on-failure branch from 2bb4699 to 0fd558a on August 20, 2025 16:55
@matheuscscp force-pushed the upgrade-retry-on-failure branch from 0fd558a to c053726 on August 21, 2025 12:31