| 
 | 1 | +---  | 
 | 2 | +title: do-not-block-on-degraded-true-clusteroperators  | 
 | 3 | +authors:  | 
 | 4 | +  - "@wking"  | 
 | 5 | +reviewers:  | 
 | 6 | +  - "@PratikMahajan, update team lead"  | 
 | 7 | +  - "@sdodson, update staff engineer"  | 
 | 8 | +approvers:  | 
 | 9 | +  - "@PratikMahajan, update team lead"  | 
 | 10 | +api-approvers:  | 
 | 11 | +  - None  | 
 | 12 | +creation-date: 2024-11-25  | 
 | 13 | +last-updated: 2024-12-03  | 
 | 14 | +tracking-link:  | 
 | 15 | +  - https://issues.redhat.com/browse/OTA-540  | 
 | 16 | +---  | 
 | 17 | + | 
 | 18 | +# Do not block on Degraded=True ClusterOperators  | 
 | 19 | + | 
 | 20 | +## Summary  | 
 | 21 | + | 
 | 22 | +The cluster-version operator (CVO) uses an update mode when transitioning between releases, where the manifest operands are [sorted into a task-node graph](/dev-guide/cluster-version-operator/user/reconciliation.md#manifest-graph), and the CVO walks the graph, reconciling each manifest.  | 
 | 23 | +Since 4.1, the cluster-version operator has blocked during update and reconcile modes (but not during install mode) on `Degraded=True` ClusterOperators.  | 
 | 24 | +This enhancement proposes ignoring `Degraded` when deciding whether to block on a ClusterOperator manifest.  | 
 | 25 | + | 
 | 26 | +## Motivation  | 
 | 27 | + | 
 | 28 | +The goal of blocking on manifests with sad resources is to avoid further destabilization.  | 
 | 29 | +For example, if we have not reconciled a namespace manifest or ServiceAccount RoleBinding, there's no point in trying to update the consuming operator Deployment.  | 
 | 30 | +Or if we are unable to update the Kube-API-server operator, we don't want to inject [unsupported kubelet skew][kubelet-skew] by asking the machine-config operator to update nodes.  | 
 | 31 | + | 
 | 32 | +However, blocking the update on a sad resource has the downside that later manifest-graph task-nodes are not reconciled, while the CVO waits for the sad resource to return to happiness.  | 
 | 33 | +We maximize safety by blocking when progress would be risky, while continuing when progress would be safe, and possibly helpful.  | 
 | 34 | + | 
 | 35 | +Our experience with `Degraded=True` blocks turns up cases like:  | 
 | 36 | + | 
 | 37 | +* 4.6 `Degraded=True` on an unreachable, user-provided node, with monitoring reporting `UpdatingnodeExporterFailed`, network reporting `RolloutHung`, and machine-config reporting `MachineConfigDaemonFailed`.  | 
 | 38 | +  But those ClusterOperators were all still `Available=True`, and in 4.10 and later, monitoring workloads are guarded by PodDisruptionBudgets (PDBs).  | 
 | 39 | + | 
 | 40 | +### User Stories  | 
 | 41 | + | 
 | 42 | +* As a cluster administrator, I want the ability to defer recovering `Degraded=True` ClusterOperators without slowing ClusterVersion updates.  | 
 | 43 | + | 
 | 44 | +### Goals  | 
 | 45 | + | 
 | 46 | +ClusterVersion updates will no longer block on ClusterOperators solely based on `Degraded=True`.  | 
 | 47 | + | 
 | 48 | +### Non-Goals  | 
 | 49 | + | 
 | 50 | +* Adjusting how the cluster-version operator treats `Available` and `versions` in ClusterOperator status.  | 
 | 51 | +  The CVO will still block on `Available=False` ClusterOperators, and will also still block until the ClusterOperator's `status.versions` matches the versions declared in its release manifest.  | 
 | 52 | + | 
 | 53 | +* Adjusting whether `Degraded` ClusterOperator conditions are propagated through to the ClusterVersion `Failing` condition.  | 
 | 54 | +  As with the current install mode, the sad condition will be propagated through to `Failing=True`, unless outweighed by a more serious condition like `Available=False`.  | 
 | 55 | + | 
 | 56 | +## Proposal  | 
 | 57 | + | 
 | 58 | +The cluster-version operator currently has [a mode switch][cvo-degraded-mode-switch] that makes a `Degraded` ClusterOperator a non-blocking condition that is still propagated through to `Failing`.  | 
 | 59 | +This enhancement proposes making that an unconditional `UpdateEffectReport`, regardless of the CVO's current mode (installing, updating, reconciling, etc.).  | 
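
The difference between the current and proposed behavior can be sketched in Go. This is a simplified, self-contained illustration and not the CVO's actual code: the `Mode`, `Effect`, and `Status` types, their constants, and the `currentEffect` and `proposedEffect` functions are hypothetical names introduced here, and the real handling of `Available` and `versions` has more nuance than a single boolean each.

```go
package main

import "fmt"

// Mode is a hypothetical stand-in for the CVO's current mode.
type Mode string

const (
	Installing  Mode = "Installing"
	Updating    Mode = "Updating"
	Reconciling Mode = "Reconciling"
)

// Effect is a hypothetical stand-in for how a sad ClusterOperator
// affects the manifest-graph walk.
type Effect string

const (
	EffectNone   Effect = "None"   // nothing to report
	EffectReport Effect = "Report" // propagate to Failing=True, but keep walking
	EffectBlock  Effect = "Block"  // stop at this task-node until it recovers
)

// Status is a simplified view of a ClusterOperator's status (assumption:
// one boolean per check, which the real operator does not use).
type Status struct {
	Available     bool
	Degraded      bool
	VersionsMatch bool // status.versions matches the release manifest
}

// currentEffect sketches the behavior since 4.1: Degraded=True blocks
// in every mode except install.
func currentEffect(mode Mode, s Status) Effect {
	if !s.Available || !s.VersionsMatch {
		return EffectBlock
	}
	if s.Degraded {
		if mode == Installing {
			return EffectReport
		}
		return EffectBlock
	}
	return EffectNone
}

// proposedEffect sketches this enhancement: Degraded=True is reported
// but never blocks, regardless of mode.
func proposedEffect(_ Mode, s Status) Effect {
	if !s.Available || !s.VersionsMatch {
		return EffectBlock
	}
	if s.Degraded {
		return EffectReport
	}
	return EffectNone
}

func main() {
	degraded := Status{Available: true, Degraded: true, VersionsMatch: true}
	for _, m := range []Mode{Installing, Updating, Reconciling} {
		fmt.Printf("%-11s current=%s proposed=%s\n",
			m, currentEffect(m, degraded), proposedEffect(m, degraded))
	}
}
```

In this sketch, `Available=False` and mismatched `versions` still block in both functions, which mirrors the Non-Goals above; only the `Degraded=True` branch changes.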
 | 60 | + | 
 | 61 | +### Workflow Description  | 
 | 62 | + | 
 | 63 | +Cluster administrators will be largely unaware of this feature.  | 
 | 64 | +They will no longer have ClusterVersion update progress slowed by `Degraded=True` ClusterOperators, so there will be less admin involvement there.  | 
 | 65 | +They will continue to be notified of `Degraded=True` ClusterOperators via [the `warning` `ClusterOperatorDegraded` alert][ClusterOperatorDegraded] and the `Failing=True` ClusterVersion condition.  | 
 | 66 | + | 
 | 67 | +### API Extensions  | 
 | 68 | + | 
 | 69 | +No API extensions are needed for this proposal.  | 
 | 70 | + | 
 | 71 | +### Topology Considerations  | 
 | 72 | + | 
 | 73 | +#### Hypershift / Hosted Control Planes  | 
 | 74 | + | 
 | 75 | +HyperShift's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.  | 
 | 76 | + | 
 | 77 | +#### Standalone Clusters  | 
 | 78 | + | 
 | 79 | +Yes, the enhancement is expected to improve the update experience on standalone clusters by decoupling ClusterVersion update completion from recovering `Degraded=True` ClusterOperators, granting the cluster administrator the flexibility to address update speed and operator degradation independently.  | 
 | 80 | + | 
 | 81 | +#### Single-node Deployments or MicroShift  | 
 | 82 | + | 
 | 83 | +Single-node's ClusterOperator context is the same as standalone, so it will receive the same benefits from the same cluster-version operator code change, and does not need special consideration.  | 
 | 84 | +This change is a minor tweak to existing CVO code, so it is not expected to impact resource consumption.  | 
 | 85 | + | 
 | 86 | +MicroShift updates are managed via RPMs, without a cluster-version operator, so it is not exposed to the ClusterVersion updates this enhancement is refining, and not affected by the changes proposed in this enhancement.  | 
 | 87 | + | 
 | 88 | +### Implementation Details/Notes/Constraints  | 
 | 89 | + | 
 | 90 | +The code change is expected to be a handful of lines, as discussed in [the *Proposal* section](#proposal), so there are no further implementation details needed.  | 
 | 91 | + | 
 | 92 | +### Risks and Mitigations  | 
 | 93 | + | 
 | 94 | +The risk is that some ClusterOperators currently rely on the cluster-version operator blocking during updates on ClusterOperators that are `Available=True`, `Degraded=True`, and which set the release manifest's expected `versions`.  | 
 | 95 | +As discussed in [the *Motivation* section](#motivation), we're not currently aware of any such ClusterOperators.  | 
 | 96 | +If any turn up, we can mitigate by [declaring conditional update risks](targeted-update-edge-blocking.md) using the existing `cluster_operator_conditions{condition="Degraded"}` PromQL metric, while teaching the relevant operators to set `Available=False` and/or withhold their `versions` bumps until the issue that needs to block further ClusterVersion update progress has been resolved.  | 
 | 97 | + | 
 | 98 | +How will security be reviewed and by whom?  | 
 | 99 | +Unclear.  Feedback welcome.  | 
 | 100 | + | 
 | 101 | +How will UX be reviewed and by whom?  | 
 | 102 | +Unclear.  Feedback welcome.  | 
 | 103 | + | 
 | 104 | +### Drawbacks  | 
 | 105 | + | 
 | 106 | +As discussed in [the *Risks* section](#risks-and-mitigations), the main drawback is changing behavior that we've had in place for many years.  | 
 | 107 | +But we do not expect much customer pushback based on "hey, my update completed?!  I expected it to stick on this sad component...".  | 
 | 108 | +We do expect it to reduce customer frustration when they want the update to complete, but for reasons like administrative silos do not have the ability to recover a component from minor degradation themselves.  | 
 | 109 | + | 
 | 110 | +## Test Plan  | 
 | 111 | + | 
 | 112 | +**Note:** *Section not required until targeted at a release.*  | 
 | 113 | + | 
 | 114 | +Consider the following in developing a test plan for this enhancement:  | 
 | 115 | +- Will there be e2e and integration tests, in addition to unit tests?  | 
 | 116 | +- How will it be tested in isolation vs with other components?  | 
 | 117 | +- What additional testing is necessary to support managed OpenShift service-based offerings?  | 
 | 118 | + | 
 | 119 | +No need to outline all of the test cases, just the general strategy. Anything  | 
 | 120 | +that would count as tricky in the implementation and anything particularly  | 
 | 121 | +challenging to test should be called out.  | 
 | 122 | + | 
 | 123 | +All code is expected to have adequate tests (eventually with coverage  | 
 | 124 | +expectations).  | 
 | 125 | + | 
 | 126 | +## Graduation Criteria  | 
 | 127 | + | 
 | 128 | +There are no API changes proposed by this enhancement, which only affects sad-path handling, so we expect the code change to go straight to the next generally-available release, without feature gating or staged graduation.  | 
 | 129 | + | 
 | 130 | +### Dev Preview -> Tech Preview  | 
 | 131 | + | 
 | 132 | +Not applicable.  | 
 | 133 | + | 
 | 134 | +### Tech Preview -> GA  | 
 | 135 | + | 
 | 136 | +Not applicable.  | 
 | 137 | + | 
 | 138 | +### Removing a deprecated feature  | 
 | 139 | + | 
 | 140 | +Not applicable.  | 
 | 141 | + | 
 | 142 | +## Upgrade / Downgrade Strategy  | 
 | 143 | + | 
 | 144 | +This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.  | 
 | 145 | + | 
 | 146 | +## Version Skew Strategy  | 
 | 147 | + | 
 | 148 | +This enhancement only affects the cluster-version operator's internal processing of longstanding ClusterOperator APIs, so there are no skew or compatibility issues.  | 
 | 149 | + | 
 | 150 | +## Operational Aspects of API Extensions  | 
 | 151 | + | 
 | 152 | +There are no API changes proposed by this enhancement.  | 
 | 153 | + | 
 | 154 | +## Support Procedures  | 
 | 155 | + | 
 | 156 | +This enhancement is a small pivot in how the cluster-version operator processes ClusterOperator manifests during updates.  | 
 | 157 | +As discussed in [the *Drawbacks* section](#drawbacks), we do not expect cluster admins to open support cases related to this change.  | 
 | 158 | + | 
 | 159 | +## Alternatives  | 
 | 160 | + | 
 | 161 | +We could continue with the current approach, and absorb the occasional friction it causes.  | 
 | 162 | + | 
 | 163 | +## Infrastructure Needed  | 
 | 164 | + | 
 | 165 | +No additional infrastructure is needed for this enhancement.  | 
 | 166 | + | 
 | 167 | +[ClusterOperatorDegraded]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/install/0000_90_cluster-version-operator_02_servicemonitor.yaml#L106-L124  | 
 | 168 | +[cvo-degraded-mode-switch]: https://github.com/openshift/cluster-version-operator/blob/820b74aa960717aae5431f783212066736806785/pkg/cvo/internal/operatorstatus.go#L241-L245  | 
 | 169 | +[kubelet-skew]: https://kubernetes.io/releases/version-skew-policy/#kubelet  | 