From 77a487a73975a66dc388c483a325b9c9e90378bb Mon Sep 17 00:00:00 2001
From: BenMotts
Date: Fri, 12 Dec 2025 15:12:48 +0000
Subject: [PATCH 1/2] K8S-4394: Rewrite upgrade concepts and docs so they cover the changes to upgrade controls in 2.9

---
 modules/ROOT/pages/concept-upgrade.adoc         | 340 +++---------------
 .../ROOT/pages/howto-couchbase-upgrade.adoc     | 179 ++++++++-
 2 files changed, 221 insertions(+), 298 deletions(-)

diff --git a/modules/ROOT/pages/concept-upgrade.adoc b/modules/ROOT/pages/concept-upgrade.adoc
index 2358c4d3..acc8d30c 100644
--- a/modules/ROOT/pages/concept-upgrade.adoc
+++ b/modules/ROOT/pages/concept-upgrade.adoc
@@ -6,341 +6,119 @@ This includes upgrading the Couchbase Server version and also related Kubernetes

== Upgrading Couchbase Server

-The Couchbase Server version can be xref:howto-couchbase-upgrade.adoc[upgraded] by the Operator.
+The Couchbase Server version can be xref:howto-couchbase-upgrade.adoc[upgraded] by the Operator.
+To upgrade a Couchbase Cluster, change the `couchbasecluster.spec.image` field in the cluster manifest to the version you wish to upgrade to.
+The Operator will then handle upgrading the pods running Couchbase Server that make up the Couchbase cluster.
+During an upgrade, the cluster version can also be rolled back to the previous version by reverting the image field change.
+The Operator will then perform the upgrade effectively in reverse, following the configured upgrade controls but for pods on the old version.

-Upgrades may be performed using one of a number of different strategies as specified by xref:resource/couchbasecluster.adoc#couchbaseclusters-spec-upgradestrategy[`couchbaseclusters.spec.upgradeStrategy`] in the `CouchbaseCluster` resource:
+For more granular control, the upgrade process can be managed and controlled using `spec.upgrade` in the CouchbaseCluster resource.

-* InPlaceUpgrade performs an in-place upgrade on one, or more, pods at a time
-* Rolling upgrades upgrade one, or more, pods at a time
-* Immediate upgrades upgrade all pods at the same time
+[source,yaml]
+----
+upgrade:
+  upgradeProcess: SwapRebalance # <.>
+  upgradeStrategy: RollingUpgrade # <.>
+  rollingUpgrade:
+    maxUpgradable: 1
+    maxUpgradablePercent: 100%
+  stabilizationPeriod: 10s # <.>
+  previousVersionPodCount: 0 # <.>
+  upgradeOrderType: Nodes # <.>
+  upgradeOrder:
+  - node-1
+  - node-2
+----
+<.> UpgradeProcess can be one of SwapRebalance or InPlaceUpgrade, defaulting to SwapRebalance.
The strategy by which SwapRebalance will create pods on the upgraded version is determined by the chosen UpgradeStrategy.
InPlaceUpgrade can only be used when the strategy is set to RollingUpgrade.

-For rolling and immediate upgrades, the process is as follows:
+- During a SwapRebalance, one or more candidate pods are selected and new pods are created for each.
Data is then rebalanced from the old pods before the candidate pods are deleted.

-* One or more candidate pods are selected
-* New pods are created for each of the candidates with the same Couchbase configuration as the existing ones
-* Data is rebalanced from the old pods to the new ones
-* The candidate pods are deleted
+- During an InPlaceUpgrade, one or more candidate pods are selected and failed over.
Those pods then have their volumes detached and are then replaced with pods on the new version.
The Operator will then bind the existing volumes onto the new pods.
This process is quicker than a rolling upgrade, and pod names and network settings will be retained by the updated pod.
-For in place upgrades, the process is as follows:
+<.> UpgradeStrategy determines how a SwapRebalance will create new pods, either by RollingUpgrade or ImmediateUpgrade.
InPlaceUpgrades require a RollingUpgrade strategy.

-* One or more candidate pods are selected
-* Each candidate is failed over and updated with the new version
-* Each PVC associated with the candidate is updated with the new version
-* Operator performs an in place upgrade on the candidate
+- RollingUpgrade performs the upgrade operation a predetermined number of pods at a time, limiting the risk of an issue and allowing the upgrade to be rolled back more easily.
A rolling upgrade requires the least network and compute overhead, so will affect client operations less.

-When using rolling upgrades, performing the operation one pod at a time, risk is limited in the event of an issue, and can be rolled back.
-A rolling upgrade requires the least network and compute overhead, so will affect client operation less.
+- ImmediateUpgrade poses greater risk and will increase resource utilization during the upgrade; however, the operation itself is significantly faster.
This strategy may have an undesired effect on client performance as all pods in the cluster are undergoing upgrade at the same time.

-When using immediate upgrades, there is a greater risk, and requires greater resource during the upgrade, however the operation itself is significantly faster.
-Immediate upgrades may have an undesired effect on client performance as all pods in the cluster are undergoing upgrade at the same time.
+<.> StabilizationPeriod lets you control how long the Operator should wait between each upgrade cycle, allowing for a period of time in which you might want to check for cluster health or service availability.

-When using in-place upgrade, the pods and PVCs are updated to use the new version. This is quicker than a rolling upgrade and will retain the same PVCs (data retained).
-In-place upgrade cannot be used in conjunction with immediate upgrades. The validator will fail the request if immediate upgrade strategy and in-place upgrade process are both specified.
-The same pod names and networking settings will be retained when the pod is restarted by the operator.
+<.> PreviousVersionPodCount tells the Operator to leave a fixed number of pods running the previous version.
This might be useful to allow for rollbacks or an extended StabilizationPeriod on the final pods during a cluster upgrade.
Couchbase Server will not be considered upgraded until all pods in the cluster are on the new version.
The Operator will also consider the cluster to be in "Mixed Mode", which restricts some features.

-You can balance time against risk by tailoring rolling updates with the xref:resource/couchbasecluster.adoc#couchbaseclusters-spec-rollingupgrade[`couchbaseclusters.spec.rollingUpgrade`] configuration parameter.
-This allows rolling upgrades to upgrade 2 pods at a time, or 20% of the cluster at a time, for example.
+<.> UpgradeOrderType gives you granular ordering controls, letting you predetermine the order in which the Operator will upgrade pods to the new version.
UpgradeOrderType determines how the Operator interprets UpgradeOrder, and must be one of Nodes, ServerGroups, ServerClasses or Services.
The sequence given in UpgradeOrder will then be used for upgrade ordering.
Anything missing from this sequence will revert to the default ordering mechanism.

[NOTE]
====
-In-place upgrades are a volatile process that can lead to interruptions in service.
It should not be used when there are less than 2 data nodes defined or data loss
-may occur.
+InPlaceUpgrade is a volatile process that can lead to interruptions in service.
It should not be used when there are fewer than 2 data nodes defined, or data loss may occur.
====

-=== How to Upgrade a cluster
+=== How to Upgrade a Cluster

-To upgrade a cluster, you need to modify the CouchbaseCluster resource to specify the new version of Couchbase Server that you want to upgrade to.
-This can be done by updating the `spec.image` field in the CouchbaseCluster manifest to the new version.
+To upgrade a cluster, you need to modify the CouchbaseCluster resource to specify the new version of Couchbase Server that you want to upgrade to.
This is done by updating the `spec.image` field in the CouchbaseCluster manifest to the new version.

-For example, if you were upgrading from version 7.2.4 to 7.6.0 then the CouchbaseCluster resource needs to be updated as follows:
+For example, if you were upgrading from Couchbase Server 7.6.7 to 8.0.0, then the manifest needs to be updated as follows:

[source,yaml]
-----
+----
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: my-couchbase-cluster
spec:
-  image: couchbase/server:7.6.0
-----
-
-=== Couchbase Server Upgrade Constraints
+  image: couchbase/server:8.0.0
+----

-Not all upgrade paths are supported by Couchbase Server.
-The Operator enforces the following constraints on Couchbase Server version upgrades.
+During an upgrade, or while the cluster is in Mixed Mode (via PreviousVersionPodCount), bucket storage backend migrations and changes to sidecar containers such as the Cloud Native Gateway and Fluent Bit will be disabled.

-* Upgrades must be an upgrade, downgrades are not supported due to potential protocol incompatibility, for example:
-** 5.5.0 to 5.5.3 is allowed
-** 5.5.3 to 5.5.0 is not allowed
-* Upgrades must not cross multiple major version boundaries, for example:
-** 5.x.x to 6.x.x is allowed
-** 5.x.x to 7.x.x is not allowed
-* Couchbase Server versions cannot be changed during an upgrade
-
-Refer to the Couchbase Server xref:server:install:upgrade.adoc[upgrade documentation] for more information about direct upgrade paths.
+Refer to the Couchbase Server upgrade documentation for information on permitted upgrade paths.

[NOTE]
====
-Modifying the Couchbase Server version during an upgrade is permitted only if the end result is a roll-back to the previous version.
+Modifying the Couchbase Server version while an upgrade is in progress is permitted only if the version you are changing to is a roll-back to the previous version.
====

=== Rollback

-The Couchbase Operator provides the capability to roll back while it is in progress. A rollback can only be performed if the upgrade is in progress and has not yet completed. Once a cluster has fully upgraded to the new version the cluster can no longer be downgraded to a previous version. A rollback will simply replace the nodes on the new version and replace them with nodes with the version they had before an upgrade was started.
+The Operator provides the capability to roll back upgrades in progress.
A rollback is only allowed while there are pods in the cluster still running the previous version.
This should only occur while an upgrade is in progress or the cluster is in Mixed Mode.
Once a cluster is fully upgraded to the new version, it can no longer be rolled back.
A rollback simply consists of replacing the newly created pods with ones running the previous version.
Like an upgrade, the strategy, process, and ordering of a rollback are still determined via the `spec.upgrade` configuration.

-==== How to Roll Back an Upgrade
+==== How to Roll Back a Cluster

-To initiate a rollback, you need to modify the CouchbaseCluster resource to specify the previous version of Couchbase Server that was running before the upgrade started. This can be done by updating the `spec.image` field in the CouchbaseCluster manifest to the previous version.
+To initiate a rollback, you need to modify the CouchbaseCluster manifest to specify the previous version of Couchbase Server that was running before the upgrade started. This is done by updating the `spec.image` field.

-For example, if you were upgrading from version 7.2.4 to 7.6.0 and encountered issues, you can roll back to version 7.2.4 by updating the manifest as follows:
+For example, if you were upgrading from Couchbase Server 7.6.7 to 8.0.0 and encountered issues that require a rollback, you can change the version back to 7.6.7 by updating the manifest as follows:

[source,yaml]
-----
+----
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: my-couchbase-cluster
spec:
-  image: couchbase/server:7.2.4
-----
-
-After applying this change, the operator will begin the rollback process.
+  image: couchbase/server:7.6.7
+----

=== Controlled Upgrades

-The Operator provides the capability to control the upgrade process by upgrading specific server classes at a time, this gives the ability to control the upgrade process and ensure that the upgrade process is controlled. For example, it is possible to upgrade the data nodes first, then the query nodes, and finally the index nodes.
-
-Whilst performing a controlled upgrade the cluster will be in a mixed state with different versions of Couchbase Server running at the same time. This time should be kept to a minimum to avoid potential issues with the cluster.
-
-==== How to Perform a Controlled Upgrade
-To perform a controlled upgrade you need to first modify the CouchbaseCluster Image to specify the image that you want to upgrade to. This can be done by updating the `spec.image` field in the CouchbaseCluster manifest. You can then specify the server classes that you want to upgrade first by updating the `spec.servers.image` to the older image for the server classes you do not want to be upgraded first. You can explicitly update `spec.servers.image` for the server classes that you want to upgrade first, alternatively if the image is not specified for a server class then the image specified in `spec.image` will be used.
-
-.Example `CouchbaseCluster` Resource with three Server Classes
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.2.4
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    services:
-    - index
-  - size: 3
-    name: query
-    services:
-    - query
-----
-
-For the example above if you want to upgrade the data nodes first, then the query nodes, and finally the index nodes you can update the `CouchbaseCluster` resource as follows:
-
-.Example `CouchbaseCluster` Resource with Controlled Upgrade data nodes first
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.6.0 # <.>
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    image: couchbase/server:7.2.4 # <.>
-    services:
-    - index
-  - size: 3
-    image: couchbase/server:7.2.4 # <.>
-    name: query
-    services:
-    - query
-----
-
-<.> The cluster image needs to be the latest version that you want to upgrade to.
-
-<.> The image for the index server class is set to the older version to ensure that the index nodes are not upgraded yet.
-
-<.> The image for the query server class is set to the older version to ensure that the query nodes are not upgraded yet.
-
-.Example `CouchbaseCluster` Resource with Controlled Upgrade Upgrading query nodes second
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.6.0
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    image: couchbase/server:7.2.4 # <.>
-    services:
-    - index
-  - size: 3
-    name: query
-    services:
-    - query
-----
-
-<.> The image for the index server class is set to the older version to ensure that the index nodes are not upgraded yet.
+The Operator provides the capability to control the upgrade process using the fields in `couchbasecluster.spec.upgrade`. For example, it's possible to upgrade pods depending on which availability region they are running in, or alternatively on which Couchbase Server services they are running. As an administrator, you might want to upgrade pods that run the data service before the query service, or perhaps a specific pod with a given name should be upgraded before another.

-.Example `CouchbaseCluster` Resource with Controlled Upgrade Upgrading index nodes last
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.6.0
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    services:
-    - index
-  - size: 3
-    name: query
-    services:
-    - query
-----
-
-==== How to Perform a Controlled Rollback
-
-The Operator also allows you to perform a controlled rollback. To perform a controlled rollback you would performed the controlled rollback steps in reverse. The following examples shows how to perform a controlled rollback on a example cluster where all the server classes but one have upgraded to the new version. Note that once a cluster is fully upgraded to the new version a rollback is no longer possible.
-
-.Example Starting `CouchbaseCluster` Resource for Controlled Rollback
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.6.0 # <.>
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    image: couchbase/server:7.2.4 # <.>
-    services:
-    - index
-  - size: 3
-    name: query
-    services:
-    - query
-----
-
-<.> The cluster image is the latest version that is being upgraded to.
-
-<.> The image for the index server class is set to the older version indicating that the index nodes are not upgraded yet.
-
-.Example `CouchbaseCluster` Resource Controlled Rollback Query Nodes
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.6.0 # <.>
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    image: couchbase/server:7.2.4
-    services:
-    - index
-  - size: 3
-    name: query
-    image: couchbase/server:7.2.4 # <.>
-    services:
-    - query
-----
-
-<.> During a controlled rollback the cluster image should stay as the version that was being upgraded to. If this is changed to the older version then the cluster will all be rolled back to the older version without control over the server class order.
-
-<.> The image for the query server class is set to the older version indicating that the query nodes should be rolled back next.
-
-.Example `CouchbaseCluster` Resource Controlled Rollback Remaining Data Nodes
-[source,yaml]
-----
-apiVersion: couchbase.com/v2
-kind: CouchbaseCluster
-metadata:
-  name: cb-example
-spec:
-  image: couchbase/server:7.2.4 # <.>
-  ...
-  servers:
-  - size: 3
-    name: data
-    services:
-    - data
-  - size: 3
-    name: index
-    services:
-    - index
-  - size: 3
-    name: query
-    services:
-    - query
-----
-
-<.> As there is only one server class left to rollback the cluster image can be set to the older version and the remaining server nodes in the data class will be rolled back.
+[NOTE]
+====
+For as long as there are pods running two versions in a cluster, the Operator will consider the cluster to be in "Mixed Mode" and restrict some features (bucket migration, sidecar changes). Ideally the time spent in this mode is kept to a minimum.
+====

-== Upgrading Pods
+=== Upgrading Pods

-In Kubernetes pods are immutable -- they cannot be modified once they are created.
-If a `CouchbaseCluster` configuration value is modified that would also modify the underlying pod, then the Operator must create a new pod to replace the old one that does not match the required specification.
-Pod upgrades work in exactly the same way as an upgrade to the Couchbase Server version; in fact upgrading the Couchbase Server image is just a subset of modifying any other pod specification parameter.
+Pods in Kubernetes are immutable. If a CouchbaseCluster manifest is modified in a way that requires a replacement of any underlying pods, the Operator will create the new pods in the same way that an upgrade replaces pods running an old version. In fact, upgrading Couchbase Server is just a subset of modifying any other pod specification parameter.

-The Operator compares the required pod specification with the one used to create the original pod, a candidate is selected if the specifications differ.
-The Operator therefore can perform the following tasks:
+The Operator builds pod specifications depending on the manifest and compares those to the existing pods in the cluster. Candidates are selected from any pods where the specifications differ. The Operator therefore can perform the following tasks:

+* Modification of scheduling constraints
 * Modification of environment variables
-* xref:concept-scheduling.adoc[Modification of scheduling constraints]
-* xref:concept-memory-allocation.adoc[Modification of memory constraints]
-* xref:concept-tls.adoc[Enabling and disabling of TLS]
-* xref:concept-persistent-volumes.adoc[Enabling and disabling of persistent storage]
+* Modification of memory constraints
+* Modification of Couchbase services
+* Enabling and disabling of TLS
+* Enabling and disabling of persistent storage

This mechanism allows a cluster to be used from evaluation right up to production, with features enabled as they are required, without service disruption.
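For example, a change to a server class's memory limits is enough to trigger this replacement. A minimal sketch, assuming a server class named `data` and an illustrative size (the `resources` layout mirrors the standard Kubernetes resource requirements):

[source,yaml]
----
spec:
  servers:
  - name: data
    size: 3
    services:
    - data
    resources:
      limits:
        # Raising this from a previous value causes the Operator to
        # replace the data pods, following the configured upgrade controls.
        memory: 8Gi
----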
-== Upgrading Persistent Volumes
+=== Upgrading Persistent Volumes

-Online persistent volume resizing support is not yet available in all supported versions of Kubernetes.
-As a result the Operator supports xref:howto-persistent-volume-resize.adoc[persistent volume resizing] using a similar mechanism to <>.
-The Operator will detect a modification to the specification of a persistent volume template and schedule a pod upgrade in order to satisfy the request.
+The Operator supports modifications to volumes using a similar mechanism to upgrading pods.
Changes to a manifest that require updates to persistent volumes will result in a pod upgrade in order to satisfy the request.
Persistent volumes that are already in use can be optionally expanded in place using `couchbasecluster.spec.enableOnlineVolumeExpansion` without needing to perform an upgrade on the pod, as long as the storage class and Kubernetes cluster support persistent volume resizing.

-During an in-place upgrade, the PVC will be updated to include the updated image and couchbase server versions. The size of the PVC and data within it will not be edited.
+During an InPlaceUpgrade, any PVCs will be updated to include the updated image and Couchbase Server versions. The size of the PVC and the data within will not be edited unless modification is also required to match the desired cluster manifest.
\ No newline at end of file

diff --git a/modules/ROOT/pages/howto-couchbase-upgrade.adoc b/modules/ROOT/pages/howto-couchbase-upgrade.adoc
index bfe39fb1..cbe4b0cc 100644
--- a/modules/ROOT/pages/howto-couchbase-upgrade.adoc
+++ b/modules/ROOT/pages/howto-couchbase-upgrade.adoc
@@ -5,50 +5,195 @@ include::partial$constants.adoc[]

[abstract]
How-to upgrade Couchbase Server to a newer version.

+The Couchbase Server version can be upgraded by changing the `spec.image` field in the cluster manifest to a new version. While an upgrade is taking place, up until all pods are running the new version, it can also be rolled back to the previous version. In the cluster manifest, `spec.upgrade` provides a number of upgrade controls for more granular management of the upgrade process. Before upgrading, it's important to understand the upgrade xref:concept-upgrade.adoc[concepts] and how to use the controls.
+
+[NOTE]
+====
+During an upgrade or rollback where there are two versions of Couchbase Server running in the cluster, the Operator will consider the cluster to be in "Mixed Mode", during which time sidecar pod modifications and bucket storage backend migrations are disabled.
+====
+
+== Upgrading a Cluster
+
Given the existing configuration:

-[source,yaml,subs="attributes,verbatim"]
+[source,yaml]
 ----
 apiVersion: couchbase.com/v2
 kind: CouchbaseCluster
 spec:
-  image: couchbase/server:{couchbase-version-upgrade-from} # <.>
+  image: couchbase/server:7.6.8 # <.>
 ----

-<.> xref:resource/couchbasecluster.adoc#couchbaseclusters-spec-image[`couchbaseclusters.spec.image`] can be modified to any valid Couchbase Server image, in this example we want to upgrade the version only.
+<.> The image can be modified to any valid Couchbase Server image.

-[source,yaml,subs="attributes,verbatim"]
+[source,yaml]
 ----
 apiVersion: couchbase.com/v2
 kind: CouchbaseCluster
 spec:
-  image: couchbase/server:{couchbase-version} # <.>
+  image: couchbase/server:8.0.0 # <.>
 ----

-<.> The modification will trigger the Operator to detect that existing pod specifications do not match the new pod specifications.
-This will perform a xref:concept-upgrade.adoc#upgrading-couchbase-server[rolling upgrade] of Couchbase Server.
+<.> The modification will trigger the Operator to compare existing pod specifications, which use the old image, to new pod specifications, which will use the new image.
Since they don't match, by default the Operator will begin a SwapRebalance upgrade process using the RollingUpgrade strategy.
This creates new pods and rebalances them into the cluster before ejecting the old pods.

-== In Place Upgrade
-Given the existing configuration:
+=== In Place Upgrade
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:
+
+[source,yaml]
+----
+apiVersion: couchbase.com/v2
+kind: CouchbaseCluster
+spec:
+  image: couchbase/server:8.0.0 # <.>
+  upgrade:
+    upgradeProcess: InPlaceUpgrade # <.>
+----
+
+<.> The version we want to upgrade to.
+<.> Inform the Operator we want to perform in-place upgrades of the existing pods. This re-creates pods in place, using the same name and the same persistent volume.
+
+=== Rolling Upgrade Controls
+
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:
+
+[source,yaml]
+----
+apiVersion: couchbase.com/v2
+kind: CouchbaseCluster
+spec:
+  image: couchbase/server:8.0.0 <.>
+  upgrade:
+    upgradeProcess: SwapRebalance
+    upgradeStrategy: RollingUpgrade
+    rollingUpgrade: <.>
+      maxUpgradable: 3 <.>
+      maxUpgradablePercent: 10 <.>
+----
+
+<.> The version we want to upgrade to.
+<.> `couchbasecluster.spec.upgrade.rollingUpgrade` has two options. Either or both can be configured, and the Operator will upgrade the lower of the two resulting numbers of pods per cycle.
+<.> Configure the Operator to upgrade at most 3 pods at a time.
+<.> Configure the Operator to upgrade at most 10% of pods at a time, relative to the total cluster size, rounded down.
+If the above `couchbasecluster.spec.upgrade.rollingUpgrade` were used on a cluster with 10 pods, only 1 would be upgraded at a time, as the percentage of the total is lower than the fixed number and therefore takes precedence. If, however, we had a cluster with 60 pods, the Operator would upgrade 3 at a time, as the fixed number is now lower than the percentage.
+
+=== Granular Controls

-[source,yaml,subs="attributes,verbatim"]
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:
+
+[source,yaml]
 ----
 apiVersion: couchbase.com/v2
 kind: CouchbaseCluster
 spec:
-  image: couchbase/server:{couchbase-version-upgrade-from}
-  upgradeProcess: InPlaceUpgrade # <.>
+  image: couchbase/server:8.0.0 <.>
+  upgrade:
+    upgradeProcess: SwapRebalance
+    upgradeStrategy: RollingUpgrade
+    rollingUpgrade:
+      maxUpgradable: 1
+      maxUpgradablePercent: 20
+    stabilizationPeriod: 5m <.>
+    previousVersionPodCount: 4 <.>
+    upgradeOrderType: Nodes <.>
+    upgradeOrder: <.>
+    - cb-instance-3
+    - cb-instance-1
+    - cb-instance-2
+----
+
+<.> The version we want to upgrade to.
+<.> The Operator will wait for this period of time between each upgrade cycle. In this instance, every time an upgrade cycle occurs, there will be a 5-minute wait until the next one. During this period, normal Operator functions, other than those that are disabled during upgrades, will continue to run.
+<.> The Operator will leave 4 pods on the previous version and not upgrade them. This might be useful if you want an extended stabilization period for a set number of nodes. Couchbase Server and the Operator will not consider an upgrade complete until all pods are running the new version.
Features only available on the upgraded version will not be available until this is the case.
+<.> Determine the order in which the Operator will upgrade pods, either by pod (Couchbase node) name, server group, server class, or by the Couchbase services running on a pod. You might want to upgrade pods in a specific availability zone before others, or perhaps your pods running the data service should be upgraded before those with the index service.
+<.> Depending on the upgradeOrderType, this field is a non-exhaustive list determining the order in which the Operator should upgrade pods of that type. In this case, the Operator will upgrade pods by node name, meaning cb-instance-3 will be upgraded first, followed by cb-instance-1 and finally cb-instance-2. Any remaining pods that aren't selected by the list will be upgraded in alphabetical order.
+
+=== Order Upgrades by Server Group
+
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:
+
+[source,yaml]
+----
+apiVersion: couchbase.com/v2
+kind: CouchbaseCluster
+spec:
+  serverGroups:
+  - zone-1
+  - zone-2
+  - zone-3
+  image: couchbase/server:8.0.0 <.>
+  upgrade:
+    upgradeOrderType: ServerGroups <.>
+    upgradeOrder: <.>
+    - zone-3
+----
+
+<.> The version we want to upgrade to.
+<.> The order type we want the Operator to use.
+<.> Upgrade pods running in server group zone-3 first. Pods running on server groups not in this list, zone-1 and zone-2, will be upgraded in the order given in the `couchbasecluster.spec.serverGroups` list.
+
+=== Order Upgrades by Server Class
+
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:
+
+[source,yaml]
+----
+apiVersion: couchbase.com/v2
+kind: CouchbaseCluster
+spec:
+  servers:
+  - name: data_only
+    services:
+    - data
+    size: 2
+  - name: idx_query
+    services:
+    - index
+    - query
+    size: 6
+  image: couchbase/server:8.0.0 <.>
+  upgrade:
+    upgradeOrderType: ServerClasses <.>
+    upgradeOrder: <.>
+    - idx_query
+    - data_only
+----
+
+<.> The version we want to upgrade to.
+<.> The order type we want the Operator to use.
+<.> Upgrade pods running in server class idx_query first, followed by pods running in data_only. The maximum number of pods that will be upgraded on each cycle will be limited by either the `spec.upgrade.rollingUpgrade.maxUpgradable` field or the number of pods in the server class being upgraded, whichever is smaller. The Operator will not upgrade more than one server class at a time. If the number of pods running a server class is less than the `spec.upgrade.rollingUpgrade.maxUpgradable` amount, the Operator will not upgrade pods from the next server class in the list.
+
+=== Order Upgrades by Service
-<.> This field will inform the operator that we want to perform an in-place upgrade of the existing pods.
+Assuming `spec.image` is currently set to a version below `couchbase/server:8.0.0`, update the manifest to:

-[source,yaml,subs="attributes,verbatim"]
+[source,yaml]
 ----
 apiVersion: couchbase.com/v2
 kind: CouchbaseCluster
 spec:
-  image: couchbase/server:{couchbase-version} # <.>
-  upgradeProcess: InPlaceUpgrade
+  servers:
+  - name: data_only
+    services:
+    - data
+    size: 2
+  - name: idx_query
+    services:
+    - index
+    - query
+    size: 2
+  - name: query_only
+    services:
+    - query
+    size: 2
+  image: couchbase/server:8.0.0 <.>
+  upgrade:
+    upgradeOrderType: Services <.>
+    upgradeOrder: <.>
+    - index
+    - data
+    - query
 ----
-<.> The Operator will detect that existing pod specifications do not match the new pod specifications and trigger an in-place upgrade.
\ No newline at end of file
+<.> The version we want to upgrade to.
+<.> The order type we want the Operator to use.
+<.> The sequence by which the Operator should upgrade services. Pods running multiple services will be ordered by the first of those services in the sequence list. Using this manifest, pods running the index service (those in the idx_query server class) will be upgraded first, followed by the pods running the data service, and finally the pods running the query service. If the number of pods running a service is less than the `spec.upgrade.rollingUpgrade.maxUpgradable` amount, the Operator will not upgrade pods from the next service in the list. If the upgrade order sequence does not include all services, pods running services not in the list will be upgraded using the default order: data, query, index, search, analytics, eventing.
\ No newline at end of file

From d9c4f4bbc6e64b84af759bb3bb28b3329d54129b Mon Sep 17 00:00:00 2001
From: BenMotts
Date: Tue, 16 Dec 2025 11:46:55 +0000
Subject: [PATCH 2/2] K8S-4097: Mir Watchdog docs

---
 modules/ROOT/nav.adoc                        |  1 +
 modules/ROOT/pages/tutorial-mirwatchdog.adoc | 63 ++++++++++++++++++++
 2 files changed, 64 insertions(+)
 create mode 100644 modules/ROOT/pages/tutorial-mirwatchdog.adoc

diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 10fa5303..29ed3fb5 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -135,6 +135,7 @@ include::partial$autogen-reference.adoc[]
 * Logging
 ** xref:tutorial-couchbase-log-forwarding.adoc[]
 * Monitoring
+** xref:tutorial-mirwatchdog.adoc[Monitor for Manual Intervention Scenarios]
 ** xref:tutorial-prometheus.adoc[Quick Start with Prometheus Monitoring]
 * Networking
 ** xref:tutorial-remote-dns.adoc[Inter-Kubernetes Networking with Forwarded DNS]

diff --git a/modules/ROOT/pages/tutorial-mirwatchdog.adoc b/modules/ROOT/pages/tutorial-mirwatchdog.adoc
new file mode 100644
index 00000000..36b5a381
--- /dev/null
+++ b/modules/ROOT/pages/tutorial-mirwatchdog.adoc
@@ -0,0 +1,63 @@
= Monitor for Manual Intervention Scenarios

[abstract]
Use the Manual Intervention Required Watchdog to monitor and alert for cluster scenarios that the Operator is unable to automatically resolve.

include::partial$tutorial.adoc[]

== Overview

While the Operator is designed to automatically resolve most issues, Manual Intervention Required (MIR) is a new state that, when enabled, a Couchbase Cluster will enter in the unlikely scenario that the Operator is unable to reconcile for reasons outside of its control or capabilities, and which therefore require manual intervention by a user. These additional checks are run by the "Manual Intervention Required Watchdog".
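Once the watchdog is enabled (see the next section), you can check whether a cluster has entered the MIR state by inspecting its conditions and events. A sketch, assuming a cluster named `cb-example` in the current namespace (the name is illustrative):

[source,console]
----
# Inspect the cluster for the ManualInterventionRequired condition and its message
kubectl get couchbasecluster cb-example \
  -o jsonpath='{.status.conditions[?(@.type=="ManualInterventionRequired")]}'

# Look for the ManualInterventionRequired Kubernetes event raised by the Operator
kubectl get events --field-selector reason=ManualInterventionRequired
----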
=== Enable the Manual Intervention Required Watchdog

Enable the Manual Intervention Required Watchdog on a per-cluster basis in the `CouchbaseCluster` resource.

[source,yaml]
----
spec:
  mirWatchdog:
    enabled: true <.>
    interval: 20s <.>
    skipReconciliation: false <.>
----

<.> Enable the Manual Intervention Required Watchdog. Default is false.
<.> Set the interval at which the Manual Intervention Required Watchdog checks for MIR conditions. Default is 20 seconds.
<.> Set whether to skip reconciliation when in the MIR state. Default is false.

==== Alerting

The MIR state is designed to be accompanied by additional alerting on the Kubernetes events, cluster conditions, and metrics it produces, which is why `skipReconciliation` defaults to false. If a cluster enters the MIR state, it will:

* Set the `cluster_manual_intervention` gauge metric to 1
* Add (when possible) the ManualInterventionRequired condition to the cluster, with a message detailing the reason for entering the MIR state
* Raise a ManualInterventionRequired Kubernetes event, with the event message set to the reason for entering manual intervention
* Depending on the `spec.mirWatchdog.skipReconciliation` setting, optionally skip reconciliation until the manual intervention required state has been resolved, that is, until the issue that put the cluster into that condition has been fixed

==== Manual Intervention Required Scenarios

For each of the checks the watchdog performs, entry and exit conditions determine when the cluster enters and exits the MIR state. The currently supported checks are:

===== Consecutive Rebalance Failures
* Entry: All rebalance retry attempts are exhausted for 3 consecutive reconciliation loops.
* Exit: The cluster is balanced and all nodes have been activated.

===== Couchbase Cluster Authentication Failure
* Entry: The Operator is unable to use the provided Couchbase Cluster credentials to authenticate with the cluster.
* Exit: Authentication succeeds.

===== Down Nodes when Quorum is Lost
* Entry: There are down nodes that cannot be recovered by the Operator.
* Exit: There are no unrecoverable down nodes.

===== TLS Certificate Expiration
* Entry: Any of the CA, client, or server certificate chains expire and the Operator does not have valid alternatives to rotate them to.
* Exit: TLS certificates are no longer expired, or the Operator has valid alternatives it can use to rotate any that are expired.
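As an example of the additional alerting this watchdog is designed to pair with, a minimal Prometheus alerting rule over the `cluster_manual_intervention` gauge might look like the following sketch (the rule name, `for` duration, and labels are illustrative assumptions, not shipped defaults):

[source,yaml]
----
groups:
- name: couchbase-mir
  rules:
  - alert: CouchbaseManualInterventionRequired
    # Fires when the Operator reports a cluster in the MIR state
    expr: cluster_manual_intervention == 1
    for: 5m
    labels:
      severity: critical
    annotations:
      summary: "A Couchbase cluster has entered the Manual Intervention Required state"
----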