diff --git a/modules/ROOT/nav.adoc b/modules/ROOT/nav.adoc
index 6a13c23..d5f0122 100644
--- a/modules/ROOT/nav.adoc
+++ b/modules/ROOT/nav.adoc
@@ -140,6 +140,7 @@ include::partial$autogen-reference.adoc[]
 * Logging
  ** xref:tutorial-couchbase-log-forwarding.adoc[]
 * Monitoring
+ ** xref:tutorial-mirwatchdog.adoc[Monitor for Manual Intervention Scenarios]
  ** xref:tutorial-prometheus.adoc[Quick Start with Prometheus Monitoring]
 * Networking
  ** xref:tutorial-remote-dns.adoc[Inter-Kubernetes Networking with Forwarded DNS]
diff --git a/modules/ROOT/pages/tutorial-mirwatchdog.adoc b/modules/ROOT/pages/tutorial-mirwatchdog.adoc
new file mode 100644
index 0000000..bc6bf78
--- /dev/null
+++ b/modules/ROOT/pages/tutorial-mirwatchdog.adoc
@@ -0,0 +1,83 @@
+= Monitor for Manual Intervention Scenarios
+
+[abstract]
+Use the Manual Intervention Required Watchdog to monitor cluster scenarios and alert you when the Operator cannot automatically resolve them.
+
+include::partial$tutorial.adoc[]
+
+== Overview
+
+The Operator automatically resolves most cluster issues without user involvement.
+However, some scenarios fall outside the Operator's control and require manual intervention.
+The Manual Intervention Required (MIR) Watchdog monitors for these scenarios and places the cluster into a special MIR state when they occur,
+alerting you to take action.
+
+
+=== Enable the Manual Intervention Required Watchdog
+
+Enable the Manual Intervention Required Watchdog for each cluster in the `CouchbaseCluster` CRD (Custom Resource Definition).
+
+[source,yaml]
+----
+spec:
+  mirWatchdog:
+    enabled: true # <.>
+    interval: 20s # <.>
+    skipReconciliation: false # <.>
+----
+
+<.> Enable the Manual Intervention Required Watchdog.
+The default value is `false`.
+<.> Set the interval at which the Manual Intervention Required Watchdog checks for MIR conditions.
+The default value is 20 seconds.
+<.> Specify whether the Operator skips reconciliation when the cluster is in the MIR state.
+The default value is `false`.
+
+==== Alerting
+
+The Manual Intervention Required Watchdog is designed to work with additional alerting based on Kubernetes events, cluster conditions, or metrics.
+
+When a cluster enters the MIR state, the Operator performs the following actions:
+
+* Sets the `cluster_manual_intervention` gauge metric to 1.
+
+* Adds the `ManualInterventionRequired` condition to the cluster, when possible, and includes a message that explains why the cluster entered the MIR state.
+
+* Raises a `ManualInterventionRequired` Kubernetes event with a message that describes the reason for manual intervention.
+
+* Optionally skips reconciliation based on the `spec.mirWatchdog.skipReconciliation` setting until you resolve the issue that caused the MIR state.
+
+==== Manual Intervention Required Scenarios
+
+For each check that the Manual Intervention Required Watchdog performs, the defined entry and exit conditions determine whether the cluster enters or exits the MIR state.
+
+The supported Manual Intervention Required Watchdog checks are as follows:
+
+* <<consecutive-rebalance-failures>>
+* <<couchbase-cluster-authentication-failure>>
+* <<down-nodes-when-quorum-is-lost>>
+* <<tls-certificate-expiration>>
+
+[#consecutive-rebalance-failures]
+===== Consecutive Rebalance Failures
+
+* Entry: After the Operator exhausts all rebalance retry attempts in 3 consecutive reconciliation loops.
+* Exit: After the cluster becomes balanced and the Operator activates all nodes.
+
+[#couchbase-cluster-authentication-failure]
+===== Couchbase Cluster Authentication Failure
+
+* Entry: The Operator fails to authenticate with the cluster by using the provided Couchbase cluster credentials.
+* Exit: The Operator successfully authenticates with the cluster.
+
+[#down-nodes-when-quorum-is-lost]
+===== Down Nodes when Quorum is Lost
+
+* Entry: The Operator detects down nodes that it cannot recover.
+* Exit: The Operator detects no unrecoverable down nodes.
+
+[#tls-certificate-expiration]
+===== TLS Certificate Expiration
+
+* Entry: The Operator detects an expired CA (Certificate Authority), Client or Server Certificate chain, and finds no valid alternative certificates for rotation.
+* Exit: The Operator detects no expired TLS certificates or identifies valid alternative certificates available for rotation.
diff --git a/preview/HEAD.yml b/preview/HEAD.yml
index a29fd69..6d02668 100644
--- a/preview/HEAD.yml
+++ b/preview/HEAD.yml
@@ -3,4 +3,4 @@ sources:
     branches: [release/8.0]
 
   docs-operator:
-    branches: [DOC-13857-tutorial-to-detect-avx2, release/2.8]
\ No newline at end of file
+    branches: [DOC-13857-tutorial-to-detect-avx2, release/2.8]