Merged
1 change: 1 addition & 0 deletions modules/ROOT/nav.adoc
@@ -140,6 +140,7 @@ include::partial$autogen-reference.adoc[]
* Logging
** xref:tutorial-couchbase-log-forwarding.adoc[]
* Monitoring
** xref:tutorial-mirwatchdog.adoc[Monitor for Manual Intervention Scenarios]
** xref:tutorial-prometheus.adoc[Quick Start with Prometheus Monitoring]
* Networking
** xref:tutorial-remote-dns.adoc[Inter-Kubernetes Networking with Forwarded DNS]
63 changes: 63 additions & 0 deletions modules/ROOT/pages/tutorial-mirwatchdog.adoc
@@ -0,0 +1,63 @@
= Monitor for Manual Intervention Scenarios

[abstract]
Use the Manual Intervention Required Watchdog to monitor and alert for cluster scenarios that the Operator is unable to automatically resolve.

include::partial$tutorial.adoc[]

== Overview

The Operator is designed to resolve most issues automatically. Manual Intervention Required (MIR) is a new state that, when enabled, a Couchbase cluster enters in the unlikely event that the Operator cannot reconcile the cluster for reasons outside its control or capabilities, so that a user must intervene manually. The checks that detect these scenarios are run by the Manual Intervention Required Watchdog.

=== Enable the Manual Intervention Required Watchdog

Enable the Manual Intervention Required Watchdog on a per-cluster basis in the `couchbaseclusters` CRD.

[source,yaml]
----
spec:
  mirWatchdog:
    enabled: true <.>
    interval: 20s <.>
    skipReconciliation: false <.>
----

<.> Enable the Manual Intervention Required Watchdog. Default is false.
<.> Set the interval for the Manual Intervention Required Watchdog to check for MIR conditions. Default is 20 seconds.
<.> Set whether to skip reconciliation when in the MIR state. Default is false.
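
To show where these fields sit in a complete resource, the following is a minimal sketch of a `CouchbaseCluster` manifest with the watchdog enabled and reconciliation paused while the cluster is in the MIR state. The cluster name `cb-example` and the 30-second interval are illustrative assumptions, and all other fields a real cluster needs are omitted.

[source,yaml]
----
apiVersion: couchbase.com/v2
kind: CouchbaseCluster
metadata:
  name: cb-example              # hypothetical cluster name
spec:
  # Other required cluster settings are omitted from this sketch.
  mirWatchdog:
    enabled: true               # turn the watchdog on for this cluster only
    interval: 30s               # illustrative value; the default is 20s
    skipReconciliation: true    # pause reconciliation while in the MIR state
----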

==== Alerting

The watchdog is designed to be accompanied by additional alerting on the Kubernetes event, the cluster condition, or the metric, which is why it is disabled by default. If a cluster enters the MIR state, the Operator will:


* Set the `cluster_manual_intervention` gauge metric to 1
* Add (when possible) the ManualInterventionRequired condition to the cluster, with a message detailing the reason for entering the MIR state
* Raise a ManualInterventionRequired Kubernetes event, with the event message set to the reason for entering manual intervention
* Skip reconciliation, if `spec.mirWatchdog.skipReconciliation` is set to `true`, until the MIR state has been resolved, that is, until the issue that put the cluster into that state has been fixed
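
As one way to alert on the gauge listed above, the following is a minimal sketch of a Prometheus alerting rule. It assumes that the Operator's metrics are already scraped by Prometheus and that the gauge is exposed under exactly the name `cluster_manual_intervention`; check the name your deployment exports, as it may carry a prefix. The group name, alert name, `for` duration, and severity label are hypothetical choices.

[source,yaml]
----
groups:
  - name: couchbase-mir-watchdog                     # hypothetical group name
    rules:
      - alert: CouchbaseManualInterventionRequired   # hypothetical alert name
        # Fires while the gauge reports that a cluster is in the MIR state.
        # The metric name is taken from the list above and may differ in practice.
        expr: cluster_manual_intervention == 1
        for: 1m                                      # assumed delay before firing
        labels:
          severity: critical                         # assumed severity label
        annotations:
          summary: Couchbase cluster requires manual intervention
          description: >-
            The Operator cannot reconcile the cluster on its own. Check the
            ManualInterventionRequired condition and event for the reason.
----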


==== Manual Intervention Required Scenarios

Each check that the watchdog performs has an entry condition, which puts the cluster into the MIR state, and an exit condition, which takes it out again. The currently supported checks are:


===== Consecutive Rebalance Failures
* Entry: All rebalance retry attempts are exhausted for 3 consecutive reconciliation loops.
* Exit: The cluster is balanced and all nodes have been activated.


===== Couchbase Cluster Authentication Failure
* Entry: The Operator is unable to authenticate with the cluster using the provided Couchbase cluster credentials.
* Exit: Authentication succeeds.


===== Down Nodes when Quorum is Lost
* Entry: There are down nodes that cannot be recovered by the Operator.
* Exit: There are no unrecoverable down nodes.


===== TLS Certificate Expiration
* Entry: Any of the CA, client, or server certificate chains has expired, and the Operator has no valid alternative to rotate it to.
* Exit: The TLS certificates are no longer expired, or the Operator has valid alternatives it can use to rotate any that are expired.