K8S-4395: Create docs for new manual intervention required watchdog #50
Merged
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,83 @@ | ||
| = Monitor for Manual Intervention Scenarios | ||
|
|
||
| [abstract] | ||
| Use the Manual Intervention Required Watchdog to monitor cluster scenarios and alert you when the Operator cannot automatically resolve them. | ||
|
|
||
| include::partial$tutorial.adoc[] | ||
|
|
||
| == Overview | ||
|
|
||
| The Operator automatically resolves most cluster issues without user involvement. | ||
| However, some scenarios fall outside the Operator's control and require manual intervention. | ||
| The Manual Intervention Required (MIR) Watchdog monitors for these scenarios and places the cluster into a special MIR state when they occur, | ||
| alerting you to take action. | ||
|
|
||
|
|
||
| === Enable the Manual Intervention Required Watchdog | ||
|
|
||
| Enable the Manual Intervention Required Watchdog for each cluster in the `CouchbaseCluster` CRD (Custom Resource Definition). | ||
|
|
||
| [source,yaml] | ||
| ---- | ||
| spec: | ||
| mirWatchdog: | ||
| enabled: true # <.> | ||
| interval: 20s # <.> | ||
| skipReconciliation: false # <.> | ||
| ---- | ||
|
|
||
| <.> Enable the Manual Intervention Required Watchdog. | ||
| The default value is `false`. | ||
| <.> Set the interval at which the Manual Intervention Required Watchdog checks for MIR conditions. | ||
| The default value is 20 seconds. | ||
| <.> Specify whether the Operator skips reconciliation when the cluster is in the MIR state. | ||
| The default value is `false`. | ||
|
|
||
| ==== Alerting | ||
|
|
||
| The Manual Intervention Required Watchdog is designed to work with additional alerting based on Kubernetes events, cluster conditions, or metrics. | ||
|
|
||
| When a cluster enters the MIR state, the Operator performs the following actions: | ||
|
|
||
| * Sets the `cluster_manual_intervention` gauge metric to 1. | ||
|
|
||
| * Adds the `ManualInterventionRequired` condition to the cluster, when possible, and includes a message that explains why the cluster entered the MIR state. | ||
|
|
||
| * Raises a `ManualInterventionRequired` Kubernetes event with a message that describes the reason for manual intervention. | ||
|
|
||
| * Optionally skips reconciliation based on the `spec.mirWatchdog.skipReconciliation` setting until you resolve the issue that caused the MIR state. | ||
|
|
||
| ==== Manual Intervention Required Scenarios | ||
|
|
||
| For each check that the Manual Intervention Required Watchdog performs, the defined entry and exit conditions determine whether the cluster enters or exits the MIR state. | ||
|
|
||
| The supported Manual Intervention Required Watchdog checks are as follows: | ||
|
|
||
| * <<consecutive-rebalance-failures, Consecutive Rebalance Failures>> | ||
| * <<couchbase-cluster-authentication-failure, Couchbase Cluster Authentication Failure>> | ||
| * <<down-nodes-when-quorum-is-lost, Down Nodes when Quorum is Lost>> | ||
| * <<tls-certificate-expiration, TLS Certificate Expiration>> | ||
|
|
||
| [#consecutive-rebalance-failures] | ||
| ===== Consecutive Rebalance Failures | ||
|
|
||
| * Entry: The Operator exhausts all rebalance retry attempts in 3 consecutive reconciliation loops. | ||
| * Exit: The cluster becomes balanced and the Operator activates all nodes. | ||
|
|
||
| [#couchbase-cluster-authentication-failure] | ||
| ===== Couchbase Cluster Authentication Failure | ||
|
|
||
| * Entry: The Operator fails to authenticate with the cluster by using the provided Couchbase cluster credentials. | ||
| * Exit: The Operator successfully authenticates with the cluster. | ||
|
|
||
| [#down-nodes-when-quorum-is-lost] | ||
| ===== Down Nodes when Quorum is Lost | ||
|
|
||
| * Entry: The Operator detects down nodes that it cannot recover. | ||
| * Exit: The Operator detects no unrecoverable down nodes. | ||
|
|
||
| [#tls-certificate-expiration] | ||
| ===== TLS Certificate Expiration | ||
|
|
||
| * Entry: The Operator detects an expired CA (Certificate Authority), client, or server certificate chain and finds no valid alternative certificates for rotation. | ||
| * Exit: The Operator detects no expired TLS certificates or identifies valid alternative certificates available for rotation. | ||
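
The Alerting section in the diff says the watchdog is meant to be paired with alerting on events, conditions, or metrics. One way to do that with the metric is sketched below. It assumes the Prometheus Operator is installed (its `PrometheusRule` resource is not part of this PR) and that the gauge is exported under the name `cluster_manual_intervention` exactly as the docs write it. The exported name may carry a prefix in a real deployment, so verify it against the Operator's metrics endpoint first. The resource name, group name, alert name, `for` window, and severity label are hypothetical.

```yaml
# Sketch only: the names below are illustrative assumptions, not taken from the PR.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: couchbase-mir-watchdog          # hypothetical name
spec:
  groups:
    - name: couchbase-mir.rules         # hypothetical group name
      rules:
        - alert: CouchbaseManualInterventionRequired   # hypothetical alert name
          # The docs state that the gauge is set to 1 when the cluster enters the MIR state.
          # Metric name assumed to match the docs verbatim; verify the exported name.
          expr: cluster_manual_intervention == 1
          for: 1m                       # assumed debounce window
          labels:
            severity: critical          # assumed severity
          annotations:
            summary: A Couchbase cluster is in the Manual Intervention Required state.
```

The alert only signals that a cluster is in the MIR state. According to the diff, the reason is in the message on the `ManualInterventionRequired` cluster condition and in the Kubernetes event of the same name.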