
Commit 2952a88: major reorg on dr/shadowing

1 parent 059f774

14 files changed: +414 -323 lines

modules/ROOT/nav.adoc

Lines changed: 10 additions & 5 deletions
@@ -85,8 +85,15 @@
 ***** xref:deploy:redpanda/manual/production/production-deployment.adoc[]
 ***** xref:deploy:redpanda/manual/production/production-readiness.adoc[]
 **** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
-**** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
-**** xref:deploy:redpanda/manual/resilience/shadowing-guide.adoc[]
+**** xref:deploy:redpanda/manual/resilience/index.adoc[Resilience]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/index.adoc[Disaster Recovery]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/setup.adoc[Setup and Configuration]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/monitor.adoc[Monitoring and Operations]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/failover.adoc[Planned Failover]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/emergency-failover.adoc[Emergency Failover]
+***** xref:deploy:redpanda/manual/resilience/whole-cluster-restore.adoc[Whole Cluster Restore]
+***** xref:deploy:redpanda/manual/resilience/topic-recovery.adoc[Topic Recovery]
+***** xref:deploy:redpanda/manual/resilience/remote-read-replicas.adoc[Remote Read Replicas]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
 **** xref:deploy:redpanda/manual/sizing.adoc[Sizing Guidelines]
 **** xref:deploy:redpanda/manual/linux-system-tuning.adoc[System Tuning]
@@ -179,9 +186,7 @@
 *** xref:manage:tiered-storage.adoc[]
 *** xref:manage:fast-commission-decommission.adoc[]
 *** xref:manage:mountable-topics.adoc[]
-*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
-*** xref:manage:topic-recovery.adoc[Topic Recovery]
-*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
+
 ** xref:manage:iceberg/index.adoc[Iceberg]
 *** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
 *** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc renamed to modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/emergency-failover.adoc

Lines changed: 4 additions & 6 deletions
@@ -1,20 +1,18 @@
-= Shadowing Guide
+= Emergency Failover
 :description: Step-by-step emergency guide for failing over Redpanda shadow links during disasters.
+:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc
 :env-linux: true
 :page-categories: Management, High Availability, Disaster Recovery, Emergency Response
 
-[NOTE]
-====
-include::shared:partial$enterprise-license.adoc[]
-====
+include::shared:partial$enterprise-feature.adoc[]
 
 This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.
 
 // TODO: All command output examples in this guide need verification by running actual commands in test environment
 
 [IMPORTANT]
 ====
-This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
+This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:failover.adoc[Planned Failover]. Ensure you have completed the xref:index.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
 ====
 
 == Emergency failover procedure

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/failover.adoc (new file)

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
= Failover
:description: Execute Redpanda Disaster Recovery (Shadowing) failover procedures to transform shadow topics into fully writable resources during disasters.
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

include::shared:partial$emergency-shadowing-callout.adoc[]

Failover is the process of converting shadow topics, or an entire shadow cluster, from read-only replicas into fully writable resources and stopping replication from the source cluster. You can fail over individual topics for selective workload migration, or fail over the entire cluster for comprehensive disaster recovery. This critical operation transforms your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.

== Failover behavior

When you initiate failover, Redpanda performs the following operations:

1. **Stops replication**: Halts all data fetching from the source cluster for the specified topics or the entire shadow link.
2. **Fails over topics**: Converts read-only shadow topics into regular, writable topics.
3. **Updates topic state**: Changes topic status from `ACTIVE` to `FAILING_OVER`, then `FAILED_OVER`.

Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.

== Failover granularity options

You can perform failover at two levels of granularity:

**Individual topic failover:**
[,bash]
----
rpk shadow failover <shadow-link-name> --topic <topic-name>
----

This failover applies only to the specified shadow topic; other topics in the shadow link continue replicating. Use this approach when you need to selectively fail over specific workloads or when testing failover procedures.

**Complete shadow link failover (cluster failover):**
[,bash]
----
rpk shadow failover <shadow-link-name> --all
----

This failover applies to all shadow topics associated with the shadow link simultaneously, effectively failing over the entire cluster's replicated data. Use this approach during a complete regional disaster when you need to activate the entire shadow cluster as your new production environment.

**Force delete shadow link (emergency failover):**
[,bash]
----
rpk shadow delete <shadow-link-name> --force
----

[WARNING]
====
Force deleting a shadow link is irreversible and immediately fails over all topics in the link, bypassing the normal failover state transitions. Use it only as a last resort, when topics are stuck in transitional states and you need immediate access to all replicated data.
====

== Failover states

=== Shadow link states

The shadow link itself has a simple state model:

* **`ACTIVE`**: Shadow link is operating normally, replicating data

Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.

=== Shadow topic states

Individual shadow topics progress through specific states during failover:

* **`ACTIVE`**: Normal replication state before failover
* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
* **`FAILING_OVER`**: Failover initiated, replication stopping
* **`FAILED_OVER`**: Failover completed successfully, topic fully writable

=== Monitor failover progress

Monitor failover progress using the status command:

[,bash]
----
rpk shadow status <shadow-link-name>
----

The output shows individual topic states and any issues encountered during the failover process.
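
To follow a failover as it progresses, you can poll the status command. A minimal approach with the standard `watch` utility:

[,bash]
----
# Refresh the shadow link status every 5 seconds until all topics report FAILED_OVER
watch -n 5 'rpk shadow status <shadow-link-name>'
----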

**Task states during monitoring:**

* **`ACTIVE`**: Task is operating normally and replicating data
* **`FAULTED`**: Task encountered an error and requires attention
* **`NOT_RUNNING`**: Task is not currently executing
* **`LINK_UNAVAILABLE`**: Task cannot communicate with the source cluster

== Post-failover cluster behavior

After successful failover, your shadow cluster exhibits the following characteristics:

**Topic accessibility:**

* Failed-over topics become fully writable and readable.
* Applications can produce and consume messages normally.
* All Kafka APIs are available for failed-over topics.
* Original offsets and timestamps are preserved.
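
As a quick smoke test, you can confirm that a failed-over topic accepts writes and serves reads. A minimal sketch, assuming a hypothetical topic named `orders`:

[,bash]
----
# Produce a test record to the failed-over topic (the record is read from stdin)
echo 'post-failover smoke test' | rpk topic produce orders

# Consume a single record to confirm reads work
rpk topic consume orders --num 1
----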

**Shadow link status:**

* The shadow link remains but stops replicating data.
* Link status shows topics in the `FAILED_OVER` state.
* You can safely delete the shadow link after successful failover.

**Operational limitations:**

* No automatic fallback mechanism to the original source cluster.
* Data transforms remain disabled until you manually re-enable them.
* Audit log history from the source cluster is not available (new audit logs begin immediately).

== Failover considerations and limitations

**Data consistency:**

* Some data loss may occur due to replication lag at the time of failover.
* Consumer group offsets are preserved, allowing applications to resume from their last committed position.
* In-flight transactions at the source cluster are not replicated and will be lost.

**Recovery point objective (RPO):**

The amount of potential data loss depends on replication lag when a disaster occurs. Monitor lag metrics to understand your effective RPO.

**Network partitions:**

If the source cluster becomes accessible again after failover, do not attempt to write to both clusters simultaneously. Doing so causes metadata to diverge and can introduce data inconsistencies.

**Testing requirements:**

Regularly test failover procedures in non-production environments to validate your disaster recovery processes and measure your recovery time objective (RTO).
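
One way to measure RTO is to time a planned failover in a test environment. A minimal sketch, assuming a hypothetical link named `test-dr-link` and a topic named `test-topic`:

[,bash]
----
# Time how long a single-topic failover takes to reach FAILED_OVER
start=$(date +%s)
rpk shadow failover test-dr-link --topic test-topic

until rpk shadow status test-dr-link | grep -q 'FAILED_OVER'; do
  sleep 1
done
echo "Failover completed in $(( $(date +%s) - start )) seconds"
----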

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/index.adoc (new file)

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
= Disaster Recovery (Shadowing)
:description: Set up disaster recovery for Redpanda clusters using Shadowing for cross-region replication.
:page-aliases: deploy:redpanda/manual/resilience/shadowing.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The shadow cluster creates a dedicated client that continuously replicates source cluster data, including offsets, timestamps, and cluster metadata. The result is a read-only shadow cluster that you can quickly fail over to handle production traffic during a disaster.

include::shared:partial$emergency-shadowing-callout.adoc[]

Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.

Shadowing replicates:

* **Topic data**: All records with preserved offsets and timestamps
* **Topic configurations**: Partition counts, retention policies, and other xref:reference:properties/topic-properties.adoc[topic properties]
* **Consumer group offsets**: Enables seamless consumer resumption after failover
* **Access Control Lists (ACLs)**: User permissions and security policies
* **Schema Registry data**: Schema definitions and compatibility settings
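
Because replication is byte-for-byte, you can verify offset preservation by comparing per-partition watermarks on both clusters. A minimal sketch, assuming hypothetical `rpk` profiles named `source` and `shadow` and a topic named `orders`; check your `rpk` version for the exact `rpk topic describe` flags:

[,bash]
----
# Print per-partition details, including high watermarks, on the source cluster
rpk profile use source
rpk topic describe orders --print-partitions

# The same partitions on the shadow cluster should report matching watermarks
rpk profile use shadow
rpk topic describe orders --print-partitions
----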

== How Shadowing fits into disaster recovery

Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing's asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.

The architecture follows an active-passive pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can fail over the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.

Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your safety net for catastrophic regional disasters. While xref:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers near real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.

// TODO: insert diagram. Possibly with a .gif animation showing cluster A being written and cluster B being replicated, with a data flow arrow and geo-separation. Diagram must show icons or labels for topics, configurations, offsets, ACLs, and schemas being copied.

== Limitations

Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link.

Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there is always some replication lag, and you cannot write to both clusters simultaneously.

xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after failover.

After you fail over shadow topics, automatic fallback to the original source cluster is not supported.

[CAUTION]
====
Do not modify synced topic properties on shadow topics. Any changes revert to the source topic's values.
====

== Get started

Choose your implementation approach:

* **xref:setup.adoc[Setup and Configuration]** - Initial shadow link configuration, authentication, and topic selection
* **xref:monitor.adoc[Monitoring and Operations]** - Health checks, lag monitoring, and operational procedures
* **xref:failover.adoc[Planned Failover]** - Controlled disaster recovery testing and migrations
* **xref:emergency-failover.adoc[Emergency Failover]** - Rapid disaster response procedures

== Disaster readiness checklist

Before a disaster occurs, ensure you have:

* [ ] Access to shadow cluster administrative credentials
* [ ] Shadow link names, configuration details, and network settings documented
* [ ] Application connection strings for the shadow cluster prepared
* [ ] Tested failover procedures in a non-production environment
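
A periodic spot-check from a machine with shadow cluster access is a simple way to rehearse the first two items. A minimal sketch, assuming a hypothetical `rpk` profile named `shadow-dr` and a link named `dr-link`:

[,bash]
----
# Confirm stored credentials still work and the link is configured and healthy
rpk profile use shadow-dr
rpk cluster health
rpk shadow describe dr-link
rpk shadow status dr-link
----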

== Next steps

After setting up Shadowing for your Redpanda clusters, consider these additional steps:

* **Test your disaster recovery procedures**: Regularly practice failover scenarios in a non-production environment. See xref:emergency-failover.adoc[Emergency Failover] for step-by-step emergency procedures.

* **Monitor shadow link health**: Set up alerting on the metrics described in xref:monitor.adoc[Monitoring and Operations] to ensure early detection of replication issues.

* **Implement automated failover**: Consider developing automation scripts that can detect outages and initiate failover based on predefined criteria, as in the sketch after this list.

* **Review security policies**: Ensure your ACL filters replicate the appropriate security settings for your disaster recovery environment.

* **Document your configuration**: Maintain up-to-date documentation of your shadow link configuration, including network settings, authentication details, and filter definitions.
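
A minimal sketch of the automation idea above, assuming a hypothetical link named `dr-link` and the default Admin API address on the source cluster. Production automation needs quorum-based outage detection so that a transient network blip does not trigger an irreversible failover:

[,bash]
----
#!/usr/bin/env bash
# Naive outage detector: probe the source cluster's Admin API readiness
# endpoint, and fail over the whole link after several consecutive failures.
SOURCE_ADMIN="http://source-cluster.example.com:9644"
LINK="dr-link"
FAILURES=0

while true; do
  if curl -sf --max-time 5 "$SOURCE_ADMIN/v1/status/ready" > /dev/null; then
    FAILURES=0
  else
    FAILURES=$((FAILURES + 1))
  fi
  if [ "$FAILURES" -ge 5 ]; then
    echo "Source cluster unreachable; failing over all topics on $LINK"
    rpk shadow failover "$LINK" --all
    break
  fi
  sleep 10
done
----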

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/monitor.adoc (new file)

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
= Monitor Shadowing
:description: Monitor Redpanda Disaster Recovery (Shadowing) health with status commands, metrics, and best practices for tracking replication performance.
:page-categories: Management, Monitoring, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. Use `rpk` commands, metrics, and status information to track shadow link health and troubleshoot issues.

== Status commands

List existing shadow links:

[,bash]
----
rpk shadow list
----

View shadow link configuration details:

[,bash]
----
rpk shadow describe <shadow-link-name>
----

This command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.

Check your shadow link status to ensure proper operation:

[,bash]
----
rpk shadow status <shadow-link-name>
----

For troubleshooting specific issues, you can use command options to show individual status sections. See the `rpk` reference for available status options.

The status output includes:

* **Shadow link state**: Overall operational state (`ACTIVE`)
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`)
* **Task status**: Health of replication tasks across brokers (`ACTIVE`, `FAULTED`, `NOT_RUNNING`, `LINK_UNAVAILABLE`)
* **Lag information**: Replication lag per partition, comparing source and shadow watermarks

[[shadow-link-metrics]]
== Metrics

Shadowing provides comprehensive metrics to track replication performance and health:

[cols="1,1,2"]
|===
|Metric |Type |Description

|`redpanda_shadow_link_shadow_lag`
|Gauge
|The lag of the shadow partition behind the source partition, calculated as the source partition's last stable offset (LSO) minus the shadow partition's high watermark (HWM). Monitor by `shadow_link_name`, `topic`, and `partition` to understand replication lag for each partition.

|`redpanda_shadow_link_total_bytes_fetched`
|Count
|The total number of bytes fetched by a sharded replicator (bytes received by the client). Labeled by `shadow_link_name` and `shard` to track data transfer volume from the source cluster.

|`redpanda_shadow_link_total_bytes_written`
|Count
|The total number of bytes written by a sharded replicator (bytes written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor data written to the shadow cluster.

|`redpanda_shadow_link_client_errors`
|Count
|The number of errors seen by the client. Track by `shadow_link_name` and `shard` to identify connection or protocol issues between clusters.

|`redpanda_shadow_link_shadow_topic_state`
|Gauge
|Number of shadow topics in the respective states. Labeled by `shadow_link_name` and `state` to monitor topic state distribution across your shadow links.

|`redpanda_shadow_link_total_records_fetched`
|Count
|The total number of records fetched by the sharded replicator (records received by the client). Monitor by `shadow_link_name` and `shard` to track message throughput from the source.

|`redpanda_shadow_link_total_records_written`
|Count
|The total number of records written by a sharded replicator (records written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor message throughput to the shadow cluster.
|===

See also: xref:reference:public-metrics-reference.adoc[]
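
Without a full monitoring stack, you can spot-check these metrics by scraping the public metrics endpoint directly. A minimal sketch, assuming the default Admin API address `localhost:9644`:

[,bash]
----
# Scrape only the Shadowing metrics from the public metrics endpoint
curl -s http://localhost:9644/public_metrics | grep '^redpanda_shadow_link'
----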

== Monitoring best practices

=== Health check procedures

Establish regular monitoring workflows to ensure shadow link health:

**Health checks:**
[,bash]
----
# Check all shadow links are active (tail skips the header row of the table output)
rpk shadow list | tail -n +2 | grep -v "ACTIVE" || echo "All shadow links healthy"

# Monitor lag for critical topics
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
----

=== Alert thresholds

Configure monitoring alerts for:

* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
* **Topic state changes**: When topics move to the `FAULTED` state
* **Task failures**: When replication tasks enter `FAULTED` or `NOT_RUNNING` states
* **Link unavailability**: When tasks show `LINK_UNAVAILABLE`, indicating source cluster connectivity issues
* **Throughput drops**: When bytes or records fetched drop significantly
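
As a starting point for the topic-state and task-failure alerts, here is a cron-able sketch that exits non-zero when anything is faulted. The link name is hypothetical; adapt the pattern to your alerting tooling:

[,bash]
----
#!/usr/bin/env bash
# Exit 1 (and print an alert) if any topic or task is in a bad state
LINK="dr-link"
bad_states='FAULTED|NOT_RUNNING|LINK_UNAVAILABLE'

if rpk shadow status "$LINK" | grep -Eq "$bad_states"; then
  echo "ALERT: shadow link $LINK reports a faulted topic or task" >&2
  exit 1
fi
echo "Shadow link $LINK healthy"
----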
