
Commit 2952a88: major reorg on dr/shadowing

1 parent 059f774

14 files changed: +414 -323 lines

modules/ROOT/nav.adoc

Lines changed: 10 additions & 5 deletions
@@ -85,8 +85,15 @@
 ***** xref:deploy:redpanda/manual/production/production-deployment.adoc[]
 ***** xref:deploy:redpanda/manual/production/production-readiness.adoc[]
 **** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
-**** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
-**** xref:deploy:redpanda/manual/resilience/shadowing-guide.adoc[]
+**** xref:deploy:redpanda/manual/resilience/index.adoc[Resilience]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/index.adoc[Disaster Recovery]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/setup.adoc[Setup and Configuration]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/monitor.adoc[Monitoring and Operations]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/failover.adoc[Planned Failover]
+***** xref:deploy:redpanda/manual/resilience/disaster-recovery/emergency-failover.adoc[Emergency Failover]
+***** xref:deploy:redpanda/manual/resilience/whole-cluster-restore.adoc[Whole Cluster Restore]
+***** xref:deploy:redpanda/manual/resilience/topic-recovery.adoc[Topic Recovery]
+***** xref:deploy:redpanda/manual/resilience/remote-read-replicas.adoc[Remote Read Replicas]
 **** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
 **** xref:deploy:redpanda/manual/sizing.adoc[Sizing Guidelines]
 **** xref:deploy:redpanda/manual/linux-system-tuning.adoc[System Tuning]
@@ -179,9 +186,7 @@
 *** xref:manage:tiered-storage.adoc[]
 *** xref:manage:fast-commission-decommission.adoc[]
 *** xref:manage:mountable-topics.adoc[]
-*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
-*** xref:manage:topic-recovery.adoc[Topic Recovery]
-*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
+
 ** xref:manage:iceberg/index.adoc[Iceberg]
 *** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
 *** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]

modules/deploy/pages/redpanda/manual/resilience/shadowing-guide.adoc renamed to modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/emergency-failover.adoc

Lines changed: 4 additions & 6 deletions
@@ -1,20 +1,18 @@
-= Shadowing Guide
+= Emergency Failover
 :description: Step-by-step emergency guide for failing over Redpanda shadow links during disasters.
+:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc
 :env-linux: true
 :page-categories: Management, High Availability, Disaster Recovery, Emergency Response
 
-[NOTE]
-====
-include::shared:partial$enterprise-license.adoc[]
-====
+include::shared:partial$enterprise-feature.adoc[]
 
 This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.
 
 // TODO: All command output examples in this guide need verification by running actual commands in test environment
 
 [IMPORTANT]
 ====
-This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
+This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:failover.adoc[Planned Failover]. Ensure you have completed the xref:index.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
 ====
 
 == Emergency failover procedure

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/failover.adoc (new file)

Lines changed: 130 additions & 0 deletions
@@ -0,0 +1,130 @@
= Failover
:description: Execute Redpanda Disaster Recovery (Shadowing) failover procedures to transform shadow topics into fully writable resources during disasters.
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

include::shared:partial$emergency-shadowing-callout.adoc[]

Failover is the process of converting shadow topics, or an entire shadow cluster, from read-only replicas into fully writable resources and stopping replication from the source cluster. You can fail over individual topics for selective workload migration, or fail over the entire cluster for comprehensive disaster recovery. This critical operation transforms your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.

== Failover behavior

When you initiate failover, Redpanda performs the following operations:

1. **Stops replication**: Halts all data fetching from the source cluster for the specified topics or the entire shadow link.
2. **Fails over topics**: Converts read-only shadow topics into regular, writable topics.
3. **Updates topic state**: Changes topic status from `ACTIVE` to `FAILING_OVER`, then `FAILED_OVER`.

Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.

== Failover granularity options

You can perform failover at two levels of granularity:

**Individual topic failover:**
[,bash]
----
rpk shadow failover <shadow-link-name> --topic <topic-name>
----

This failover applies only to the specified shadow topic; other topics in the shadow link continue replicating. Use this approach when you need to selectively fail over specific workloads or when testing failover procedures.

**Complete shadow link failover (cluster failover):**
[,bash]
----
rpk shadow failover <shadow-link-name> --all
----

This failover applies to all shadow topics associated with the shadow link simultaneously, effectively failing over the entire cluster's replicated data. Use this approach during a complete regional disaster when you need to activate the entire shadow cluster as your new production environment.

**Force delete shadow link (emergency failover):**
[,bash]
----
rpk shadow delete <shadow-link-name> --force
----

[WARNING]
====
Force deleting a shadow link is irreversible and immediately fails over all topics in the link, bypassing the normal failover state transitions. Use it only as a last resort, when topics are stuck in transitional states and you need immediate access to all replicated data.
====

== Failover states

=== Shadow link states

The shadow link itself has a simple state model:

* **`ACTIVE`**: Shadow link is operating normally, replicating data

Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.

=== Shadow topic states

Individual shadow topics progress through specific states during failover:

* **`ACTIVE`**: Normal replication state before failover
* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
* **`FAILING_OVER`**: Failover initiated, replication stopping
* **`FAILED_OVER`**: Failover completed successfully, topic fully writable

=== Monitor failover progress

Monitor failover progress using the status command:

[,bash]
----
rpk shadow status <shadow-link-name>
----

The output shows individual topic states and any issues encountered during the failover process.
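
To follow a failover as it progresses, you can poll the status command. A minimal approach with the standard `watch` utility:

[,bash]
----
# Refresh the shadow link status every 5 seconds until all topics report FAILED_OVER
watch -n 5 'rpk shadow status <shadow-link-name>'
----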

**Task states during monitoring:**

* **`ACTIVE`**: Task is operating normally and replicating data
* **`FAULTED`**: Task encountered an error and requires attention
* **`NOT_RUNNING`**: Task is not currently executing
* **`LINK_UNAVAILABLE`**: Task cannot communicate with the source cluster

== Post-failover cluster behavior

After successful failover, your shadow cluster exhibits the following characteristics:

**Topic accessibility:**

* Failed-over topics become fully writable and readable.
* Applications can produce and consume messages normally.
* All Kafka APIs are available for failed-over topics.
* Original offsets and timestamps are preserved.
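
As a quick smoke test, you can confirm that a failed-over topic accepts writes and serves reads. A minimal sketch, assuming a hypothetical topic named `orders`:

[,bash]
----
# Produce a test record to the failed-over topic (the record is read from stdin)
echo 'post-failover smoke test' | rpk topic produce orders

# Consume a single record to confirm reads work
rpk topic consume orders --num 1
----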

**Shadow link status:**

* The shadow link remains but stops replicating data.
* Link status shows topics in the `FAILED_OVER` state.
* You can safely delete the shadow link after successful failover.

**Operational limitations:**

* No automatic fallback mechanism to the original source cluster.
* Data transforms remain disabled until you manually re-enable them.
* Audit log history from the source cluster is not available (new audit logs begin immediately).

== Failover considerations and limitations

**Data consistency:**

* Some data loss may occur due to replication lag at the time of failover.
* Consumer group offsets are preserved, allowing applications to resume from their last committed position.
* In-flight transactions at the source cluster are not replicated and will be lost.

**Recovery point objective (RPO):**

The amount of potential data loss depends on replication lag when a disaster occurs. Monitor lag metrics to understand your effective RPO.

**Network partitions:**

If the source cluster becomes accessible again after failover, do not attempt to write to both clusters simultaneously. Doing so causes metadata to diverge and can introduce data inconsistencies.

**Testing requirements:**

Regularly test failover procedures in non-production environments to validate your disaster recovery processes and measure your recovery time objective (RTO).
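
One way to measure RTO is to time a planned failover in a test environment. A minimal sketch, assuming a hypothetical link named `test-dr-link` and a topic named `test-topic`:

[,bash]
----
# Time how long a single-topic failover takes to reach FAILED_OVER
start=$(date +%s)
rpk shadow failover test-dr-link --topic test-topic

until rpk shadow status test-dr-link | grep -q 'FAILED_OVER'; do
  sleep 1
done
echo "Failover completed in $(( $(date +%s) - start )) seconds"
----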

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/index.adoc (new file)

Lines changed: 78 additions & 0 deletions
@@ -0,0 +1,78 @@
= Disaster Recovery (Shadowing)
:description: Set up disaster recovery for Redpanda clusters using Shadowing for cross-region replication.
:page-aliases: deploy:redpanda/manual/resilience/shadowing.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The shadow cluster creates a dedicated client that continuously replicates source cluster data, including offsets, timestamps, and cluster metadata. The result is a read-only shadow cluster that you can quickly fail over to handle production traffic during a disaster.

include::shared:partial$emergency-shadowing-callout.adoc[]

Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.

Shadowing replicates:

* **Topic data**: All records with preserved offsets and timestamps
* **Topic configurations**: Partition counts, retention policies, and other xref:reference:properties/topic-properties.adoc[topic properties]
* **Consumer group offsets**: Enables seamless consumer resumption after failover
* **Access Control Lists (ACLs)**: User permissions and security policies
* **Schema Registry data**: Schema definitions and compatibility settings
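
Because replication is byte-for-byte, you can verify offset preservation by comparing per-partition watermarks on both clusters. A minimal sketch, assuming hypothetical `rpk` profiles named `source` and `shadow` and a topic named `orders`; check your `rpk` version for the exact `rpk topic describe` flags:

[,bash]
----
# Print per-partition details, including high watermarks, on the source cluster
rpk profile use source
rpk topic describe orders --print-partitions

# The same partitions on the shadow cluster should report matching watermarks
rpk profile use shadow
rpk topic describe orders --print-partitions
----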

== How Shadowing fits into disaster recovery

Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing's asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.

The architecture follows an active-passive pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can fail over the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.

Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your safety net for catastrophic regional disasters. While xref:whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers near real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.

// TODO: insert diagram. Possibly with a .gif animation showing cluster A being written and cluster B being replicated, with a data flow arrow and geo-separation. Diagram must show icons or labels for topics, configurations, offsets, ACLs, and schemas being copied.

== Limitations

Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link.

Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there is always some replication lag, and you cannot write to both clusters simultaneously.

xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after failover.

After you fail over shadow topics, automatic fallback to the original source cluster is not supported.

[CAUTION]
====
Do not modify synced topic properties on shadow topics. Any changes revert to the source topic's values.
====

== Get started

Choose your implementation approach:

* **xref:setup.adoc[Setup and Configuration]** - Initial shadow link configuration, authentication, and topic selection
* **xref:monitor.adoc[Monitoring and Operations]** - Health checks, lag monitoring, and operational procedures
* **xref:failover.adoc[Planned Failover]** - Controlled disaster recovery testing and migrations
* **xref:emergency-failover.adoc[Emergency Failover]** - Rapid disaster response procedures

== Disaster readiness checklist

Before a disaster occurs, ensure you have:

* [ ] Access to shadow cluster administrative credentials
* [ ] Shadow link names, configuration details, and network settings documented
* [ ] Application connection strings for the shadow cluster prepared
* [ ] Tested failover procedures in a non-production environment
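
A periodic spot-check from a machine with shadow cluster access is a simple way to rehearse the first two items. A minimal sketch, assuming a hypothetical `rpk` profile named `shadow-dr` and a link named `dr-link`:

[,bash]
----
# Confirm stored credentials still work and the link is configured and healthy
rpk profile use shadow-dr
rpk cluster health
rpk shadow describe dr-link
rpk shadow status dr-link
----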

== Next steps

After setting up Shadowing for your Redpanda clusters, consider these additional steps:

* **Test your disaster recovery procedures**: Regularly practice failover scenarios in a non-production environment. See xref:emergency-failover.adoc[Emergency Failover] for step-by-step emergency procedures.

* **Monitor shadow link health**: Set up alerting on the metrics described in xref:monitor.adoc[Monitoring and Operations] to ensure early detection of replication issues.

* **Implement automated failover**: Consider developing automation scripts that can detect outages and initiate failover based on predefined criteria, as in the sketch after this list.

* **Review security policies**: Ensure your ACL filters replicate the appropriate security settings for your disaster recovery environment.

* **Document your configuration**: Maintain up-to-date documentation of your shadow link configuration, including network settings, authentication details, and filter definitions.
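
A minimal sketch of the automation idea above, assuming a hypothetical link named `dr-link` and the default Admin API address on the source cluster. Production automation needs quorum-based outage detection so that a transient network blip does not trigger an irreversible failover:

[,bash]
----
#!/usr/bin/env bash
# Naive outage detector: probe the source cluster's Admin API readiness
# endpoint, and fail over the whole link after several consecutive failures.
SOURCE_ADMIN="http://source-cluster.example.com:9644"
LINK="dr-link"
FAILURES=0

while true; do
  if curl -sf --max-time 5 "$SOURCE_ADMIN/v1/status/ready" > /dev/null; then
    FAILURES=0
  else
    FAILURES=$((FAILURES + 1))
  fi
  if [ "$FAILURES" -ge 5 ]; then
    echo "Source cluster unreachable; failing over all topics on $LINK"
    rpk shadow failover "$LINK" --all
    break
  fi
  sleep 10
done
----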

modules/deploy/pages/redpanda/manual/resilience/disaster-recovery/monitor.adoc (new file)

Lines changed: 115 additions & 0 deletions
@@ -0,0 +1,115 @@
= Monitor Shadowing
:description: Monitor Redpanda Disaster Recovery (Shadowing) health with status commands, metrics, and best practices for tracking replication performance.
:page-categories: Management, Monitoring, Disaster Recovery

include::shared:partial$enterprise-feature.adoc[]

Monitor your shadow links to ensure proper replication performance and understand your disaster recovery readiness. Use `rpk` commands, metrics, and status information to track shadow link health and troubleshoot issues.

== Status commands

List existing shadow links:

[,bash]
----
rpk shadow list
----

View shadow link configuration details:

[,bash]
----
rpk shadow describe <shadow-link-name>
----

This command shows the complete configuration of the shadow link, including connection settings, filters, and synchronization options.

Check your shadow link status to ensure proper operation:

[,bash]
----
rpk shadow status <shadow-link-name>
----

For troubleshooting specific issues, you can use command options to show individual status sections. See the `rpk` reference for available status options.

The status output includes:

* **Shadow link state**: Overall operational state (`ACTIVE`)
* **Individual topic states**: Current state of each replicated topic (`ACTIVE`, `FAULTED`, `FAILING_OVER`, `FAILED_OVER`)
* **Task status**: Health of replication tasks across brokers (`ACTIVE`, `FAULTED`, `NOT_RUNNING`, `LINK_UNAVAILABLE`)
* **Lag information**: Replication lag per partition, comparing source and shadow watermarks

[[shadow-link-metrics]]
== Metrics

Shadowing provides comprehensive metrics to track replication performance and health:

[cols="1,1,2"]
|===
|Metric |Type |Description

|`redpanda_shadow_link_shadow_lag`
|Gauge
|The lag of the shadow partition behind the source partition, calculated as the source partition's last stable offset (LSO) minus the shadow partition's high watermark (HWM). Monitor by `shadow_link_name`, `topic`, and `partition` to understand replication lag for each partition.

|`redpanda_shadow_link_total_bytes_fetched`
|Count
|The total number of bytes fetched by a sharded replicator (bytes received by the client). Labeled by `shadow_link_name` and `shard` to track data transfer volume from the source cluster.

|`redpanda_shadow_link_total_bytes_written`
|Count
|The total number of bytes written by a sharded replicator (bytes written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor data written to the shadow cluster.

|`redpanda_shadow_link_client_errors`
|Count
|The number of errors seen by the client. Track by `shadow_link_name` and `shard` to identify connection or protocol issues between clusters.

|`redpanda_shadow_link_shadow_topic_state`
|Gauge
|Number of shadow topics in the respective states. Labeled by `shadow_link_name` and `state` to monitor topic state distribution across your shadow links.

|`redpanda_shadow_link_total_records_fetched`
|Count
|The total number of records fetched by the sharded replicator (records received by the client). Monitor by `shadow_link_name` and `shard` to track message throughput from the source.

|`redpanda_shadow_link_total_records_written`
|Count
|The total number of records written by a sharded replicator (records written to the `write_at_offset_stm`). Uses `shadow_link_name` and `shard` labels to monitor message throughput to the shadow cluster.
|===

See also: xref:reference:public-metrics-reference.adoc[]
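
Without a full monitoring stack, you can spot-check these metrics by scraping the public metrics endpoint directly. A minimal sketch, assuming the default Admin API address `localhost:9644`:

[,bash]
----
# Scrape only the Shadowing metrics from the public metrics endpoint
curl -s http://localhost:9644/public_metrics | grep '^redpanda_shadow_link'
----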

== Monitoring best practices

=== Health check procedures

Establish regular monitoring workflows to ensure shadow link health:

**Health checks:**
[,bash]
----
# Check all shadow links are active (tail skips the header row of the table output)
rpk shadow list | tail -n +2 | grep -v "ACTIVE" || echo "All shadow links healthy"

# Monitor lag for critical topics
rpk shadow status <shadow-link-name> | grep -E "LAG|Lag"
----

=== Alert thresholds

Configure monitoring alerts for:

* **High replication lag**: When `redpanda_shadow_link_shadow_lag` exceeds your RPO requirements
* **Connection errors**: When `redpanda_shadow_link_client_errors` increases rapidly
* **Topic state changes**: When topics move to the `FAULTED` state
* **Task failures**: When replication tasks enter `FAULTED` or `NOT_RUNNING` states
* **Link unavailability**: When tasks show `LINK_UNAVAILABLE`, indicating source cluster connectivity issues
* **Throughput drops**: When bytes or records fetched drop significantly
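
As a starting point for the topic-state and task-failure alerts, here is a cron-able sketch that exits non-zero when anything is faulted. The link name is hypothetical; adapt the pattern to your alerting tooling:

[,bash]
----
#!/usr/bin/env bash
# Exit 1 (and print an alert) if any topic or task is in a bad state
LINK="dr-link"
bad_states='FAULTED|NOT_RUNNING|LINK_UNAVAILABLE'

if rpk shadow status "$LINK" | grep -Eq "$bad_states"; then
  echo "ALERT: shadow link $LINK reports a faulted topic or task" >&2
  exit 1
fi
echo "Shadow link $LINK healthy"
----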
