Merged
28 changes: 17 additions & 11 deletions modules/ROOT/nav.adoc
@@ -84,9 +84,6 @@
***** xref:deploy:redpanda/manual/production/production-deployment-automation.adoc[]
***** xref:deploy:redpanda/manual/production/production-deployment.adoc[]
***** xref:deploy:redpanda/manual/production/production-readiness.adoc[]
**** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
**** xref:deploy:redpanda/manual/resilience/shadowing.adoc[Shadowing]
**** xref:deploy:redpanda/manual/resilience/shadowing-guide.adoc[]
**** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
**** xref:deploy:redpanda/manual/sizing.adoc[Sizing Guidelines]
**** xref:deploy:redpanda/manual/linux-system-tuning.adoc[System Tuning]
@@ -179,9 +176,6 @@
*** xref:manage:tiered-storage.adoc[]
*** xref:manage:fast-commission-decommission.adoc[]
*** xref:manage:mountable-topics.adoc[]
*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
*** xref:manage:topic-recovery.adoc[Topic Recovery]
*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
** xref:manage:iceberg/index.adoc[Iceberg]
*** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
*** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]
@@ -199,6 +193,22 @@
*** xref:manage:schema-reg/schema-reg-authorization.adoc[Schema Registry Authorization]
*** xref:manage:schema-reg/schema-id-validation.adoc[]
*** xref:console:ui/schema-reg.adoc[Manage in Redpanda Console]
** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
** xref:deploy:redpanda/manual/disaster-recovery/index.adoc[Disaster Recovery]
*** xref:deploy:redpanda/manual/disaster-recovery/shadowing/index.adoc[Shadowing]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/how-shadowing-works.adoc[Concept]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/setup.adoc[Set Up]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/monitor.adoc[Monitor]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover.adoc[Failover Concept]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/setup-failover.adoc[Set Up Failover]
**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/disaster-response.adoc[Disaster Response Procedures]
*** xref:deploy:redpanda/manual/disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore]
*** xref:deploy:redpanda/manual/disaster-recovery/topic-recovery.adoc[Topic Recovery]
** xref:deploy:redpanda/manual/remote-read-replicas.adoc[Remote Read Replicas]
** xref:manage:recovery-mode.adoc[Recovery Mode]
** xref:manage:rack-awareness.adoc[Rack Awareness]
** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
** xref:manage:io-optimization.adoc[]
** xref:manage:console/index.adoc[Redpanda Console]
*** xref:console:config/configure-console.adoc[Configure Redpanda Console]
*** xref:console:config/enterprise-license.adoc[Add an Enterprise License]
@@ -212,12 +222,8 @@
*** xref:console:config/topic-documentation.adoc[Topic Documentation]
*** xref:console:config/analytics.adoc[Telemetry]
*** xref:console:config/kafka-connect.adoc[Kafka Connect]
** xref:manage:recovery-mode.adoc[Recovery Mode]
** xref:manage:rack-awareness.adoc[Rack Awareness]
** xref:manage:monitoring.adoc[]
** xref:manage:io-optimization.adoc[]
** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
** xref:manage:use-admin-api.adoc[Use the Admin API]
** xref:manage:monitoring.adoc[]
* xref:upgrade:index.adoc[Upgrade]
** xref:upgrade:rolling-upgrade.adoc[Upgrade Redpanda in Linux]
** xref:upgrade:k-rolling-upgrade.adoc[Upgrade Redpanda in Kubernetes]
2 changes: 1 addition & 1 deletion modules/deploy/pages/console/linux/deploy.adoc
@@ -7,7 +7,7 @@ This page shows you how to deploy Redpanda Console on Linux using Docker or the

== Prerequisites

* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/index.adoc[].
* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/production/index.adoc[].
* Review the xref:deploy:console/linux/requirements.adoc[system requirements for Redpanda Console on Linux].

== Deploy with Docker
@@ -0,0 +1,5 @@
= Disaster Recovery
:description: Set up disaster recovery for Redpanda clusters using Shadowing for cross-region replication.
:env-linux: true
:page-layout: index
:page-categories: Management, High Availability, Disaster Recovery
@@ -1,20 +1,18 @@
= Shadowing Guide
= Disaster Response Procedures
:description: Step-by-step emergency guide for failing over Redpanda shadow links during disasters.
:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery, Emergency Response

[NOTE]
====
include::shared:partial$enterprise-license.adoc[]
====

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

// TODO: All command output examples in this guide need verification by running actual commands in test environment

[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:deploy:redpanda/manual/resilience/shadowing.adoc[]. Ensure you have completed the xref:deploy:redpanda/manual/resilience/shadowing.adoc#disaster-readiness-checklist[disaster readiness checklist] before an emergency occurs.
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the disaster readiness checklist in xref:./index.adoc#disaster-readiness-checklist[] before an emergency occurs.
====

== Emergency failover procedure
@@ -0,0 +1,98 @@
= Failover Concept
Contributor

Suggested change
= Failover Concept
= Failover

:description: Execute Redpanda Disaster Recovery (Shadowing) failover procedures to transform shadow topics into fully writable resources during disasters.
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-license.adoc[]

include::shared:partial$emergency-shadowing-callout.adoc[]

Failover is the process of modifying shadow topics or an entire shadow cluster from read-only replicas to fully writable resources, and ceasing replication from the source cluster. You can fail over individual topics for selective workload migration or fail over the entire cluster for comprehensive disaster recovery. This critical operation transforms your shadow resources into operational production assets, allowing you to redirect application traffic when the source cluster becomes unavailable.

== Failover behavior

When you initiate failover, Redpanda performs the following operations:

1. **Stops replication**: Halts all data fetching from the source cluster for the specified topics or entire shadow link
2. **Converts topics**: Converts read-only shadow topics into regular, writable topics
3. **Updates topic state**: Changes topic status from `ACTIVE` to `FAILING_OVER`, then `FAILED_OVER`

Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported.

== Failover granularity options

Redpanda supports failover at different levels of granularity to match your disaster recovery needs:

**Individual topic failover** applies only to specific shadow topics while leaving other topics in the shadow link still replicating. Use this approach when you need to selectively fail over specific workloads or when testing failover procedures.

**Complete shadow link failover (cluster failover)** applies to all shadow topics associated with the shadow link simultaneously, effectively failing over the entire cluster's replicated data. Use this approach during a complete regional disaster when you need to activate the entire shadow cluster as your new production environment.

**Force delete shadow link (emergency failover)** is an irreversible operation that immediately fails over all topics in the link, bypassing the normal failover state transitions. This action should only be used as a last resort when topics are stuck in transitional states and you need immediate access to all replicated data.

== Failover states

=== Shadow link states

The shadow link itself has a simple state model:

* **`ACTIVE`**: Shadow link is operating normally, replicating data

Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics.
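
Because the link's status is derived from its topics, a monitoring script might roll the per-topic states up into one summary. The following is a minimal sketch, not a Redpanda API: the function name, the input shape (a mapping of topic name to state string), and the summary labels are all assumptions made for illustration.

```python
from collections import Counter


def summarize_link(topic_states):
    """Summarize a shadow link from the states of its shadow topics.

    topic_states maps topic name -> state string (hypothetical input shape).
    The returned labels are illustrative, not values reported by Redpanda.
    """
    counts = Counter(topic_states.values())
    total = len(topic_states)
    if total == 0:
        return "no shadow topics"
    if counts["FAILED_OVER"] == total:
        return "fully failed over"
    if counts["FAILING_OVER"] or counts["FAILED_OVER"]:
        return "failover in progress or partial"
    if counts["FAULTED"]:
        return "replicating with faulted topics"
    return "replicating"


states = {"orders": "FAILED_OVER", "payments": "ACTIVE", "audit": "FAULTED"}
print(summarize_link(states))
```

Checking for failover states before faulted states is a design choice in this sketch: a link with any failed-over topic is no longer a pure replica, which is usually the more important fact to surface.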

=== Shadow topic states

Individual shadow topics progress through specific states during failover:

* **`ACTIVE`**: Normal replication state before failover
* **`FAULTED`**: Shadow topic has encountered an error and is not replicating
* **`FAILING_OVER`**: Failover initiated, replication stopping
* **`FAILED_OVER`**: Failover completed successfully, topic fully writable
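
The states above can be sketched as a small state machine. This is an illustrative model only, assuming the transitions described on this page: `ACTIVE` to `FAILING_OVER` to `FAILED_OVER`, with no way back. The `ACTIVE` to `FAULTED` transition is inferred from the state description, and what happens to a `FAULTED` topic afterward is not specified here, so it is left unmodeled.

```python
from enum import Enum


class ShadowTopicState(Enum):
    ACTIVE = "ACTIVE"
    FAULTED = "FAULTED"
    FAILING_OVER = "FAILING_OVER"
    FAILED_OVER = "FAILED_OVER"


# Transitions taken from the text. FAULTED has no outgoing transitions in this
# sketch because the page does not describe them (an assumption, not a rule).
ALLOWED = {
    ShadowTopicState.ACTIVE: {ShadowTopicState.FAILING_OVER, ShadowTopicState.FAULTED},
    ShadowTopicState.FAULTED: set(),
    ShadowTopicState.FAILING_OVER: {ShadowTopicState.FAILED_OVER},
    ShadowTopicState.FAILED_OVER: set(),  # failover is irreversible
}


def transition(current, target):
    """Return the target state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = ShadowTopicState.ACTIVE
state = transition(state, ShadowTopicState.FAILING_OVER)
state = transition(state, ShadowTopicState.FAILED_OVER)
print(state.value)
```

Attempting `transition(ShadowTopicState.FAILED_OVER, ShadowTopicState.ACTIVE)` raises `ValueError`, which mirrors the rule that a failed-over topic cannot return to shadow mode.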



== Post-failover cluster behavior

After successful failover, your shadow cluster exhibits the following characteristics:

**Topic accessibility:**

* Failed-over topics become fully writable and readable.
* Applications can produce and consume messages normally.
* All Kafka APIs are available for failed-over topics.
* Original offsets and timestamps are preserved.

**Shadow link status:**

* The shadow link remains but stops replicating data.
* Link status shows topics in `FAILED_OVER` state.
* You can safely delete the shadow link after successful failover.

**Operational limitations:**

* No automatic fallback mechanism to the original source cluster.
* Data transforms remain disabled until you manually re-enable them.
* Audit log history from the source cluster is not available (new audit logs begin immediately).

== Failover considerations and limitations

**Data consistency:**

* Some data loss may occur due to replication lag at the time of failover.
* Consumer group offsets are preserved, allowing applications to resume from their last committed position.
* In-flight transactions at the source cluster are not replicated and will be lost.

**Recovery point objective (RPO):**

The amount of potential data loss depends on replication lag when disaster occurs. Monitor lag metrics to understand your effective RPO.
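
As a rough illustration of how lag translates into exposure, the sketch below compares per-partition high watermarks on the source and shadow clusters. The function name and the input dictionaries are hypothetical; in practice the values would come from whatever lag metrics your monitoring exposes.

```python
def records_at_risk(source_high_watermarks, shadow_high_watermarks):
    """Return per-partition lag and the total number of unreplicated records.

    Both arguments map partition id -> high watermark (hypothetical inputs).
    A partition missing from the shadow side is treated as having a high
    watermark of 0, which is a simplification for this sketch.
    """
    lag = {}
    for partition, source_hwm in source_high_watermarks.items():
        shadow_hwm = shadow_high_watermarks.get(partition, 0)
        lag[partition] = max(source_hwm - shadow_hwm, 0)
    return lag, sum(lag.values())


source = {0: 1500, 1: 980, 2: 2210}
shadow = {0: 1490, 1: 980, 2: 2150}
per_partition, total = records_at_risk(source, shadow)
print(per_partition)  # {0: 10, 1: 0, 2: 60}
print(total)          # 70
```

If the source cluster failed at this instant, the 70 records not yet replicated would be the data lost, so the lag observed over time is a practical proxy for the effective RPO.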

**Network partitions:**

If the source cluster becomes accessible again after failover, do not write to both clusters simultaneously. Writing to both clusters causes their data and metadata to diverge, which leads to inconsistencies between them.

**Testing requirements:**

Regularly test failover procedures in non-production environments to validate your disaster recovery processes and measure RTO.

== Next steps

* **Execute failover**: For step-by-step failover procedures, see xref:./setup-failover.adoc[]
* **Emergency situations**: For rapid disaster response, see xref:./disaster-response.adoc[]
@@ -0,0 +1,77 @@
= Shadowing Concept
Contributor

Suggested change
= Shadowing Concept
= Shadowing

Contributor

+1

:description: Set up disaster recovery for Redpanda clusters using Shadowing for cross-region replication.
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery

include::shared:partial$enterprise-license.adoc[]

Shadowing is Redpanda's enterprise-grade disaster recovery solution that establishes asynchronous, offset-preserving replication between two distinct Redpanda clusters. The shadow cluster creates a dedicated client that continuously replicates source cluster data, including offsets, timestamps, and cluster metadata. The result is a read-only shadow cluster that you can quickly fail over to handle production traffic during a disaster.

include::shared:partial$emergency-shadowing-callout.adoc[]

Unlike traditional replication tools that re-produce messages, Shadowing copies data at the byte level, ensuring shadow topics contain identical copies of source topics with preserved offsets and timestamps.
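
To see why preserved offsets matter, consider a toy model of the two approaches. This sketch is purely illustrative and does not reflect Redpanda internals: logs are lists of `(offset, value)` pairs, and the function names are made up for the example.

```python
# Source log: offsets 0-99 were removed by retention, so the log starts at 100.
source_log = [(100, "a"), (101, "b"), (102, "c")]


def reproduce(records):
    """Re-producing appends to an empty target log, which assigns fresh offsets."""
    return [(new_offset, value) for new_offset, (_, value) in enumerate(records)]


def byte_level_copy(records):
    """An offset-preserving copy keeps each record's original offset."""
    return list(records)


def next_record(log, offset):
    """Return the first value at or after the given offset, or None."""
    return next((value for o, value in log if o >= offset), None)


committed_offset = 101  # consumer group position recorded on the source

print(next_record(byte_level_copy(source_log), committed_offset))  # b
print(next_record(reproduce(source_log), committed_offset))        # None
```

With the offset-preserving copy, the consumer's committed position still points at the right record. With re-produced messages, the same position no longer matches anything in the target log, so offsets would have to be translated before consumers could resume.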

Shadowing replicates:

* **Topic data**: All records with preserved offsets and timestamps
* **Topic configurations**: Partition counts, retention policies, and other xref:reference:properties/topic-properties.adoc[topic properties]
* **Consumer group offsets**: Enables seamless consumer resumption after failover
* **Access Control Lists (ACLs)**: User permissions and security policies
* **Schema Registry data**: Schema definitions and compatibility settings

== How Shadowing fits into disaster recovery

Shadowing addresses enterprise disaster recovery requirements driven by regulatory compliance and business continuity needs. Organizations typically want to minimize both recovery time objective (RTO) and recovery point objective (RPO), and Shadowing's asynchronous replication helps you achieve both goals by reducing data loss during regional outages and enabling rapid application recovery.

The architecture follows an active-passive pattern. The source cluster processes all production traffic while the shadow cluster remains in read-only mode, continuously receiving updates. If a disaster occurs, you can fail over the shadow topics using the Admin API or `rpk`, making them fully writable. At that point, you can redirect your applications to the shadow cluster, which becomes the new production cluster.
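
On the application side, redirecting traffic usually comes down to switching the bootstrap address that clients use. The sketch below shows one way to keep both addresses in configuration and choose between them. The class, function, and host names are hypothetical, and how the `failed_over` flag gets set (an operator toggle, a config push) is left to your environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterEndpoints:
    """Bootstrap addresses for both clusters (example values are placeholders)."""
    source: str
    shadow: str


def bootstrap_servers(endpoints, failed_over):
    """Pick the cluster that should receive production traffic."""
    return endpoints.shadow if failed_over else endpoints.source


endpoints = ClusterEndpoints(
    source="source.example.com:9092",
    shadow="shadow.example.com:9092",
)
print(bootstrap_servers(endpoints, failed_over=False))  # source.example.com:9092
print(bootstrap_servers(endpoints, failed_over=True))   # shadow.example.com:9092
```

Keeping the shadow address prepared ahead of time matches the active-passive pattern: clients never talk to both clusters at once, and the switch is a single configuration change.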

Shadowing complements Redpanda's existing availability and recovery capabilities. xref:deploy:redpanda/manual/high-availability.adoc[High availability] actively protects your day-to-day operations, handling reads and writes seamlessly during node or availability zone failures within a region. Shadowing is your safety net for catastrophic regional disasters. While xref:deploy:redpanda/manual/disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore] provides point-in-time recovery from xref:manage:tiered-storage.adoc[Tiered Storage], Shadowing delivers near real-time, cross-region replication for mission-critical applications that require rapid failover with minimal data loss.

// TODO: insert diagram. Possibly with a .gif animation showing cluster Cluster A being written and cluster B being replicated with a data flow arrow and geo-separation. Diagram must show Icons or labels for topics, configurations, offsets, ACLs, schemas that are being copied

== Limitations

Shadowing is designed for active-passive disaster recovery scenarios. Each shadow cluster can maintain only one shadow link.

Shadowing operates exclusively in asynchronous mode and doesn't support active-active configurations. This means there will always be some replication lag. You cannot write to both clusters simultaneously.

xref:develop:data-transforms/index.adoc[Data transforms] are disabled on shadow clusters while Shadowing is active. During a disaster, xref:manage:audit-logging.adoc[audit log] history from the source cluster is lost, though the shadow cluster begins generating new audit logs immediately after the failover.

After you fail over shadow topics, automatic fallback to the original source cluster is not supported.

[CAUTION]
====
Do not modify synced topic properties on shadow topics. These properties revert to source topic values.
====

== Setup and configuration

Choose your implementation approach:

* **xref:./setup.adoc[Setup and Configuration]** - Initial shadow configuration, authentication, and topic selection
* **xref:./monitor.adoc[Monitoring and Operations]** - Health checks, lag monitoring, and operational procedures
* **xref:./failover.adoc[Planned Failover]** - Controlled disaster recovery testing and migrations
* **xref:./disaster-response.adoc[Disaster response procedures]** - Step-by-step emergency failover during an active disaster

== Disaster readiness checklist

Before a disaster occurs, ensure you have:

* [ ] Access to shadow cluster administrative credentials
* [ ] Shadow link names, configuration details, and networking documented
* [ ] Application connection strings for the shadow cluster prepared
* [ ] Tested failover procedures in a non-production environment

== Next steps

After setting up Shadowing for your Redpanda clusters, consider these additional steps:

* **Test your disaster recovery procedures**: Regularly practice failover scenarios in a non-production environment. See xref:./disaster-response.adoc[] for step-by-step disaster procedures.

* **Monitor shadow link health**: Set up alerting on the metrics described in xref:./monitor.adoc[] to ensure early detection of replication issues.

* **Implement automated failover**: Consider developing automation scripts that can detect outages and initiate failover based on predefined criteria.

* **Review security policies**: Ensure your ACL filters replicate the appropriate security settings for your disaster recovery environment.

* **Document your configuration**: Maintain up-to-date documentation of your shadow link configuration, including network settings, authentication details, and filter definitions.
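
For the automated failover step above, the "predefined criteria" can be as simple as a run of consecutive failed health checks. The following is a minimal sketch of that decision logic only. The function name, the threshold, and the boolean health-check history are assumptions, and the sketch deliberately does not call any Redpanda command or API.

```python
def should_fail_over(health_checks, required_failures=3):
    """Return True when the most recent `required_failures` checks all failed.

    health_checks is a list of booleans, oldest first, where True means the
    source cluster responded (hypothetical input produced by your own probes).
    """
    if len(health_checks) < required_failures:
        return False
    return not any(health_checks[-required_failures:])


print(should_fail_over([True, False, False, False]))  # True
print(should_fail_over([False, True, False, False]))  # False
print(should_fail_over([False, False]))               # False
```

Requiring several consecutive failures guards against acting on a single transient network error. Because failover cannot be reversed, you may still want a human approval step between this decision and the actual failover.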
@@ -0,0 +1,6 @@
= Shadowing
:description: Set up disaster recovery for Redpanda clusters using Shadowing for cross-region replication.
:env-linux: true
:page-layout: index
:page-categories: Management, High Availability, Disaster Recovery
