Merged
59 commits
e37c208
dr: adds shadowing docs
paulohtb6 Oct 8, 2025
f215318
adding links
paulohtb6 Oct 9, 2025
56429fc
enterprise feature notice
paulohtb6 Oct 9, 2025
2683e4e
modify how internal topics are handled
paulohtb6 Oct 11, 2025
8eb512f
change tags
paulohtb6 Oct 11, 2025
91c4fd7
update with the latest from core
paulohtb6 Oct 15, 2025
e36940c
expand on filtering
paulohtb6 Oct 15, 2025
1cc2d27
networking
paulohtb6 Oct 15, 2025
2440858
update example
paulohtb6 Oct 15, 2025
8786f88
add runbook
paulohtb6 Oct 15, 2025
a46c4c6
expand on monitoring
paulohtb6 Oct 15, 2025
9e08972
add runbook to nav
paulohtb6 Oct 15, 2025
ada54aa
fix lists
paulohtb6 Oct 15, 2025
501106b
Apply suggestion from @paulohtb6
paulohtb6 Oct 16, 2025
285a4f8
Apply suggestion from @paulohtb6
paulohtb6 Oct 16, 2025
1f7a58a
Update modules/ROOT/nav.adoc
paulohtb6 Oct 16, 2025
47f2164
fix lists
paulohtb6 Oct 16, 2025
cb6019b
Apply suggestions from code review
paulohtb6 Oct 16, 2025
2f4b6a1
rename file
paulohtb6 Oct 17, 2025
49ef78f
Apply suggestions from code review
paulohtb6 Oct 17, 2025
66cbbf7
Update modules/ROOT/nav.adoc
paulohtb6 Oct 17, 2025
447a6cb
add periods
paulohtb6 Oct 17, 2025
ec68d45
update shadowing guide
paulohtb6 Oct 17, 2025
2cdb4b4
review points
paulohtb6 Oct 17, 2025
4e01c64
move disaster readiness checklist
paulohtb6 Oct 17, 2025
7ee511a
periods
paulohtb6 Oct 17, 2025
9348c16
modify lists
paulohtb6 Oct 17, 2025
ad8ddd5
add force delete
paulohtb6 Oct 17, 2025
98b25c5
fix callout
paulohtb6 Oct 17, 2025
ede2676
code review
paulohtb6 Oct 17, 2025
89ee641
adjust wording
paulohtb6 Oct 17, 2025
a82a13f
reorganizes the filtering rules
paulohtb6 Oct 17, 2025
4ea9205
enable tls
paulohtb6 Oct 17, 2025
1167c6c
from promotion to failover
paulohtb6 Oct 17, 2025
3980368
removed paused state
paulohtb6 Oct 17, 2025
9b97a74
adress review points
paulohtb6 Oct 17, 2025
8009a2b
Apply suggestions from code review
paulohtb6 Oct 17, 2025
1613786
Apply suggestions from code review from PM
paulohtb6 Oct 24, 2025
54cbeb9
Code review: style changes
paulohtb6 Oct 24, 2025
54011ae
combine messages
paulohtb6 Oct 24, 2025
0ed2dab
add schema registry and new states
paulohtb6 Oct 24, 2025
2936799
updates replaceable vars and breaking changes
paulohtb6 Oct 24, 2025
1df10dd
add start_offset feature
paulohtb6 Oct 24, 2025
ebd52bd
simplify rpk steps
paulohtb6 Oct 24, 2025
f8d2c70
code review
paulohtb6 Oct 24, 2025
1efebd0
add outputs and todo to check them again
paulohtb6 Oct 24, 2025
8d88936
update client config fields
paulohtb6 Oct 24, 2025
2e822de
code review
paulohtb6 Oct 24, 2025
7924dab
add consumer groups
paulohtb6 Oct 24, 2025
aedbad9
add whats new
paulohtb6 Oct 24, 2025
82147e4
Shadowing new structure (#1409)
paulohtb6 Oct 28, 2025
73500f0
update license feature page
paulohtb6 Oct 31, 2025
16f8977
simplify limitations section
paulohtb6 Oct 31, 2025
a965721
add beta display version
paulohtb6 Oct 31, 2025
9725f69
beta updates
paulohtb6 Oct 31, 2025
cea5b42
Apply suggestions from code review
paulohtb6 Oct 31, 2025
321760d
Apply suggestions from code review
paulohtb6 Oct 31, 2025
49d5601
Rendering issues
paulohtb6 Oct 31, 2025
ae448a0
fix caption
paulohtb6 Oct 31, 2025
9 changes: 5 additions & 4 deletions antora.yml
@@ -1,14 +1,15 @@
name: ROOT
title: Self-Managed
version: 25.3
+display_version: '25.3 Beta'
start_page: home:index.adoc
prerelease: true
nav:
- modules/ROOT/nav.adoc
asciidoc:
attributes:
# Date of release in the format YYYY-MM-DD
-page-release-date: 2025-07-31
+page-release-date: 2025-10-31
# Only used in the main branch (latest version)
page-header-data:
order: 2
@@ -18,16 +19,16 @@ asciidoc:
# Fallback versions
# We try to fetch the latest versions from GitHub at build time
# --
-full-version: 25.2.1
+full-version: 25.3.1-rc2
latest-redpanda-tag: 'v25.2.1'
latest-console-tag: ''
latest-release-commit: ''
latest-operator-version: ''
operator-beta-tag: ''
helm-beta-tag: ''
latest-redpanda-helm-chart-version: ''
-redpanda-beta-version: '25.3.1-rc1'
-redpanda-beta-tag: 'v25.3.1-rc1'
+redpanda-beta-version: '25.3.1-rc2'
+redpanda-beta-tag: 'v25.3.1-rc2'
console-beta-version: ''
console-beta-tag: ''
# --
2 changes: 1 addition & 1 deletion local-antora-playbook.yml
@@ -1,6 +1,6 @@
site:
title: Redpanda Docs
-start_page: 25.2@ROOT:get-started:intro-to-events.adoc
+start_page: 25.3@ROOT:get-started:intro-to-events.adoc
url: http://localhost:5002
robots: disallow
keys:
25 changes: 16 additions & 9 deletions modules/ROOT/nav.adoc
@@ -84,7 +84,6 @@
***** xref:deploy:redpanda/manual/production/production-deployment-automation.adoc[]
***** xref:deploy:redpanda/manual/production/production-deployment.adoc[]
***** xref:deploy:redpanda/manual/production/production-readiness.adoc[]
-**** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
**** xref:deploy:redpanda/manual/sizing-use-cases.adoc[Sizing Use Cases]
**** xref:deploy:redpanda/manual/sizing.adoc[Sizing Guidelines]
**** xref:deploy:redpanda/manual/linux-system-tuning.adoc[System Tuning]
@@ -177,9 +176,6 @@
*** xref:manage:tiered-storage.adoc[]
*** xref:manage:fast-commission-decommission.adoc[]
*** xref:manage:mountable-topics.adoc[]
-*** xref:manage:remote-read-replicas.adoc[Remote Read Replicas]
-*** xref:manage:topic-recovery.adoc[Topic Recovery]
-*** xref:manage:whole-cluster-restore.adoc[Whole Cluster Restore]
** xref:manage:iceberg/index.adoc[Iceberg]
*** xref:manage:iceberg/about-iceberg-topics.adoc[About Iceberg Topics]
*** xref:manage:iceberg/specify-iceberg-schema.adoc[Specify Iceberg Schema]
@@ -197,6 +193,21 @@
*** xref:manage:schema-reg/schema-reg-authorization.adoc[Schema Registry Authorization]
*** xref:manage:schema-reg/schema-id-validation.adoc[]
*** xref:console:ui/schema-reg.adoc[Manage in Redpanda Console]
+** xref:deploy:redpanda/manual/high-availability.adoc[High Availability]
+** xref:deploy:redpanda/manual/disaster-recovery/index.adoc[Disaster Recovery]
+*** xref:deploy:redpanda/manual/disaster-recovery/shadowing/index.adoc[Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/overview.adoc[Overview]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/setup.adoc[Configure Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/monitor.adoc[Monitor Shadowing]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover.adoc[Configure Failover]
+**** xref:deploy:redpanda/manual/disaster-recovery/shadowing/failover-runbook.adoc[Failover Runbook]
+*** xref:deploy:redpanda/manual/disaster-recovery/whole-cluster-restore.adoc[Whole Cluster Restore]
+*** xref:deploy:redpanda/manual/disaster-recovery/topic-recovery.adoc[Topic Recovery]
+** xref:deploy:redpanda/manual/remote-read-replicas.adoc[Remote Read Replicas]
+** xref:manage:recovery-mode.adoc[Recovery Mode]
+** xref:manage:rack-awareness.adoc[Rack Awareness]
+** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
+** xref:manage:io-optimization.adoc[]
** xref:manage:console/index.adoc[Redpanda Console]
*** xref:console:config/configure-console.adoc[Configure Redpanda Console]
*** xref:console:config/enterprise-license.adoc[Add an Enterprise License]
@@ -210,12 +221,8 @@
*** xref:console:config/topic-documentation.adoc[Topic Documentation]
*** xref:console:config/analytics.adoc[Telemetry]
*** xref:console:config/kafka-connect.adoc[Kafka Connect]
-** xref:manage:recovery-mode.adoc[Recovery Mode]
-** xref:manage:rack-awareness.adoc[Rack Awareness]
-** xref:manage:monitoring.adoc[]
-** xref:manage:io-optimization.adoc[]
-** xref:manage:raft-group-reconfiguration.adoc[Raft Group Reconfiguration]
** xref:manage:use-admin-api.adoc[Use the Admin API]
+** xref:manage:monitoring.adoc[]
* xref:upgrade:index.adoc[Upgrade]
** xref:upgrade:rolling-upgrade.adoc[Upgrade Redpanda in Linux]
** xref:upgrade:k-rolling-upgrade.adoc[Upgrade Redpanda in Kubernetes]
2 changes: 1 addition & 1 deletion modules/deploy/pages/console/linux/deploy.adoc
@@ -7,7 +7,7 @@ This page shows you how to deploy Redpanda Console on Linux using Docker or the

== Prerequisites

-* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/index.adoc[].
+* You must have a running Redpanda or Kafka cluster available to connect to. Redpanda Console requires a cluster to function. For instructions on deploying a Redpanda cluster, see xref:deploy:redpanda/manual/production/index.adoc[].
* Review the xref:deploy:console/linux/requirements.adoc[system requirements for Redpanda Console on Linux].

== Deploy with Docker
@@ -0,0 +1,5 @@
= Disaster Recovery
:description: Learn about Shadowing with cross-region replication for disaster recovery.
:env-linux: true
:page-layout: index
:page-categories: Management, High Availability, Disaster Recovery
@@ -0,0 +1,274 @@
= Failover Runbook
:description: Step-by-step emergency guide for failing over Redpanda shadow links during disasters.
:page-aliases: deploy:redpanda/manual/resilience/shadowing-guide.adoc
:env-linux: true
:page-categories: Management, High Availability, Disaster Recovery, Emergency Response

include::shared:partial$enterprise-license.adoc[]

This guide provides step-by-step procedures for emergency failover when your primary Redpanda cluster becomes unavailable. Follow these procedures only during active disasters when immediate failover is required.

// TODO: All command output examples in this guide need verification by running actual commands in test environment

[IMPORTANT]
====
This is an emergency procedure. For planned failover testing or day-to-day shadow link management, see xref:./failover.adoc[]. Ensure you have completed the disaster readiness checklist in xref:./overview.adoc#disaster-readiness-checklist[] before an emergency occurs.

Review comment (Contributor): This overview link isn't rendering. Best to not use relative links.

====

== Emergency failover procedure

Follow these steps during an active disaster:

1. <<assess-situation,Assess the situation>>
2. <<verify-shadow-status,Verify shadow cluster status>>
3. <<document-state,Document current state>>
4. <<initiate-failover,Initiate failover>>
5. <<monitor-progress,Monitor failover progress>>
6. <<update-applications,Update application configuration>>
7. <<verify-functionality,Verify application functionality>>
8. <<cleanup-stabilize,Clean up and stabilize>>

[[assess-situation]]
=== Assess the situation

Confirm that failover is necessary:

[,bash]
----
# Check if the primary cluster is responding
rpk cluster info --brokers prod-cluster-1.example.com:9092,prod-cluster-2.example.com:9092

# If primary cluster is down, check shadow cluster health
rpk cluster info --brokers shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092
----

**Decision point**: If the primary cluster is responsive, consider whether failover is actually needed. Partial outages may not require full disaster recovery.

**Examples that require full failover:**

* Primary cluster is completely unreachable (network partition, regional outage)
* Multiple broker failures preventing writes to critical topics
* Data center failure affecting a majority of brokers
* Persistent authentication or authorization failures across the cluster

**Examples that may NOT require failover:**

* Single broker failure with sufficient replicas remaining
* Temporary network connectivity issues affecting some clients
* High latency or performance degradation (but cluster still functional)
* Non-critical topic or partition unavailability
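
The decision point above can be scripted as a first-pass triage. The following is a minimal sketch, not part of `rpk`: the `probe` and `triage` helper names are hypothetical, and it assumes the health-check command exits with a nonzero status when it cannot reach the cluster.

```shell
# Hypothetical helpers for illustration; they are not rpk commands.

# probe CMD [ARGS...]: run a health-check command and print "up" or "down"
# depending on its exit status.
probe() {
  if "$@" >/dev/null 2>&1; then
    echo up
  else
    echo down
  fi
}

# triage PRIMARY SHADOW: map the two probe results to a suggested next step.
triage() {
  case "$1/$2" in
    up/*)    echo "primary reachable: investigate before failing over" ;;
    down/up) echo "primary unreachable, shadow healthy: verify shadow link status next" ;;
    *)       echo "both clusters unreachable: check network path and credentials first" ;;
  esac
}
```

For example, `triage "$(probe rpk cluster info --brokers prod-cluster-1.example.com:9092)" "$(probe rpk cluster info --brokers shadow-cluster-1.example.com:9092)"` prints a suggested next step. Treat the result as a prompt for human judgment, not as an automatic failover trigger.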

[[verify-shadow-status]]
=== Verify shadow cluster status

Check the health of your shadow links:

[,bash]
----
# List all shadow links
rpk shadow list

# Check the configuration of your shadow link
rpk shadow describe <shadow-link-name>

# Check the status of your disaster recovery link
rpk shadow status <shadow-link-name>
----

Verify that the following conditions exist before proceeding with failover:

* Shadow link state should be `ACTIVE`.
* Topics should be in `ACTIVE` state (not `FAULTED`).
* Replication lag should be reasonable for your RPO requirements.

**Understanding replication lag:**

Use `rpk shadow status <shadow-link-name>` to check lag. The command reports lag as the difference in message count between the source and shadow partitions:

* **Acceptable lag examples**: 0-1000 messages for low-throughput topics, 0-10000 messages for high-throughput topics
* **Concerning lag examples**: Growing lag over 50,000 messages, or lag that continuously increases without recovering
* **Critical lag examples**: Lag exceeding your data loss tolerance (for example, if you can only afford to lose 1 minute of data, lag should represent less than 1 minute of typical message volume)
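
To compare a message-count lag against a time-based data loss tolerance, divide the lag by the topic's typical produce rate. The following is a minimal sketch, assuming you already know that rate (in messages per second) from your own metrics; `lag_seconds` and `within_rpo` are hypothetical helper names.

```shell
# Hypothetical helpers for illustration; they are not rpk commands.

# lag_seconds LAG RATE: estimate how many seconds of data LAG messages
# represent at a typical produce rate of RATE messages per second.
# The result is rounded up to the next whole second.
lag_seconds() {
  lag=$1
  rate=$2
  if [ "$rate" -le 0 ]; then
    echo "rate must be a positive integer" >&2
    return 1
  fi
  echo $(( (lag + rate - 1) / rate ))
}

# within_rpo LAG RATE RPO_SECONDS: succeed if the estimated data loss
# fits within the recovery point objective.
within_rpo() {
  secs=$(lag_seconds "$1" "$2") || return 1
  [ "$secs" -le "$3" ]
}
```

For example, a lag of 12,000 messages at 500 messages per second is about 24 seconds of data, which fits a 60-second RPO. A lag of 50,000 messages at the same rate is about 100 seconds, which does not.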

[[document-state]]
=== Document current state

Record the current lag and status before proceeding:

[,bash]
----
# Capture current status for post-mortem analysis
rpk shadow status <shadow-link-name> > failover-status-$(date +%Y%m%d-%H%M%S).log
----

// TODO: Verify this output format by running actual rpk shadow status command
Example output showing healthy replication before failover:
----
Shadow Link: <shadow-link-name>

Overview:
NAME <shadow-link-name>
UID <uid>
STATE ACTIVE

Tasks:
Name Broker_ID State Reason
<task-name> 1 ACTIVE
<task-name> 2 ACTIVE

Topics:
Name: <topic-name>, State: ACTIVE

Partition SRC_LSO SRC_HWM DST_HWM Lag
0 1234 1468 1456 12
1 2345 2579 2568 11
----

IMPORTANT: Note the replication lag to estimate potential data loss during failover.
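
To pull a single worst-case number out of the captured log, you can scan the partition table. The following is a minimal sketch that assumes the output layout shown in the example above (partition rows with five numeric columns, `Lag` last). That layout is itself still marked for verification, so check it against real output first. `max_lag` is a hypothetical helper name.

```shell
# Hypothetical helper for illustration; it is not an rpk command.

# max_lag: read `rpk shadow status` text on stdin and print the largest
# value in the Lag column. Only rows with exactly five numeric-looking
# fields (Partition, SRC_LSO, SRC_HWM, DST_HWM, Lag) are considered.
max_lag() {
  awk '
    NF == 5 && $1 ~ /^[0-9]+$/ && $5 ~ /^[0-9]+$/ {
      if ($5 + 0 > max) max = $5 + 0
    }
    END { print max + 0 }
  '
}
```

Usage: `max_lag < failover-status-<timestamp>.log`, where the file is the log captured in this step.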

[[initiate-failover]]
=== Initiate failover

A complete cluster failover is appropriate if the source cluster is no longer reachable:

[,bash]
----
# Fail over all topics in the shadow link
rpk shadow failover <shadow-link-name> --all
----

For selective topic failover (when only specific services are affected):

[,bash]
----
# Fail over individual topics, one command per topic
rpk shadow failover <shadow-link-name> --topic <topic-name-1>
rpk shadow failover <shadow-link-name> --topic <topic-name-2>
----

[[monitor-progress]]
=== Monitor failover progress

Track the failover process:

[,bash]
----
# Monitor status until all topics show FAILED_OVER
watch -n 5 "rpk shadow status <shadow-link-name>"

# Check detailed topic status and lag during emergency
rpk shadow status <shadow-link-name> --print-topic
----

// TODO: Verify this output format by running actual rpk shadow status command during failover
Example output during successful failover:
----
Shadow Link: <shadow-link-name>

Overview:
NAME <shadow-link-name>
UID <uid>
STATE ACTIVE

Tasks:
Name Broker_ID State Reason
<task-name> 1 ACTIVE
<task-name> 2 ACTIVE

Topics:
Name: <topic-name>, State: FAILED_OVER
Name: <topic-name>, State: FAILED_OVER
Name: <topic-name>, State: FAILING_OVER
----

**Wait for**: All critical topics to reach `FAILED_OVER` state before proceeding.
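
Instead of watching the output by eye, you can poll until no topic is still failing over. The following is a minimal sketch that assumes topic lines have the form `Name: <topic>, State: <STATE>` as in the example above, which is still marked for verification. `pending_topics`, `wait_for_failover`, and the `POLL_SECONDS` variable are hypothetical names.

```shell
# Hypothetical helpers for illustration; they are not rpk commands.

# pending_topics: read status text on stdin and print the names of topics
# whose state is anything other than FAILED_OVER.
pending_topics() {
  sed -n 's/^ *Name: \(.*\), State: \([A-Z_]*\) *$/\1 \2/p' |
    awk '$NF != "FAILED_OVER" { print $1 }'
}

# wait_for_failover CMD [ARGS...]: rerun the status command every
# POLL_SECONDS (default 5) until no topic is pending. Returns 1 if the
# status command itself fails, so an unreachable cluster is not mistaken
# for a completed failover.
wait_for_failover() {
  while :; do
    out=$("$@") || { echo "status command failed" >&2; return 1; }
    pending=$(printf '%s\n' "$out" | pending_topics)
    if [ -z "$pending" ]; then
      return 0
    fi
    echo "still failing over: $pending" >&2
    sleep "${POLL_SECONDS:-5}"
  done
}
```

Usage: `wait_for_failover rpk shadow status <shadow-link-name>`. The loop also returns success when the output contains no topic lines at all, so confirm the final status manually before redirecting applications.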

[[update-applications]]
=== Update application configuration

Redirect your applications to the shadow cluster by updating their connection strings to point to the shadow cluster brokers. If you use DNS-based service discovery, update the DNS records accordingly. Restart the applications so that they pick up the new connection settings, then verify connectivity from the application hosts to the shadow cluster.
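
For applications that read a Java-style client properties file, the connection string change can be sketched as follows. This is an illustration only: it assumes the file sets the standard Kafka client property `bootstrap.servers` on a single line, and `set_bootstrap` is a hypothetical helper name. Applications configured through environment variables, secrets managers, or DNS need a different approach.

```shell
# Hypothetical helper for illustration; it is not an rpk command.

# set_bootstrap FILE BROKERS: rewrite the bootstrap.servers line of a client
# properties file to point at BROKERS, keeping the original as FILE.bak.
set_bootstrap() {
  file=$1
  brokers=$2
  cp "$file" "$file.bak" || return 1
  awk -v b="$brokers" '
    /^bootstrap\.servers[ \t]*=/ { print "bootstrap.servers=" b; next }
    { print }
  ' "$file.bak" > "$file"
}
```

Usage: `set_bootstrap client.properties shadow-cluster-1.example.com:9092,shadow-cluster-2.example.com:9092`. The `.bak` copy preserves the original settings in case you later return to the source cluster.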

[[verify-functionality]]
=== Verify application functionality

Test critical application workflows:

[,bash]
----
# Verify applications can produce messages
rpk topic produce <topic-name> --brokers <shadow-cluster-address>:9092

# Verify applications can consume messages
rpk topic consume <topic-name> --brokers <shadow-cluster-address>:9092 --num 1
----

Test message production and consumption, consumer group functionality, and critical business workflows to ensure everything is working properly.

[[cleanup-stabilize]]
=== Clean up and stabilize

After all applications are running normally:

[,bash]
----
# Optional: Delete the shadow link (no longer needed)
rpk shadow delete <shadow-link-name>
----

Document the time of failover initiation and completion, applications affected and recovery times, data loss estimates based on replication lag, and issues encountered during failover.

== Troubleshoot common issues

=== Topics stuck in FAILING_OVER state

**Problem**: Topics remain in `FAILING_OVER` state for extended periods

**Solution**: Check shadow cluster logs for specific error messages and ensure sufficient cluster resources (CPU, memory, disk space) are available on the shadow cluster. Verify network connectivity between shadow cluster nodes and confirm that all shadow topic partitions have elected leaders and the controller partition is properly replicated with an active leader.

If topics remain stuck after you address these cluster health issues and you need immediate failover, you can force delete the shadow link to fail over all topics:

[,bash]
----
# Force delete the shadow link to fail over all topics
rpk shadow delete <shadow-link-name> --force
----

[WARNING]
====
Force deleting a shadow link immediately fails over all topics in the link. This action is irreversible and should only be used when topics are stuck and you need immediate access to all replicated data.
====

=== Topics in FAULTED state

**Problem**: Topics show `FAULTED` state and are not replicating

**Solution**: Check for authentication issues, network connectivity problems, or source cluster unavailability. Verify that the shadow link service account still has the required permissions on the source cluster. Review shadow cluster logs for specific error messages about the faulted topics.

=== Application connection failures

**Problem**: Applications cannot connect to the shadow cluster after failover

**Solution**: Verify shadow cluster broker endpoints are correct and check security group and firewall rules. Confirm authentication credentials are valid for the shadow cluster and test network connectivity from application hosts.

=== Consumer group offset issues

**Problem**: Consumers start from the beginning or from the wrong positions

**Solution**: Verify consumer group offsets were replicated (check your filters) and use `rpk group describe <group-name>` to check offset positions. If necessary, manually reset offsets to appropriate positions. See link:https://support.redpanda.com/hc/en-us/articles/23499121317399-How-to-manage-consumer-group-offsets-in-Redpanda[How to manage consumer group offsets in Redpanda^] for detailed reset procedures.

== Next steps

After successful failover, focus on recovery planning and process improvement. Begin by assessing the source cluster failure and determining whether to restore the original cluster or permanently promote the shadow cluster as your new primary.

**Immediate recovery planning:**

1. **Assess source cluster**: Determine root cause of the outage
2. **Plan recovery**: Decide whether to restore source cluster or promote shadow cluster permanently
3. **Data synchronization**: Plan how to synchronize any data produced during failover
4. **Fail forward**: Create a new shadow link with the failed-over shadow cluster as the source to maintain a DR cluster

**Process improvement:**

Review comment (Contributor): In general, here (and in other places in these files) there should be some introductory sentence before a list.

Review comment (Contributor): After identifying the cause and resolving the cluster failure, resume your regular disaster recovery planning tasks, which should include:

1. **Document the incident**: Record timeline, impact, and lessons learned
2. **Update runbooks**: Improve procedures based on what you learned
3. **Test regularly**: Schedule regular disaster recovery drills
4. **Review monitoring**: Ensure monitoring caught the issue appropriately