|
| 1 | += Failover |
| 2 | +:description: Execute Redpanda Disaster Recovery (Shadowing) failover procedures to transform shadow topics into fully writable resources during disasters. |
| 3 | +:page-categories: Management, High Availability, Disaster Recovery |
| 4 | + |
| 5 | +include::shared:partial$enterprise-feature.adoc[] |
| 6 | + |
| 7 | +include::shared:partial$emergency-shadowing-callout.adoc[] |
| 8 | + |
| 9 | +Failover converts shadow topics, or an entire shadow cluster, from read-only replicas into fully writable resources and stops replication from the source cluster. You can fail over individual topics for selective workload migration or fail over the entire cluster for comprehensive disaster recovery. After failover, the shadow resources serve as production resources, so you can redirect application traffic when the source cluster becomes unavailable. |
| 10 | + |
| 11 | +== Failover behavior |
| 12 | + |
| 13 | +When you initiate failover, Redpanda performs the following operations: |
| 14 | + |
| 15 | +1. **Stops replication**: Halts all data fetching from the source cluster for the specified topics or entire shadow link |
| 16 | +2. **Converts topics**: Converts read-only shadow topics into regular, writable topics |
| 17 | +3. **Updates topic state**: Changes topic status from `ACTIVE` to `FAILING_OVER`, then `FAILED_OVER` |
| 18 | + |
| 19 | +Topic failover is irreversible. Once failed over, topics cannot return to shadow mode, and automatic fallback to the original source cluster is not supported. |
| 20 | + |
| 21 | +== Failover granularity options |
| 22 | + |
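| 23 | +You can perform failover at two levels of granularity: per topic or per shadow link. A force delete of the shadow link is also available as an emergency option: |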
| 24 | + |
| 25 | +**Individual topic failover:** |
| 26 | +[,bash] |
| 27 | +---- |
| 28 | +rpk shadow failover <shadow-link-name> --topic <topic-name> |
| 29 | +---- |
| 30 | + |
| 31 | +This command fails over only the specified shadow topic and leaves the other topics in the shadow link replicating. Use this approach when you need to selectively fail over specific workloads or when testing failover procedures. |
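To fail over several topics selectively, you can loop over the per-topic command shown above. The following is a minimal sketch, not an official procedure: the function name `failover_topics` and the link and topic names in the usage comment are hypothetical, and it uses only the `rpk shadow failover <shadow-link-name> --topic <topic-name>` form documented on this page.

```shell
#!/usr/bin/env bash
# Sketch: fail over a chosen subset of shadow topics, one request per topic.
# The function name and all link/topic names are illustrative placeholders.

failover_topics() {
  local link="$1"
  shift
  local failed=0
  local topic
  for topic in "$@"; do
    echo "Failing over topic: ${topic}"
    # Same command as documented above, issued once per topic.
    if ! rpk shadow failover "${link}" --topic "${topic}"; then
      echo "Failover request failed for topic: ${topic}" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only if every failover request was accepted.
  [ "${failed}" -eq 0 ]
}

# Example invocation (hypothetical names):
# failover_topics my-shadow-link orders payments
```

Because topic failover is irreversible, review the topic list before running a loop like this, and keep going after a failed request only if that matches your runbook.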
| 32 | + |
| 33 | +**Complete shadow link failover (cluster failover):** |
| 34 | +[,bash] |
| 35 | +---- |
| 36 | +rpk shadow failover <shadow-link-name> --all |
| 37 | +---- |
| 38 | + |
| 39 | +This failover applies to all shadow topics associated with the shadow link simultaneously, effectively failing over the entire cluster's replicated data. Use this approach during a complete regional disaster when you need to activate the entire shadow cluster as your new production environment. |
| 40 | + |
| 41 | +**Force delete shadow link (emergency failover):** |
| 42 | +[,bash] |
| 43 | +---- |
| 44 | +rpk shadow delete <shadow-link-name> --force |
| 45 | +---- |
| 46 | + |
| 47 | +[WARNING] |
| 48 | +==== |
| 49 | +Force deleting a shadow link is irreversible and immediately fails over all topics in the link, bypassing the normal failover state transitions. This action should only be used as a last resort when topics are stuck in transitional states and you need immediate access to all replicated data. |
| 50 | +==== |
| 51 | + |
| 52 | +== Failover states |
| 53 | + |
| 54 | +=== Shadow link states |
| 55 | + |
| 56 | +The shadow link itself has a simple state model: |
| 57 | + |
| 58 | +* **`ACTIVE`**: Shadow link is operating normally, replicating data |
| 59 | + |
| 60 | +Shadow links do not have dedicated failover states. Instead, the link's operational status is determined by the collective state of its shadow topics. |
| 61 | + |
| 62 | +=== Shadow topic states |
| 63 | + |
| 64 | +Individual shadow topics progress through specific states during failover: |
| 65 | + |
| 66 | +* **`ACTIVE`**: Normal replication state before failover |
| 67 | +* **`FAULTED`**: Shadow topic has encountered an error and is not replicating |
| 68 | +* **`FAILING_OVER`**: Failover initiated, replication stopping |
| 69 | +* **`FAILED_OVER`**: Failover completed successfully, topic fully writable |
| 70 | + |
| 71 | +=== Monitor failover progress |
| 72 | + |
| 73 | +Monitor failover progress using the status command: |
| 74 | + |
| 75 | +[,bash] |
| 76 | +---- |
| 77 | +rpk shadow status <shadow-link-name> |
| 78 | +---- |
| 79 | + |
| 80 | +The output shows individual topic states and any issues encountered during the failover process. |
| 81 | + |
| 82 | +**Task states during monitoring:** |
| 83 | + |
| 84 | +* **`ACTIVE`**: Task is operating normally and replicating data |
| 85 | +* **`FAULTED`**: Task encountered an error and requires attention |
| 86 | +* **`NOT_RUNNING`**: Task is not currently executing |
| 87 | +* **`LINK_UNAVAILABLE`**: Task cannot communicate with the source cluster |
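The monitoring step can be scripted as a polling loop. The sketch below assumes that the output of `rpk shadow status` contains the topic state names (`FAILING_OVER`, `FAILED_OVER`) as plain text; the exact output format is not specified here, so adjust the matching to what your version prints. The function name `wait_for_failover` and the default attempt count and interval are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: poll `rpk shadow status` until failover has finished.
# Assumption: state names appear verbatim in the status output.

wait_for_failover() {
  local link="$1"
  local max_attempts="${2:-30}"   # hypothetical default
  local interval="${3:-10}"       # seconds between polls, hypothetical default
  local attempt output
  for ((attempt = 1; attempt <= max_attempts; attempt++)); do
    if ! output="$(rpk shadow status "${link}")"; then
      echo "Status command failed (attempt ${attempt})." >&2
    elif ! grep -q "FAILING_OVER" <<<"${output}" \
        && grep -q "FAILED_OVER" <<<"${output}"; then
      # Nothing is still transitioning and at least one topic has failed over.
      echo "Failover complete."
      return 0
    fi
    sleep "${interval}"
  done
  echo "Timed out waiting for failover to complete." >&2
  return 1
}

# Example invocation (hypothetical link name):
# wait_for_failover my-shadow-link 30 10
```

The loop treats "no topic in `FAILING_OVER` and at least one in `FAILED_OVER`" as completion, so it also works for a single-topic failover where other topics remain `ACTIVE`.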
| 88 | + |
| 89 | +== Post-failover cluster behavior |
| 90 | + |
| 91 | +After successful failover, your shadow cluster exhibits the following characteristics: |
| 92 | + |
| 93 | +**Topic accessibility:** |
| 94 | + |
| 95 | +* Failed-over topics become fully writable and readable. |
| 96 | +* Applications can produce and consume messages normally. |
| 97 | +* All Kafka APIs are available for failed-over topics. |
| 98 | +* Original offsets and timestamps are preserved. |
| 99 | + |
| 100 | +**Shadow link status:** |
| 101 | + |
| 102 | +* The shadow link remains but stops replicating data. |
| 103 | +* Link status shows topics in `FAILED_OVER` state. |
| 104 | +* You can safely delete the shadow link after successful failover. |
| 105 | + |
| 106 | +**Operational limitations:** |
| 107 | + |
| 108 | +* No automatic fallback mechanism to the original source cluster. |
| 109 | +* Data transforms remain disabled until you manually re-enable them. |
| 110 | +* Audit log history from the source cluster is not available (new audit logs begin immediately). |
| 111 | + |
| 112 | +== Failover considerations and limitations |
| 113 | + |
| 114 | +**Data consistency:** |
| 115 | + |
| 116 | +* Some data loss may occur due to replication lag at the time of failover. |
| 117 | +* Consumer group offsets are preserved, allowing applications to resume from their last committed position. |
| 118 | +* In-flight transactions at the source cluster are not replicated and will be lost. |
| 119 | + |
| 120 | +**Recovery point objective (RPO):** |
| 121 | + |
| 122 | +The amount of potential data loss depends on replication lag when disaster occurs. Monitor lag metrics to understand your effective RPO. |
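As a rough illustration of how lag relates to RPO, you can convert a lag measured in messages into an approximate time window by dividing it by the produce rate. This is back-of-the-envelope arithmetic, not a Redpanda metric: the function name `estimate_rpo_seconds` is hypothetical, and both inputs are values you would take from your own monitoring.

```shell
#!/usr/bin/env bash
# Sketch: approximate RPO in seconds from replication lag and produce rate.
# Inputs are illustrative values supplied by the operator, not rpk output.

# Usage: estimate_rpo_seconds LAG_MESSAGES PRODUCE_RATE_PER_SECOND
estimate_rpo_seconds() {
  awk -v lag="$1" -v rate="$2" 'BEGIN {
    if (rate <= 0) {
      print "produce rate must be positive" > "/dev/stderr"
      exit 1
    }
    # Seconds of data at risk = messages behind / messages per second.
    printf "%.1f\n", lag / rate
  }'
}

# Example: 12000 messages behind at 4000 messages/s is about 3 seconds.
# estimate_rpo_seconds 12000 4000   # prints 3.0
```

The estimate assumes a steady produce rate. Bursty workloads can lose more data than the average suggests, so treat the result as a lower bound on the exposure during peaks.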
| 123 | + |
| 124 | +**Network partitions:** |
| 125 | + |
| 126 | +If the source cluster becomes accessible again after failover, do not write to both clusters simultaneously. Doing so causes the clusters' metadata to diverge, which can lead to data inconsistencies. |
| 127 | + |
| 128 | +**Testing requirements:** |
| 129 | + |
| 130 | +Regularly test failover procedures in non-production environments to validate your disaster recovery processes and measure RTO. |