Update rolling docs (#71)
acelyc111 authored Feb 1, 2024
1 parent 64f1f37 commit a006ee5
Showing 7 changed files with 156 additions and 55 deletions.
2 changes: 1 addition & 1 deletion _data/en/translate.yml
@@ -48,7 +48,7 @@ title_deployment: "Deployment"
title_config: "Configurations"
title_rebalance: "Rebalance"
title_monitoring: "Monitoring"
title_rolling-update: "Rolling-Update"
title_rolling-update: "Rolling Restart and Upgrade"
title_scale-in-out: "Scale-in and Scale-out"
title_resource-management: "Resource Management"
title_cold-backup: "Cold Backup"
2 changes: 1 addition & 1 deletion _data/zh/translate.yml
@@ -48,7 +48,7 @@ title_deployment: "集群部署"
title_config: "配置说明"
title_rebalance: "负载均衡"
title_monitoring: "可视化监控"
title_rolling-update: "集群升级"
title_rolling-update: "集群重启和升级"
title_scale-in-out: "集群扩容缩容"
title_resource-management: "资源管理"
title_cold-backup: "冷备份"
109 changes: 108 additions & 1 deletion _docs/en/administration/rolling-update.md
@@ -2,4 +2,111 @@
permalink: administration/rolling-update
---

TRANSLATING
# Design goals

When upgrading the Pegasus server version or persistently modifying the configuration, the cluster has to be restarted. For distributed clusters, the commonly used restart method is a **Rolling Restart**, which restarts servers one by one without interrupting the cluster's service.

> The following document assumes that the number of replicas of tables in the Pegasus cluster is 3.

The primary goal of a cluster restart is to keep the service continuously available and minimize the impact on availability. During the restart process, the following factors can affect service availability:
* After a Replica Server process is killed, the replicas it serves cannot provide service:
* For primary replicas: since primary replicas serve client reads and writes directly, killing the process inevitably affects read and write operations, which cannot recover until the Meta Server assigns new primary replicas. The Meta Server tracks the liveness of Replica Servers through beacons, and the detection latency of the Failure Detector depends on the configuration parameter `fd_grace_seconds`, which defaults to 10 seconds. This means it can take up to 10 seconds for the Meta Server to learn that a Replica Server is down and then assign new primary replicas (see the configuration sketch at the end of this section).
* For secondary replicas: since secondary replicas do not serve reads, they theoretically have no impact on reads. Writes are affected, however, because the PacificA consistency protocol requires all replicas to be written successfully before a write can be committed. After the process is killed, the primary replica discovers during a write that the secondary replica has been lost and notifies the Meta Server to kick it out. After the _reconfiguration_ stage, the replica group consists of one primary and one secondary replica and continues to provide write service. Write operations still in flight during this switch may already have timed out on the client side even though the _reconciliation_ stage re-executes them, so there is some impact on availability. The impact is relatively small, though, because _reconfiguration_ is fast and usually completes within 1 second.
* Restarting a Meta Server: the impact on availability is almost negligible. A client fetches the routing information of each partition from the Meta Server only on first access and caches it locally, so it usually does not need to query the Meta Server again. A short interruption while a Meta Server restarts therefore has little effect on clients. However, since the Meta Server maintains beacons with the Replica Servers, avoid keeping a Meta Server process stopped for a long time, which could cause Replica Servers to be treated as disconnected.
* Restarting the Collector: restarting the Collector has no impact on availability. However, availability metrics are collected by the Collector, so the restart may slightly affect the metrics data.

Therefore, the following practices help maintain availability during a cluster restart:
* Restart only one process at a time, and restart the next process only after the previous one has restarted and fully recovered its service. The reasons:
* If the cluster has not returned to a fully healthy state after a process restart and some partitions still have only one primary and one secondary replica, killing another Replica Server process may leave partitions with only a primary replica, which cannot provide write service.
* Waiting for all partitions in the cluster to recover three replicas before restarting the next process also reduces the risk of data loss.
* Migrate replicas proactively before killing a process, rather than relying on passive migration after the Failure Detector notices the loss. The reason:
* Passive migration has to wait for the Failure Detector to detect that the Replica Server is lost, whereas proactive migration moves the primary replicas served by the server to other servers before the process is killed. This `reconfiguration` procedure is fast and typically completes within 1 second.
* Manually downgrade the secondary replicas served by the Replica Server before killing the process. The reason:
* This triggers `reconfiguration` proactively rather than passively on write failures, further reducing the impact on availability.
* Minimize the recovery workload during a process restart to shorten the restart time. The reason:
* A Replica Server needs to replay WAL logs to recover data when it restarts. If the process is killed directly, the amount of data to replay may be large. If the flush of memtables to disk is triggered proactively before the kill, the amount of data to replay during restart is greatly reduced and the restart is much faster, which also greatly shortens the time needed to restart the whole cluster.
* Minimize unnecessary data transmission between servers to avoid the availability impact of high CPU, network I/O, and disk I/O load during data transfer. The reason:
* After a Replica Server goes down, some partitions enter the `1 primary + 1 secondary` state. If the Meta Server immediately supplemented the missing replicas on other Replica Servers, it would cause a large amount of cross-server data transmission, increase CPU, network I/O, and disk I/O load, and affect cluster stability. Pegasus instead tolerates the `1 primary + 1 secondary` state for a period of time, which provides a maintenance window for the restarting Replica Server; only if the server does not recover within this window are the missing replicas replenished on other Replica Servers. This balances data integrity against cluster stability. The wait time can be configured through the parameter `replica_assign_delay_ms_for_dropouts`, which defaults to 5 minutes (see the sketch below).
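
For reference, both parameters mentioned above (`fd_grace_seconds` and `replica_assign_delay_ms_for_dropouts`) are ordinary `config.ini` entries. The sketch below is illustrative only: the section names and comment syntax are assumptions based on common Pegasus configuration layouts, so check them against the configuration file shipped with your version; the values are the defaults quoted in this document.
```ini
# Illustrative sketch only -- section names are assumptions; the values are
# the defaults mentioned in this document.
[replication]
    # how long the Meta Server waits without beacons before treating a
    # Replica Server as down (seconds)
    fd_grace_seconds = 10

[meta_server]
    # how long the Meta Server waits before supplementing the replicas of a
    # dropped Replica Server on other servers (milliseconds, i.e. 5 minutes)
    replica_assign_delay_ms_for_dropouts = 300000
```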

# Restart steps

## High availability restart steps

* If it is an upgrade, please prepare new server packages and configuration files first
* Use shell tools to set the meta level of the cluster to `steady`, which turns off [load balancing](rebalance) and avoids unnecessary replica migration:
```
>>> set_meta_level steady
```
* Use shell tools to set the maintenance window of a single Replica Server
```
>>> remote_command -t meta-server meta.lb.assign_delay_ms $value
```
`value` can be understood as the maintenance window of a single Replica Server, i.e., how long the Meta Server waits after detecting that a Replica Server is lost before it supplements its replicas on other servers. For example, set it to `3600000` (1 hour), as shown below.
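For example, to set the maintenance window to one hour, the invocation is simply the command above with a concrete value in milliseconds:
```
>>> remote_command -t meta-server meta.lb.assign_delay_ms 3600000
```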
* Restart the Replica Server processes one by one. Steps to restart a single Replica Server:
* Use shell tools to send [remote commands](remote-commands#meta-server) to Meta Server, temporarily disable `add_secondary` operations:
```
>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 0
```
* Use the `migrate_node` command to transfer all primary replicas on the Replica Server to other servers:
```bash
$ ./run.sh migrate_node -c $meta_list -n $node -t run
```
Use shell tools to check the replicas served by this node through the `nodes -d` command, and wait for the number of **primary** replicas to drop to 0. If it does not drop to 0 for a long time, execute the command again.
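If you prefer to wait in a loop rather than re-checking by hand, a rough sketch is shown below. It is only an illustration: it assumes the shell is driven non-interactively via `./run.sh shell --cluster $meta_list` from the Pegasus source directory and that the primary replica count is the 4th column of the `nodes -d` output; check the actual output of your shell version and adjust the field index accordingly.
```bash
# Rough sketch: wait until $node no longer serves any primary replica.
# Assumption: the primary replica count is the 4th column of `nodes -d` output.
while true; do
    primaries=$(echo "nodes -d" | ./run.sh shell --cluster $meta_list \
                | grep "$node" | awk '{print $4}')
    echo "primary replicas on $node: $primaries"
    [ "$primaries" = "0" ] && break
    sleep 1
done
```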
* Use the `downgrade_node` command to downgrade all secondary replicas on the Replica Server to `INACTIVE`:
```bash
$ ./run.sh downgrade_node -c $meta_list -n $node -t run
```
Use shell tools to check the replicas served by this node through the `nodes -d` command, and wait for the number of **secondary** replicas to drop to 0. If it does not drop to 0 for a long time, execute the command again (the polling sketch above can be reused here with the secondary-replica column).
* Use shell tools to send a remote command to the Replica Server to close all replicas and trigger flush operations:
```
>>> remote_command -l $node replica.kill_partition
```
Wait about 1 minute for the data to be flushed to disk.
* If it is an upgrade, replace the package and configuration file
* Restart the Replica Server process
* Use shell tools to send [remote commands](remote-commands#meta-server) to the Meta Server to re-enable `add_secondary` operations so that the missing replicas are supplemented quickly:
```
>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node 100
```
* Use the `ls -d` command of the shell tool to check the cluster status and wait for all partitions to fully recover health, for example with a polling loop like the following sketch
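This sketch is only an illustration: it assumes that the detailed listing printed by `ls -d` contains an `unhealthy_app_count` summary field; verify the actual output of your shell version and adjust the `grep` pattern if the field name differs.
```bash
# Rough sketch: wait until `ls -d` no longer reports unhealthy tables.
# Assumption: the output contains a summary field like "unhealthy_app_count : 0".
while true; do
    unhealthy=$(echo "ls -d" | ./run.sh shell --cluster $meta_list \
                | grep -o 'unhealthy_app_count *: *[0-9]*' | grep -o '[0-9]*$')
    echo "unhealthy_app_count: $unhealthy"
    [ "$unhealthy" = "0" ] && break
    sleep 5
done
```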
* Continue with the next Replica Server
* Restart the Meta Server processes one by one. Steps to restart a single Meta Server:
* If it is an upgrade, replace the package and configuration file
* Restart the Meta Server process
* Wait for more than 30 seconds to ensure the continuity of beacons between Meta Server and Replica Servers
* Continue with the next Meta Server
* Restart the Collector process:
* If it is an upgrade, replace the package and configuration file
* Restart the Collector process
* Reset configurations
* Reset the configurations modified in the above steps using shell tools:
```
>>> remote_command -t meta-server meta.lb.add_secondary_max_count_for_one_node DEFAULT
>>> remote_command -t meta-server meta.lb.assign_delay_ms DEFAULT
```

## Simplified restart steps

If the availability requirement is not high, the restart steps can be simplified as follows:
* If it is an upgrade, please prepare new server packages and configuration files first
* Use shell tools to set the meta level of the cluster to `steady`, which turns off [load balancing](rebalance) and avoids unnecessary replica migration:
```
>>> set_meta_level steady
```
* Restart the Replica Server processes one by one. Steps to restart a single Replica Server:
* If it is an upgrade, replace the package and configuration file
* Restart the Replica Server process
* Use the `ls -d` command of the shell tool to check the cluster status and wait for all partitions to fully recover health (the polling sketch in the high-availability restart steps can be reused here)
* Continue with the next Replica Server
* Restart the Meta Server processes one by one. Steps to restart a single Meta Server:
* If it is an upgrade, replace the package and configuration file
* Restart the Meta Server process
* Wait for more than 30 seconds to ensure the continuity of beacons between Meta Server and Replica Servers
* Continue with the next Meta Server
* Restart the Collector process:
* If it is an upgrade, replace the package and configuration file
* Restart the Collector process

# Restart script

For reference, a script based on [Minos](https://github.com/XiaoMi/minos) that implements the [high availability restart steps](#high-availability-restart-steps) is available: [scripts/pegasus_rolling_update.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_rolling_update.sh).
2 changes: 1 addition & 1 deletion _docs/en/administration/scale-in-out.md
@@ -2,7 +2,7 @@
permalink: administration/scale-in-out
---

# Design goal
# Design goals

When the storage capacity of the cluster is insufficient or the read/write throughput is too high, you need to scale out by adding more nodes. Conversely, you can scale in by reducing the number of nodes.

2 changes: 1 addition & 1 deletion _docs/zh/administration/administration.md
@@ -15,7 +15,7 @@ Pegasus does not merely provide a simple key-value storage interface; we have also, on top of the stable
If a machine suffers a permanent failure, you can also remove the bad node by following [Scale-in and Scale-out](scale-in-out).
If one of a machine's SSDs fails, you can remove the bad disk by following [Bad Disk Repair](bad-disk).

If you need to upgrade the cluster, see [Cluster Upgrade](rolling-update)
If you need to restart or upgrade the cluster, see [Cluster Restart](rolling-update)

While the cluster is running, you need to keep a close eye on resource usage (disk, memory, network) and make timely operational adjustments; see [Resource Management](resource-management)

2 changes: 1 addition & 1 deletion _docs/zh/administration/bad-disk.md
@@ -32,7 +32,7 @@ Pegasus supports a disk blacklist. If you want to take a disk offline, you first need to add it

## Restart the node

After you have marked the bad disks in the blacklist, you can restart the service process of the corresponding node individually by following [High-Availability Upgrade](rolling-update).
After you have marked the bad disks in the blacklist, you can restart the service process of the corresponding node individually by following [High-Availability Restart](rolling-update#高可用重启).
Normally you can see the following records in the server log, indicating that the blacklisted disks are indeed ignored:

```log