Update scale-in-out docs (#67)
acelyc111 authored Jan 30, 2024
1 parent 194f47a commit 0aa87d9
Showing 3 changed files with 90 additions and 14 deletions.
75 changes: 74 additions & 1 deletion _docs/en/administration/scale-in-out.md
@@ -2,4 +2,77 @@
permalink: administration/scale-in-out
---

TRANSLATING
# Design goal

When the cluster's storage capacity is insufficient or its read/write throughput is too high, scale out by adding nodes; conversely, scale in by removing nodes.

> The scaling in and scaling out described in this document are for replica servers.

When scaling the cluster out or in, the following needs to be considered:
* Do not stop the Pegasus service
* Affect availability as little as possible
* Minimize unnecessary data transmission

# Scale out steps

The scale out steps are relatively simple:
* To add multiple servers, start the replica server process on each newly added server. After starting, the replica server actively contacts the meta server and joins the node list.
* When the meta level is `steady`, [load balancing](rebalance) is not performed. Therefore, the `nodes -d` command of the shell tool shows the newly added nodes in `ALIVE` status, but the count of replicas they serve is `0`.
* Run `set_meta_level lively` in the shell tool to start load balancing; the meta server will gradually migrate some replicas to the newly added nodes.
* Observe the replicas served by each node with the `nodes -d` command. Once the cluster reaches a balanced state, run `set_meta_level steady` to turn off load balancing; the scale-out is complete (see the example session below).
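
For reference, a complete scale-out session in the Pegasus shell might look like the sketch below. Run `nodes -d` repeatedly and switch back to `steady` only after the replica counts are roughly even across nodes:
```
>>> nodes -d
>>> set_meta_level lively
>>> nodes -d
>>> set_meta_level steady
```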

# Scale in steps

There are more factors to consider when scaling in compared to scaling out, mainly including:
* If multiple nodes need to be removed from the cluster, remove them one by one: wait until one node has been removed completely before removing the next, to avoid harming cluster availability and data integrity.
* If multiple nodes need to be removed, then while one node is being removed, the meta server should avoid assigning replicas to the other nodes that are about to be removed when it cures replicas. Otherwise those replicas have to be cured again when the remaining nodes are removed, resulting in unnecessary cross-node data transmission. The [black_list](/administration/rebalance#assign_secondary_black_list) is provided for this purpose.

> Note: when a node has been removed, its status on the meta server changes to `UNALIVE`. This may cause the proportion of `ALIVE` nodes to drop below the configured `node_live_percentage_threshold_for_update`, in which case the meta server automatically downgrades to the `freezed` state. In that state, all `reconfiguration` operations (i.e. replica reassignment) are blocked and the scale-in cannot proceed. So before scaling in, calculate whether this situation would occur; if so, modify the meta server configuration and set `node_live_percentage_threshold_for_update` low enough to ensure that the meta server does not downgrade to `freezed` during the scale-in.

## Recommended scale-in steps

* Calculate the proportion of `ALIVE` nodes after scaling in. If it would be lower than the configured `node_live_percentage_threshold_for_update`, use [remote commands](/administration/remote-commands) to lower the value sufficiently:
```
>>> remote_command -t meta-server meta.live_percentage $percentage
```
`percentage` is an integer with a value range of [0, 100].
* Use the shell command `set_meta_level` to set the cluster to `steady` mode, disabling [rebalance](rebalance) to avoid unnecessary replica migration:
```
>>> set_meta_level steady
```
* Use the shell tool to send a [remote command](remote-commands#meta-server) to the meta server to update `assign_secondary_black_list`:
```
>>> remote_command -t meta-server meta.lb.assign_secondary_black_list $address_list
```
`address_list` is the `ip:port` list of nodes to be removed, separated by commas.
* Use the shell tool to set `assign_delay_ms` to 10, so that after a node is removed its replicas are cured immediately on the remaining alive nodes:
```
>>> remote_command -t meta-server meta.lb.assign_delay_ms 10
```
* Remove the replica servers one by one (a consolidated example session is sketched after this list). The steps for removing a single replica server are:
  * Kill the replica server process that you want to remove.
  * Use the shell command `ls -d` to check the cluster status and wait for all partitions to recover to a healthy state (the unhealthy partition count of every table is 0).
  * Clean up the data on this node to free up disk space.
* Restart the meta server:
  * Restarting the meta server clears the records of the removed nodes (so they no longer appear in the `nodes -d` output of the shell tool) and resets the configuration items modified above.
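
Putting the steps together, a scale-in session might look like the following sketch. The two addresses and the threshold value are illustrative: a 5-node cluster reduced to 3 nodes keeps only 60% of its nodes `ALIVE`, so `meta.live_percentage` is lowered to 50 beforehand. Remove the nodes one at a time, and between removals wait until `ls -d` reports 0 unhealthy partitions for every table:
```
>>> remote_command -t meta-server meta.live_percentage 50
>>> set_meta_level steady
>>> remote_command -t meta-server meta.lb.assign_secondary_black_list 10.0.0.1:34801,10.0.0.2:34801
>>> remote_command -t meta-server meta.lb.assign_delay_ms 10
>>> ls -d
```
After the nodes have been removed and their data cleaned up, restart the meta server to clear their records and reset the modified configuration items.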

## Script

The above steps are implemented by the script [scripts/pegasus_offline_node_list.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_offline_node_list.sh).

> However, this script cannot be used directly, because it relies on the [minos deployment tool](https://github.com/XiaoMi/minos).

# Nodes migration

The nodes of a cluster can be migrated by first scaling out and then scaling in. To minimize unnecessary data transmission, the following steps are recommended (a minimal sketch of the shell commands follows the list):
* Scale out: add the new servers to the cluster, but do not perform [rebalance](/administration/rebalance) yet after they join.
* Scale in: remove the old servers through the [Scale in steps](#scale-in-steps) above.
* Perform [rebalance](/administration/rebalance).
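
A minimal sketch of the shell-side sequence, assuming the new replica servers have already started and joined the node list, and that `$old_nodes` is a placeholder for the comma-separated `ip:port` list of the servers being retired (the same format as `$address_list` above); the detailed per-node removal still follows the scale-in steps:
```
>>> set_meta_level steady
>>> remote_command -t meta-server meta.lb.assign_secondary_black_list $old_nodes
>>> set_meta_level lively
>>> set_meta_level steady
```
Keep the cluster in `steady` mode while the new servers join and the old ones are removed; switch to `lively` only for the final rebalance, then back to `steady` once the cluster is balanced.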

# Other configurations

* Limit the migration speed. This can be done by limiting the per-disk read and write bandwidth, avoiding the performance impact of high disk I/O throughput (a concrete example follows the commands below):
```
>>> remote_command -t replica-server nfs.max_send_rate_megabytes_per_disk $rate
>>> remote_command -t replica-server nfs.max_copy_rate_megabytes_per_disk $rate
```
The unit of `rate` is `MB/s`.
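
For example, to cap migration traffic at an assumed 50 MB/s per disk on the replica servers, both limits could be set like this:
```
>>> remote_command -t replica-server nfs.max_send_rate_megabytes_per_disk 50
>>> remote_command -t replica-server nfs.max_copy_rate_megabytes_per_disk 50
```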
25 changes: 14 additions & 11 deletions _docs/zh/administration/scale-in-out.md
@@ -6,26 +6,28 @@ permalink: administration/scale-in-out

When the cluster's storage capacity is insufficient or its read/write throughput is too high, scale out by adding nodes; conversely, scale in by removing nodes.

> The scaling in and scaling out described in this document are for replica servers.
When scaling out or scaling in, the following points need to be considered:
* Do not stop the service
* Do not stop the Pegasus service
* Affect availability as little as possible
* Minimize unnecessary data copying
* Minimize unnecessary data transmission

# Scale out steps

The scale-out steps are relatively simple:
* To add multiple servers, start the replica server process on these newly added servers. After starting, the replica server will actively contact the meta server and join the node list.
* When the meta level is `steady`, [load balancing](rebalance) is not performed, so the `nodes -d` command of the shell tool shows the new node in `ALIVE` status but serving 0 replicas.
* Run `set_meta_level lively` in the shell tool to start load balancing; the meta server will gradually migrate some replicas to the new node.
* Check the replicas served by each node with the shell tool's `nodes -d` command. After reaching a balanced state, run `set_meta_level steady` to turn off load balancing; the scale-out is complete.

# Scale in steps

Compared with scaling out, there are more factors to consider when scaling in, mainly:
* If multiple nodes need to be removed at the same time, remove them one by one, waiting until one has been removed completely before removing the next, to avoid affecting cluster availability and data safety
* If multiple nodes need to be removed at the same time, remove them one by one, waiting until one has been removed completely before removing the next, to avoid affecting cluster availability and data integrity
* If multiple nodes need to be removed at the same time, then while removing one node, try to avoid having the meta server assign replicas to the other nodes that are about to be removed when it cures replicas; otherwise those replicas have to be cured again when the other nodes are removed, causing unnecessary cross-node data copying. The [black_list](/administration/rebalance#assign_secondary_black_list) is provided to support this.

> Note: after a node goes offline, its status on the meta server becomes `UNALIVE`, which may cause the proportion of `ALIVE` nodes to fall below the configured `node_live_percentage_threshold_for_update`. In that case the meta server automatically downgrades to the `freezed` state, in which all `reconfiguration` operations (i.e. replica reassignment operations) cannot be performed and the scale-in cannot continue. So before scaling in, calculate whether this situation would occur; if so, first modify the meta server configuration and set `node_live_percentage_threshold_for_update` low enough to ensure that the meta server does not downgrade to `freezed` during the scale-in.
## Recommended scale-in steps

@@ -38,7 +40,7 @@ permalink: administration/scale-in-out
```
>>> set_meta_level steady
```
* Use the shell tool to send a [remote command](remote-commands#meta-server) to the meta server to set the black_list:
* Use the shell tool to send a [remote command](remote-commands#meta-server) to the meta server to update `assign_secondary_black_list`:
```
>>> remote_command -t meta-server meta.lb.assign_secondary_black_list $address_list
```
@@ -49,25 +51,26 @@ permalink: administration/scale-in-out
```
* Remove the replica servers one by one. The steps for removing a single replica server:
  * Kill the replica server process that you want to remove.
  * Use the shell `ls -d` command to check the cluster status and wait for all partitions to fully recover to health (the unhealthy count of every table is 0).
  * Use the shell tool's `ls -d` command to check the cluster status and wait for all partitions to fully recover to health (the unhealthy partition count of every table is 0).
  * Clean up the data on this node to free up disk space.
* Restart the meta server:
  * Restarting the meta server clears the records of the removed nodes (so they no longer appear in the shell tool's `nodes -d` output) and resets the configuration items modified above.

The above process can be automated; we provide the script [scripts/pegasus_offline_node_list.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_offline_node_list.sh). However, this script cannot be used directly, because it relies on the minos deployment tool to stop processes remotely. `pegasus_offline_node_list.sh` calls `pegasus_offline_node.sh`, so the minos_client_dir in both scripts needs to be changed. You can modify the minos-related parts of the scripts to work with your own deployment system. Contact us if you need help.
## Script

Note: when using the script, also make sure that the configuration parameter `node_live_percentage_threshold_for_update` is small enough (if necessary, upgrade the meta server first) to avoid putting the cluster into the freezed state.
The above process is implemented by the script [scripts/pegasus_offline_node_list.sh](https://github.com/apache/incubator-pegasus/blob/master/scripts/pegasus_offline_node_list.sh).
> However, this script cannot be used directly, because it relies on the [minos deployment tool](https://github.com/XiaoMi/minos).
# Nodes migration

Node migration of the cluster is achieved by first scaling out and then scaling in. To minimize repeated data copying, the following steps are recommended:
Node migration of the cluster is achieved by first scaling out and then scaling in. To minimize unnecessary data transmission, the following steps are recommended:
* Scale out first: add the new servers to the cluster, but do not perform [load balancing](/administration/rebalance#控制集群的负载均衡) yet after they join.
* Then scale in: remove the old servers through the [scale-in steps](#缩容流程) above.
* Perform [load balancing](/administration/rebalance#控制集群的负载均衡).

# Other configurations

* Migration rate limiting. The read or write speed per disk can be set to avoid the performance impact of high throughput.
* Migration rate limiting. The read/write bandwidth per disk can be limited to avoid the performance impact of high throughput.
```
>>> remote_command -t replica-server nfs.max_send_rate_megabytes_per_disk $rate
>>> remote_command -t replica-server nfs.max_copy_rate_megabytes_per_disk $rate
4 changes: 2 additions & 2 deletions _docs/zh/administration/whitelist.md
@@ -40,12 +40,12 @@ Pegasus's whitelist feature is used to prevent unexpected replica servers from joining the cluster.

Because scaling out requires the replica server to first communicate with the meta server, if the whitelist has not been updated at that point, the meta server will reject the new replica server's attempt to join the cluster.

Therefore, for a cluster with the whitelist enabled, the following steps need to be performed before the normal [scale-out steps](membership-change#扩容流程):
Therefore, for a cluster with the whitelist enabled, the following steps need to be performed before the normal [scale-out steps](/administration/scale-in-out#扩容流程):
1. Modify the meta server whitelist configuration to add the new replica servers
2. Restart the meta server

### Scale in

During the [scale-in steps](membership-change#缩容流程), the whitelist has no effect. The whitelist can also be updated at any time after the scale-in is complete.
During the [scale-in steps](/administration/scale-in-out#扩容流程), the whitelist has no effect. The whitelist can also be updated at any time after the scale-in is complete.

However, for safety, it is recommended to update the whitelist promptly: simply modify the meta server's whitelist configuration before the last step of the scale-in process, "restart the meta server".
