Commit c4de635

Update experiences docs (apache#74)

acelyc111 authored Feb 4, 2024
1 parent 5afb2d2 commit c4de635

Showing 2 changed files with 131 additions and 58 deletions.
80 changes: 79 additions & 1 deletion _docs/en/administration/experiences.md
@@ -2,4 +2,82 @@
permalink: administration/experiences
---

The administration work of a distributed system includes periodic inspections, monitoring and alarms, troubleshooting, access auditing, etc., to help ensure service stability.

# Periodic inspection

* Availability: Availability normally stays at 100%; it may occasionally fall below 100% when node failures or other anomalies occur
* IOPS: A sudden increase in IOPS may affect service stability, while a sudden drop in traffic may indicate a service failure
* Read and write latency: Spikes in P99 read and/or write latency may affect Pegasus users
* System resource usage: Whether CPU, memory and disk usage, network bandwidth, and connection count have skyrocketed, and whether they have reached the high-water mark
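
For the resource checks above, a small cron-able sketch can flag disks near a high-water mark (the `85` threshold and the `/data/pegasus*` mount points are hypothetical examples, not Pegasus defaults):

```bash
#!/usr/bin/env bash
# Flag any data disk whose usage crosses an example 85% high-water mark.
# Adjust the mount-point pattern to your deployment.
df -P /data/pegasus* 2>/dev/null |
  awk 'NR > 1 { gsub("%", "", $5); if ($5 + 0 > 85) print $6 " is at " $5 "%" }'
```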

# Monitoring and alarms

Refer to [Monitoring](/administration/monitoring).

# Troubleshooting

Use the [Shell tools](/overview/shell) to check the status of Pegasus (see the inspection sketch after this list):
* Whether the basic information of the cluster is normal: `cluster_info`
  * Whether the `meta_servers` list is normal
  * Whether the value of `meta_function_level` is `steady`
* Whether each table and each partition is healthy: `ls -d`
  * Whether the table count is correct
  * Whether every table's `unhealthy` partition count is 0
* Whether each server is healthy: `nodes -d`
  * Whether all servers are in the list and their status is `ALIVE`
  * Whether the data distribution is severely skewed (i.e. the values in the `replica_count` or `primary_count` columns are imbalanced). If it is, it's recommended to use the shell command `set_meta_level` to set the cluster to `lively` in a time window with relatively low traffic, so that load balancing is performed. Remember to reset it to `steady` once the replicas are balanced
  > Note: For latency-sensitive users, load balancing should be performed only when necessary and must not affect service stability; closely observe the cluster status throughout the process
* Whether the basic information of each server is normal: `server_info`
  * Whether each server's version is correct
  * Check each server's _start time_ to determine whether a restart has occurred
* Whether the metrics of each server are normal: `server_stat`
  * IOPS and latency
  * Memory usage
* Whether the metrics of each table are normal: `app_stat`
  * IOPS
  * Disk usage
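
A minimal inspection session in the shell tool might look like the following (the commands are the ones referenced above; output is elided, and the comments describe what to look for):

```bash
cluster_info   # meta_servers list; meta_function_level should be 'steady'
ls -d          # per-table partition health; 'unhealthy' counts should be 0
nodes -d       # every server ALIVE; compare replica_count / primary_count for skew
server_info    # server versions and start times
server_stat    # IOPS, read/write latency, memory usage
app_stat       # per-table IOPS and disk usage
```

If rebalancing turns out to be needed, a hedged sequence (run in a low-traffic window) could be:

```bash
set_meta_level lively   # let the Meta Server start load balancing
# ...watch replica_count / primary_count until they even out...
set_meta_level steady   # restore the steady state afterwards
```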

Check the system information:
For example, check the count of socket connections on the server (where `34601` is the listening port of the Meta Server):
* Use the `netstat` command on the server where the Meta Server is deployed to check the count of connections:
```bash
netstat -na | grep '34601\>' | grep ESTABLISHED | wc -l
```
* Check the remote nodes that have established a connection with the server, sorted by the count of connections:
```bash
netstat -na | grep '34601\>' | grep ESTABLISHED | awk '{print $5}' | sed 's/:.*//' | sort | uniq -c | sort -k1 -n -r | head
```
* If there are too many connections (for example, if the number of connections from a single node exceeds 100), further analysis is needed to determine the cause.
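
On servers where `ss` is available as a `netstat` replacement, an equivalent count might be obtained as follows (the `-H` no-header flag assumes a reasonably recent iproute2):

```bash
# Count established connections on the Meta Server port (34601)
ss -Htn state established '( sport = :34601 )' | wc -l
```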

## Common troubleshooting methods

* If a service process exits abnormally, log in to the corresponding server to check the reason (see the sketch after this list):
  * Check the abnormal exit reason via `dmesg` or `/var/log/messages`
    * If it's `Out of memory: Killed process xxx`: check the memory usage monitoring of the Meta Server or Replica Server and analyze whether anything is abnormal
    * If it's `segfault at xxx`:
      * Check the standard error output logs and server logs of the Meta Server or Replica Server
      * Check whether a coredump file was generated, and use `gdb` to analyze it if there is one. If there is no coredump file, set the system and user `ulimit` as needed
* If many servers are faulty, consider using the `set_meta_level` command to set the cluster to the `freezed` state to avoid a service avalanche
* If a process keeps restarting (exiting abnormally and being pulled up again by a process supervision service), consider temporarily stopping the supervision service from automatically restarting the Pegasus process
* If remote login (such as `ssh`) to the server is not available, the physical server may be down; contact the service provider for assistance
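
A hedged sketch of the crash-analysis workflow described above (the binary and coredump paths are placeholders, not actual Pegasus install paths):

```bash
# Confirm the exit reason recorded by the kernel
dmesg -T | grep -E 'Out of memory|segfault' | tail

# Make sure a coredump can be written the next time the process crashes
ulimit -c unlimited

# If a coredump exists, print the crashing backtrace with gdb
gdb /path/to/pegasus_server /path/to/core.<pid> -ex 'bt' -ex 'quit'
```
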
# Auditing when users apply for the Pegasus service

Like most databases, Pegasus manages resources at the granularity of a _table_. As Pegasus administrators, when a user applies for a Pegasus table, it is necessary to understand the resources the table requires in order to allocate appropriate computing and storage resources. Considering Pegasus storage principles and optimizing the key-value schema design can also help ensure service stability.

The following information can be collected and analyzed:
* Table name
* Read operation peak (QPS)
* Total number of read operations (operations/day)
* Write operation peak (TPS)
* Total number of write operations (operations/day)
* Key-value design schema (to determine if there is a data skew issue)
* Read/write mode (to determine if there are read or write hotspot issues)
* Average size of each key-value (KB)
* Estimated total data usage (GB)
* Growth estimate (e.g. 6 months/1 year/3 years of growth)
* Read operation latency required (P99 latency)
* Write operation latency required (P99 latency)
* IOPS characteristics (such as balanced throughout the day, smooth with peaks and valleys, scheduled batch writes, etc.)
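
As a rough illustration of how these numbers translate into storage capacity, a back-of-the-envelope calculation might look like the following (all figures are hypothetical; the factor of 3 reflects Pegasus's default of one primary plus two secondary replicas per partition):

```bash
avg_kv_kb=1               # hypothetical average key-value size: 1 KB
total_records=2000000000  # hypothetical estimate: 2 billion records after 1 year
replica_factor=3          # Pegasus keeps 3 replicas per partition by default

# Raw data volume in GB, before replication
echo "$avg_kv_kb * $total_records / 1024 / 1024" | bc          # ~1907 GB

# On-disk footprint across the cluster, ignoring compression and space amplification
echo "$avg_kv_kb * $total_records * $replica_factor / 1024 / 1024" | bc   # ~5722 GB
```
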
109 changes: 52 additions & 57 deletions _docs/zh/administration/experiences.md
@@ -2,87 +2,82 @@
permalink: administration/experiences
---

The administration work of a distributed system includes periodic inspections, monitoring and alarms, troubleshooting, access auditing, etc., which help keep the service running stably.

# Periodic inspection

* Availability: Availability normally stays at 100%; it may occasionally fall below 100% when node failures or other anomalies occur
* IOPS: A sudden increase in traffic may affect service stability, while a sudden drop in traffic may indicate that the service has failed
* Read and write latency: The P99 latency of read and write operations may show abnormal spikes, which affects Pegasus users
* System resource usage: Whether CPU, memory and disk usage, network bandwidth, and connection count have skyrocketed or reached the high-water mark

# Monitoring and alarms

Refer to [Monitoring](/administration/monitoring).

# Troubleshooting

Use the [Shell tools](/overview/shell) to check the status of Pegasus:
* Whether the basic information of the cluster is normal: `cluster_info`
  * Whether the `meta_servers` list is normal
  * Whether `meta_function_level` is in the `steady` state
* Whether each table and each partition is healthy: `ls -d`
  * Whether the table count is normal
  * Whether every table's `unhealthy` partition count is 0
* Whether each node is healthy: `nodes -d`
  * Whether all nodes are in the list and their status is `ALIVE`
  * Whether the data distribution is severely skewed (i.e. the values in the `replica_count` or `primary_count` columns are imbalanced). If it is, you can pick a time window with relatively low traffic and use the shell command `set_meta_level` to set the cluster to `lively` so that load balancing is performed. Remember to set it back to `steady` after the adjustment completes
  > Note: For latency-sensitive users, load balancing should be performed only when necessary and must not affect service stability; closely observe the cluster status throughout the process
* Whether the basic information of each node is normal: `server_info`
  * Whether each server's version is correct
  * Check the start time to determine whether a restart has occurred
* Whether the metrics of each node are normal: `server_stat`
  * IOPS and read/write latency
  * Memory usage
* Whether the metrics of each table are normal: `app_stat`
  * IOPS
  * Disk usage

Check the system information:
For example, check the count of socket connections on the server (where `34601` is the listening port of the Meta Server):
* Use the `netstat` command on the server where the Meta Server is deployed to check the count of connections:
```bash
netstat -na | grep '34601\>' | grep ESTABLISHED | wc -l
```

* Check the remote nodes that have established connections with this server, sorted by connection count:
```bash
netstat -na | grep '34601\>' | grep ESTABLISHED | awk '{print $5}' | sed 's/:.*//' | sort | uniq -c | sort -k1 -n -r | head
```
* If there are too many connections (e.g. a single node has more than 100 connections), further analysis is needed to determine the cause.

## Common troubleshooting methods

* If a service process exits abnormally, log in to the corresponding server and check the reason:
  * Check the process exit reason via `dmesg` or `/var/log/messages`
    * If it is `Out of memory: Killed process xxx`: check the memory usage monitoring of the Meta Server or Replica Server and analyze whether anything is abnormal
    * If it is `segfault at xxx`:
      * Check the standard error output logs and server logs of the Meta Server or Replica Server
      * Check whether a coredump file was generated; if so, use `gdb` to analyze it. If there is no coredump file, set the system and user `ulimit` as needed
* If many servers are faulty, consider using `set_meta_level` to set the cluster to the `freezed` state to avoid a service avalanche
* If a process keeps restarting (exiting abnormally and being pulled up again by a process supervision service), consider temporarily stopping the supervision service from automatically restarting the Pegasus process
* If remote login (e.g. `ssh`) to the server is not available, the physical machine may be down; contact the service provider for assistance

# Pegasus service access auditing

Like most databases, Pegasus manages resources at the granularity of a _table_. As Pegasus administrators, when a table is onboarded, you need to understand the resources it requires in order to allocate appropriate computing and storage resources. Optimizing the key-value schema design based on Pegasus storage principles also helps keep the service stable.

The following information can be collected and analyzed:

* Table name
* Read peak (QPS)
* Total reads (operations/day)
* Write peak (TPS)
* Total writes (operations/day)
* Key-value schema design (to determine whether there is a data-skew issue)
* Access pattern (to determine whether there are read/write hotspot issues)
* Average size of a single key-value (KB)
* Estimated total data volume (GB)
* Growth estimate (e.g. growth over 6 months / 1 year / 3 years)
* Required read latency (P99)
* Required write latency (P99)
* IOPS characteristics (e.g. balanced throughout the day, smooth with peaks and valleys, scheduled batch writes, etc.)
