
Apache Pegasus 2.4.0 Release Notes

Released by @acelyc111 on 31 Oct 16:24 · commit 8bdca79

Release Note

Apache Pegasus 2.4.0 is a feature release; the full change list is summarized here: #1032

New modules

Apache Pegasus has many ecosystem projects. In the past, they were maintained in separate repositories. Starting from this version, they will be gradually donated to the official Apache Pegasus repository. So far, the following projects have been donated to Apache Pegasus:

  • rDSN: in the past, rdsn was linked to Apache Pegasus as a Git submodule. It has now officially become the core module of Apache Pegasus.
  • PegasusClient: Pegasus supports multiple clients. The following client projects are now donated to Apache Pegasus: Pegasus-Java-Client, Pegasus-Scala-Client, Pegasus-Golang-Client, Pegasus-Python-Client, Pegasus-NodeJS-Client.
  • PegasusDocker: Pegasus supports building with Docker. This version provides official Dockerfile samples for various build environments and uses GitHub Actions to build the corresponding Docker images and upload them to Docker Hub.
  • PegasusShell: Pegasus used to provide shell tools written in C++. In the latest version, we have built new shell tools in Go: AdminCli for administrators and Pegic for users.

New architecture

In testing, we found that the single-queue shared-log engine causes a throughput bottleneck. Since SSDs handle concurrent random writes well, we removed the sequentially written shared-log and kept only the private-log as the WAL. Tests show this brings roughly a 15-20% performance improvement.

New features

Support replica count update dynamically

In the past, once a table was created, its replica count could not be changed. The new version supports dynamically changing a table's replica count: users can increase or decrease the replica count of a serving table, transparently to the applications using it.

New batchGet interface

The old batchGet interface was only a thin wrapper around the get interface and had no real batching capability. The new interface optimizes batch operations: it aggregates multiple requests according to the partition-hash rule and sends each aggregated request to the server node that serves the corresponding partition. This improves the throughput of batch reads. A conceptual sketch of the aggregation is shown below.
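
As an illustration only, the sketch below groups keys by a partition index before sending one request per partition. The helper names and the use of Go's standard CRC64 are assumptions; the real Pegasus clients use their own partition-hash function and RPC layer.

```go
package main

import (
	"fmt"
	"hash/crc64"
)

type keyPair struct {
	hashKey, sortKey []byte
}

// partitionIndex maps a hash key to a partition. Illustrative only: the real
// Pegasus clients compute the partition hash with their own CRC64 variant.
func partitionIndex(hashKey []byte, partitionCount int) int {
	h := crc64.Checksum(hashKey, crc64.MakeTable(crc64.ECMA))
	return int(h % uint64(partitionCount))
}

func main() {
	partitionCount := 8
	keys := []keyPair{
		{[]byte("user1"), []byte("profile")},
		{[]byte("user2"), []byte("profile")},
		{[]byte("user1"), []byte("settings")},
	}

	// Group the requested keys by partition so that each partition receives
	// one aggregated request instead of one RPC per key.
	groups := make(map[int][]keyPair)
	for _, k := range keys {
		idx := partitionIndex(k.hashKey, partitionCount)
		groups[idx] = append(groups[idx], k)
	}

	// One aggregated request per partition (stand-in for the real RPC call).
	for idx, batch := range groups {
		fmt.Printf("partition %d receives %d keys in one request\n", idx, len(batch))
	}
}
```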

Client request limiter

Burst requests from clients can pile up in the task queue. To avoid this, we added a queue controller that limits task traffic in extreme scenarios.

In the past, Pegasus only throttled write traffic. The new version also throttles read traffic, which enhances cluster stability in emergencies. A simplified sketch of the queue-controller idea follows.
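
A minimal sketch of the queue-controller idea, assuming a simple bounded task queue; this is not Pegasus' actual C++ implementation. Once the queue reaches its configured limit, further requests are rejected instead of piling up.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrQueueFull mimics rejecting a request once the task queue is too long.
var ErrQueueFull = errors.New("task queue is full, request rejected")

// taskQueue is an illustrative bounded queue: Enqueue fails fast instead of
// letting burst traffic pile up without limit.
type taskQueue struct {
	tasks chan func()
}

func newTaskQueue(limit int) *taskQueue {
	return &taskQueue{tasks: make(chan func(), limit)}
}

func (q *taskQueue) Enqueue(t func()) error {
	select {
	case q.tasks <- t:
		return nil
	default:
		return ErrQueueFull
	}
}

func main() {
	q := newTaskQueue(2)
	for i := 0; i < 4; i++ {
		i := i
		if err := q.Enqueue(func() { fmt.Println("ran task", i) }); err != nil {
			fmt.Println("task", i, "rejected:", err)
		}
	}
	// Drain and run the accepted tasks.
	close(q.tasks)
	for t := range q.tasks {
		t()
	}
}
```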

Jemalloc memory management

Jemalloc is an excellent memory management library. In the past, we only supported tcmalloc for memory management. The new version also supports jemalloc; see the detailed benchmark results in Jemalloc Performance.

Multi-architecture support

We added support for macOS and aarch64 systems, which improves Pegasus' cross-platform capability.

Client Feature

The Java client adds table creation and deletion interfaces, and supports batchGetByPartitions to match the server-side batchGet interface.

The Go client adapts to server-side RPC interfaces such as bulkload, compact, and disk-add.

AdminCli adds node-migrator, node-capacity-balance, table-migrator, table-partition-split and other commands.

Feature enhancement

Bulkload

Bulkload gains many performance optimizations, including using direct I/O for data download, repairing the duplicate check, and optimizing the ingest-task strategy, to reduce the impact of I/O load on request latency during bulkload. A sketch of the direct-I/O idea is shown below.
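
A minimal, Linux-only sketch of the direct-I/O technique mentioned above: open the file with O_DIRECT and read into a block-aligned buffer so the download bypasses the page cache. The file path and block size are assumptions; the actual bulkload download path is implemented in the server's C++ code.

```go
//go:build linux

package main

import (
	"fmt"
	"os"
	"syscall"
	"unsafe"
)

// O_DIRECT requires the buffer, file offset and read size to be block-aligned.
const blockSize = 4096 // assumed alignment

// alignedBuffer returns a slice whose start address is blockSize-aligned.
func alignedBuffer(size int) []byte {
	buf := make([]byte, size+blockSize)
	off := blockSize - int(uintptr(unsafe.Pointer(&buf[0]))%blockSize)
	return buf[off : off+size]
}

func main() {
	// Hypothetical downloaded SST file; replace with a real path to try it out.
	f, err := os.OpenFile("/tmp/part.sst", os.O_RDONLY|syscall.O_DIRECT, 0)
	if err != nil {
		fmt.Println("open:", err)
		return
	}
	defer f.Close()

	buf := alignedBuffer(blockSize)
	n, err := f.Read(buf) // bypasses the page cache, reducing interference with serving traffic
	fmt.Println("read", n, "bytes, err:", err)
}
```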

Bulkload also supports concurrent tasks on multiple tables. Note, however, that because of the low-level rate limit, concurrency only lets multiple tables queue their tasks; it does not improve overall task efficiency under the same rate limit.

Duplication

Duplication removes the dependence on remote file systems and supports checkpoint file transmission between clusters.

Duplication supports batch-sending of log files to improve the synchronization efficiency of incremental data.

With the new duplication, when a user creates a task, the server first copies the checkpoint files across clusters and then automatically synchronizes the incremental logs, greatly simplifying the previous process.

Other important changes

PerfCounter

In the monitoring system, we fixed CPU cache performance problems caused by false sharing and rebuilt the perf-counter (metrics) system.

ManualCompaction

We added a control interface for ManualCompaction, so that users can easily trigger a ManualCompaction task and query its progress in real time.

NFS in Learn

NFS is the module used for checkpoint transfer between nodes. In the past, the system was affected by checkpoint transfer. The new version provides disk-level fine-grained rate control to reduce the impact of checkpoint transfer; a simplified sketch follows.
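
A simplified sketch of per-disk rate limiting, using golang.org/x/time/rate as a stand-in token bucket. The disk names and the 64 MB/s budget are illustrative assumptions; the real NFS module implements this in C++ via the max_copy_rate_megabytes_per_disk option listed in the configuration changes below.

```go
package main

import (
	"context"
	"fmt"

	"golang.org/x/time/rate"
)

// diskLimiters holds one token bucket per data disk, so copying a checkpoint
// on one disk cannot exhaust the bandwidth budget of another.
type diskLimiters struct {
	perDisk map[string]*rate.Limiter
}

func newDiskLimiters(disks []string, bytesPerSec int) *diskLimiters {
	d := &diskLimiters{perDisk: make(map[string]*rate.Limiter)}
	for _, disk := range disks {
		// burst equal to one second of traffic, an arbitrary illustrative choice
		d.perDisk[disk] = rate.NewLimiter(rate.Limit(bytesPerSec), bytesPerSec)
	}
	return d
}

// copyChunk waits until the disk's budget allows sending n bytes.
func (d *diskLimiters) copyChunk(ctx context.Context, disk string, n int) error {
	return d.perDisk[disk].WaitN(ctx, n)
}

func main() {
	limiters := newDiskLimiters([]string{"/data1", "/data2"}, 64<<20) // 64 MB/s per disk
	for i := 0; i < 3; i++ {
		if err := limiters.copyChunk(context.Background(), "/data1", 16<<20); err != nil {
			fmt.Println("copy aborted:", err)
			return
		}
		fmt.Println("sent 16 MB chunk", i, "on /data1")
	}
}
```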

Link-tracking

The new link tracer supports uploading data to the monitoring system to obtain link latency statistics.

Environment Variables

We changed the deny_write environment variable: it can now also deny reads at the same time, and it returns different response information to the client.

Cold backup

Backup speed affects request latency. The new version provides dynamic configuration of the HDFS upload rate during backup.

RocksDB log size limit

RocksDB logs can sometimes take up a lot of space; the new version limits their size.

MetaServer

Supports host domain name (FQDN) configuration.

Bug fixes

In the latest version, we focused on fixing the following problems:

Server

  • Node crash caused by an ASIO thread-safety problem
  • IO amplification caused by improper handling of the RPC body
  • Data overflow caused by an unreasonable type declaration in the AIO module
  • Unexpected errors when a replica is closed

Client

  • The batchMultiGet interface of the Java client may not return complete data
  • The Go client cannot access the server when the request-drop configuration is enabled
  • The Go client cannot recover when it encounters errors such as ERR_INVALID_STATE

Performance

For this benchmark we used new machines; to keep the results comparable, we also re-ran Pegasus Server 2.3:

  • Machine parameters: DDR4 16G * 8 | Intel Silver 4210 * 2, 2.20GHz/3.20GHz | SSD 480G * 8 SATA
  • Cluster: 3 * MetaServer nodes + 5 * ReplicaServer nodes
  • YCSB Client: 3 * ClientNode
  • Request Length: 1KB(set/get)
Case       | Clients and threads    | R:W  | R-QPS   | R-Avg | R-P99 | W-QPS  | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -       | -     | -     | 56,953 | 787   | 1,786
Read Only  | 3 clients * 50 threads | 1:0  | 360,642 | 413   | 984   | -      | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 62,572  | 464   | 5,274 | 62,561 | 985   | 3,764
Read Write | 3 clients * 15 threads | 1:3  | 16,844  | 372   | 3,980 | 50,527 | 762   | 1,551
Read Write | 3 clients * 15 threads | 1:30 | 1,861   | 381   | 3,557 | 55,816 | 790   | 1,688
Read Write | 3 clients * 30 threads | 3:1  | 140,484 | 351   | 3,277 | 46,822 | 856   | 2,044
Read Write | 3 clients * 50 threads | 30:1 | 336,106 | 419   | 1,221 | 11,203 | 763   | 1,276

Known issues

We upgraded the ZooKeeper client to version 3.7. If the server-side ZooKeeper version is older than this, connections may time out.

When periodic manual-compaction tasks are configured via environment variables, a calculation error may cause the task to start immediately.

New features

  • Added dynamic modification of a table's replica count, allowing the replica count of a table to be changed at runtime
  • Support throttling of read operations
  • Support dynamically setting the queue length of different tasks
  • Support table-level read/write switches
  • Support jemalloc
  • Support the aarch64 platform
  • Support building on macOS

Java Client

  • Support the batchGetByPartitions interface, which packs get requests destined for the same partition to improve performance
  • Support the createApp interface for creating tables
  • Support the dropApp interface for dropping tables

Go Client

  • Support the Bulk Load control interfaces
  • Support the Manual Compact control interfaces
  • Support the disk-level data migration interface

Admin CLI

  • Support the Bulk Load control tools
  • Support the Manual Compact control tools
  • Support the Duplication control tools
  • Support the Partition Split control commands
  • Support control tools for node data migration, table migration, disk capacity balancing, etc.

Feature / performance optimizations

  • Removed the shared log and kept only the private log, simplifying the system architecture and improving performance
  • BulkLoad: several optimizations, including reducing the I/O load of downloading and ingesting files, improving the error-handling logic, and making the interfaces easier to use
  • Duplication: several optimizations, including migrating historical data by itself without relying on external file systems such as HDFS, batch-sending plog to improve performance, and making operations easier to use
  • Manual Compaction: richer query and control operations
  • Throttling: data migration, data backup and other features now also support rate limiting
  • The MetaServer list supports FQDN
  • Limit the size of RocksDB logs
  • Started implementing a new metrics framework (not enabled in this release)

Code refactoring

  • Merged the Pegasus sub-projects, such as rDSN, the client libraries in various languages, and the CLI access and control tools, into the main Pegasus repository
  • Removed the thrift-generated C++ and Java files

Bugfix

  • Fixed a crash under heavy traffic caused by multi-thread contention
  • Fixed disk and network traffic amplification caused by the message body size not being set
  • Fixed a crash when flushing a log larger than 2 GB
  • Fixed replica metadata loss on XFS file systems after a power failure
  • Fixed the RocksDB "Shutdown in progress" error logged when closing a replica
  • Fixed a crash caused by table names containing the "-" character when Prometheus is enabled
  • Fixed the RocksDB-related recent.flush.completed.count and recent.flush.output.bytes metrics not being updated
  • Fixed file names in logs being rewritten as compiler_depend.ts
  • Fixed a crash caused by a timeout during one-off backup
  • Fixed replicas starting normally without reporting an error when their data directories had become empty
  • Fixed the Python 3 client mishandling the str type

Infrastructure

  • Migrated the image repository to the apache/pegasus namespace on Docker Hub
  • Improved and fine-tuned the GitHub workflows, making CI more stable and time-saving

Performance tests

Test environment

  • Framework: YCSB
  • Server: DDR4 16G * 8, Intel Silver 4210 * 2, 2.20GHz/3.20GHz, SSD 480G * 8 SATA
  • OS: CentOS 7 (5.4.54-2.0.4.std7c.el7.x86_64)
  • Cluster: 3 * Meta Server + 5 * Replica Server
  • YCSB Client: 3 * ClientNode
  • Request Size: 1KB (set/get)

Test results

Case       | Clients and threads    | R:W  | R-QPS   | R-Avg | R-P99 | W-QPS  | W-Avg | W-P99
Write Only | 3 clients * 15 threads | 0:1  | -       | -     | -     | 56,953 | 787   | 1,786
Read Only  | 3 clients * 50 threads | 1:0  | 360,642 | 413   | 984   | -      | -     | -
Read Write | 3 clients * 30 threads | 1:1  | 62,572  | 464   | 5,274 | 62,561 | 985   | 3,764
Read Write | 3 clients * 15 threads | 1:3  | 16,844  | 372   | 3,980 | 50,527 | 762   | 1,551
Read Write | 3 clients * 15 threads | 1:30 | 1,861   | 381   | 3,557 | 55,816 | 790   | 1,688
Read Write | 3 clients * 30 threads | 3:1  | 140,484 | 351   | 3,277 | 46,822 | 856   | 2,044
Read Write | 3 clients * 50 threads | 30:1 | 336,106 | 419   | 1,221 | 11,203 | 763   | 1,276

Configuration changes

+ [pegasus.server]
+ rocksdb_max_log_file_size = 8388608
+ rocksdb_log_file_time_to_roll = 86400
+ rocksdb_keep_log_file_num = 32

+ [replication]
+ plog_force_flush = false
  
- mutation_2pc_min_replica_count = 2
+ mutation_2pc_min_replica_count = 0 # 0 means the value is based on the table's max replica count
  
+ enable_direct_io = false # whether to enable direct I/O when downloading files from HDFS, default false
+ direct_io_buffer_pages = 64 # number of pages for the direct I/O buffer, default 64 (recommended by our tests)
+ max_concurrent_manual_emergency_checkpointing_count = 10
  
+ enable_latency_tracer_report = false
+ latency_tracer_counter_name_prefix = trace_latency
  
+ hdfs_read_limit_rate_mb_per_sec = 200
+ hdfs_write_limit_rate_mb_per_sec = 200
  
+ duplicate_log_batch_bytes = 0 # 0 means no batching before sending
  
+ [nfs]
- max_copy_rate_megabytes = 500
+ max_copy_rate_megabytes_per_disk = 0
- max_send_rate_megabytes = 500
+ max_send_rate_megabytes_per_disk = 0
  
+ [meta_server]
+ max_reserved_dropped_replicas = 0
+ bulk_load_verify_before_ingest = false
+ bulk_load_node_max_ingesting_count = 4
+ bulk_load_node_min_disk_count = 1
+ enable_concurrent_bulk_load = false
+ max_allowed_replica_count = 5
+ min_allowed_replica_count = 1
  
+ [task.LPC_WRITE_REPLICATION_LOG_SHARED]
+ enable_trace = true # if true, the task's latency will be traced when global tracing is enabled

Contributors

@acelyc111
@cauchy1988
@empiredan
@foreverneverer
@GehaFearless
@GiantKing
@happydongyaoyao
@hycdong
@levy5307
@lidingshengHHU
@neverchanje
@padmejin
@Smityz
@totalo
@WHBANG
@xxmazha
@ZhongChaoqiang