Skip to content

feat(raft): Implement the gRPC version of Raft Network#238

Open
MciG-ggg wants to merge 6 commits intoarana-db:feat/raftfrom
MciG-ggg:feat/raft
Open

feat(raft): Implement the gRPC version of Raft Network#238
MciG-ggg wants to merge 6 commits intoarana-db:feat/raftfrom
MciG-ggg:feat/raft

Conversation

@MciG-ggg
Copy link

@MciG-ggg MciG-ggg commented Feb 8, 2026

  • Use tonic as gRPC library to implement the raft network
  • Write a script scripts/start_node_cluster.sh which can start specific numbers of nodes to build a cluster. Due to the config is not load to the app, the script has to move the kiwi excutable file to several directories to start nodes. This should change when the config file is load to the app.

@coderabbitai
Copy link

coderabbitai bot commented Feb 8, 2026

Important

Review skipped

Auto reviews are disabled on base/target branches other than the default branch.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b4c342bb-e730-4d1e-820d-fffe6d075c2f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Tip

CodeRabbit can use Trivy to scan for security misconfigurations and secrets in Infrastructure as Code files.

Add a .trivyignore file to your project to customize which findings Trivy reports.

@MciG-ggg MciG-ggg changed the title Feat/raft Implement the gRPC version of Raft Network feat(raft): Implement the gRPC version of Raft Network Feb 8, 2026
@AlexStocks AlexStocks requested review from Copilot and guozhihao-224 and removed request for Copilot March 7, 2026 05:37
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR migrates Kiwi’s Raft networking layer from the previous HTTP/Actix approach to a tonic-based gRPC implementation, adds the protobuf API + generated code plumbing, and introduces a helper script to spin up a local multi-node cluster.

Changes:

  • Replace Raft node-to-node RPC transport with tonic gRPC (client + server) and add reflection support.
  • Add protobuf definitions and build-time code generation for Raft core/admin/client/metrics APIs.
  • Add a scripts/start_node_cluster.sh utility to start and initialize a local cluster via grpcurl.

Reviewed changes

Copilot reviewed 22 out of 23 changed files in this pull request and generated 16 comments.

Show a summary per file
File Description
src/server/src/main.rs Starts a tonic gRPC server for Raft services instead of the old HTTP server.
src/server/Cargo.toml Adds tonic + reflection deps (currently includes some unused direct deps).
src/raft/src/state_machine.rs Silences unused fields/params.
src/raft/src/node.rs Exposes gRPC service construction and adjusts Raft timing defaults.
src/raft/src/network.rs Implements OpenRaft RaftNetwork over tonic gRPC.
src/raft/src/lib.rs Removes HTTP API module; adds gRPC/proto modules and descriptor set export.
src/raft/src/grpc/* Implements gRPC servers for core/admin/client/metrics APIs.
src/raft/src/conversion.rs Adds proto↔OpenRaft conversion helpers (with some critical correctness gaps).
src/raft/proto/*.proto Defines protobuf API surface for Raft core/admin/client/types.
src/raft/build.rs Generates tonic/prost code + descriptor set at build time.
src/raft/Cargo.toml Adds tonic/prost deps (still contains unused HTTP deps).
src/net/src/lib.rs Minor simplification of NetworkServer::new return handling.
src/conf/src/raft_type.rs Derive Default for Binlog.
src/conf/src/config.rs Adds use_grpc flag to raft config (currently unused).
scripts/start_node_cluster.sh New script to start/initialize a local cluster using grpcurl.
Cargo.lock Dependency graph updates for tonic/prost stack.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

heartbeat_interval: 500,
// 选举超时最小值:1500ms,应该 >= heartbeat_interval * 2
election_timeout_min: 1500,
// 选举超时最大值:3000ms
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

election timeout 从 500/1000ms 改成了 1500/3000ms,heartbeat 从 200ms 改成 500ms。没有说明为什么要改这些值。对于测试环境这会让 leader 选举慢很多(最慢 3 秒才能检测到 leader 失败)。如果是因为 gRPC 连接慢导致测试 flaky 才改的,那根本原因是连接复用问题(见 network.rs 的 comment),不应该靠拉大 timeout 来掩盖。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

相关参数需要使用配置文件的方式传入,最好不要写死

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

相关参数需要使用配置文件的方式传入,最好不要写死

通过config.toml传吗?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@MciG-ggg 嗯嗯

EOF

# Copy binary to node directory
cp "$BINARY_PATH" "$node_dir/kiwi"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

脚本通过把 kiwi 可执行文件 copy 到不同目录来启动多节点,依赖于每个节点在不同工作目录下读取本地配置文件。这个方案在配置系统完善之前只是临时 workaround,脚本注释也承认了这一点("This should change when the config file is loaded to the app")。建议脚本注释或 README_CLUSTER.md 里明确说明这是临时方案,不要让其他人以为这是正式的集群管理方式。另外,脚本没有检查 kiwi 可执行文件是否存在就直接 cp,会静默失败,缺少基本的错误检查(set -e 也没有加)。

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

没有一个README_CLUSTER.md, 添加一个还是直接在脚本注释

Copy link
Author

@MciG-ggg MciG-ggg Mar 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

脚本没有检查 kiwi 可执行文件是否存在就直接 cp,会静默失败

这个会先检查,然后编译二进制文件,如果编译失败就会退出了,应该不会有这种情况

.connect_timeout(std::time::Duration::from_secs(5))
.timeout(std::time::Duration::from_secs(30))
.connect_lazy();
let client = RaftCoreServiceClient::new(endpoint);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

new_client()connect_lazy() 创建 Channel,这步本身没问题,但问题是每次 new_client() 都创建一个新的 ChannelRaftCoreServiceClient,没有在 KiwiNetworkFactory 或上层缓存。openraft 会在每次需要给某个节点发 RPC 时调用 new_client(),意味着对同一个目标节点每次都建新 Channel。tonic 的 connect_lazy() 虽然延迟连接,但 Channel 对象本身是应该复用的(它内部管理连接池)。建议在 KiwiNetworkFactory 里用 HashMap<NodeId, RaftCoreServiceClient<Channel>> 缓存已建立的 client,避免重复创建。

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这里 new_client 作为工厂函数,在内部应该就自动维护了长连接了,可以确认一下

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这个现在应该实现了

@github-actions github-actions bot added the 📒 Documentation Improvements or additions to documentation label Mar 8, 2026
# This is the 1st commit message:

fix: implement basic Raft node functionality
- Added scripts to start a 3-node Raft cluster
- Updated Cargo.lock with new dependencies
- Implemented basic API and node logic for Raft consensus algorithm

# This is the commit message arana-db#2:

feat: add gRPC support for Raft communication

- Introduced `use_grpc` configuration option in `RaftClusterConfig` to toggle between gRPC and HTTP for node communication.
- Updated dependencies in `Cargo.toml` to include `tonic` and `prost` for gRPC support.
- Created `build.rs` to compile protobuf definitions for Raft RPC services.
- Added `raft.proto` file defining the gRPC service and message types for Raft operations.
- Implemented `RaftServiceImpl` struct to handle gRPC requests for voting, appending entries, and installing snapshots.
- Modified `KiwiNetworkFactory` to manage gRPC clients instead of HTTP clients.
- Updated `RaftApp` to create a gRPC server for handling incoming requests.
- Refactored server initialization in `main.rs` to start the gRPC server instead of the HTTP server.

# This is the commit message arana-db#3:

feat: 添加类型转换模块和 gRPC 服务实现,支持 Raft 协议的 RPC 调用

# This is the commit message arana-db#4:

Refactor Raft gRPC services and implement core functionality

- Moved gRPC service implementations into separate modules: core, admin, and client.
- Implemented RaftCoreService for handling core Raft protocol operations (vote, append entries, etc.).
- Implemented RaftAdminService for cluster management operations (initialize, add learner, change membership, remove node).
- Implemented RaftClientService and RaftMetricsService for client data operations and metrics retrieval.
- Updated network layer to use the new gRPC service clients.
- Removed the old gRPC service implementation and adjusted the RaftApp to create all gRPC services.
- Enhanced error handling and logging throughout the gRPC implementations.
- Updated Cargo.toml to include necessary dependencies for gRPC support.

# This is the commit message arana-db#5:

feat: 添加 tonic-reflection 依赖并实现 gRPC 反射服务

# This is the commit message arana-db#6:

feat: 更新 Raft 协议相关代码,优化 gRPC 服务和类型转换实现

# This is the commit message arana-db#7:

feat: 更新 Raft 协议相关代码,添加默认实现和错误处理,优化 gRPC 服务

# This is the commit message arana-db#8:

fix(network): fix AppendEntriesResponse handling and add client caching

- Fix AppendEntriesResponse conversion to handle PartialSuccess/Conflict/HigherVote variants correctly
- Add KiwiNetworkFactory client caching to avoid recreating gRPC Channels
- Adjust Raft heartbeat interval (500ms->200ms) and election timeout (1500ms->500ms/1500ms)
- Fix client_write response handling logic

# This is the commit message arana-db#9:

fix(cluster): fix cluster startup script safety issues

- Use PID file to track processes, avoid killing unrelated kiwi processes
- Improve error handling with set -euo pipefail
- Fix data cleanup to only clean cluster directory, not project root data

# This is the commit message arana-db#10:

chore(deps): clean up unused dependencies

- Remove unused actix-web, reqwest, hyper, tower dependencies from raft and server
- Update Cargo.lock to remove transitive dependencies

BREAKING CHANGE: Remove HTTP server dependencies, unify on gRPC for communication

# This is the commit message arana-db#11:

refactor(config): simplify Raft configuration

- Remove unused use_grpc config field
- Add proto comment for AppendEntriesResponse extensibility

# This is the commit message arana-db#12:

docs(raft): add network connection caching optimization design doc

- Add KiwiNetwork connection caching optimization design document
- Explain difference between connection caching and connection pooling
- Detail the dual Network instance problem and optimization solution
let network = KiwiNetwork {
client: Arc::new(Mutex::new(client)),
};

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

地址格式解析失败时静默回退到无效地址 http://127.0.0.1:0,会导致连接悄无声息地失败。

改进建议:解析失败时返回错误,不要静默回退。

type Error = tonic::Status;

fn try_into(self) -> Result<Entry<KiwiTypeConfig>, Self::Error> {
let log_id = proto_to_log_id(&self.log_id.as_ref()).unwrap_or_default();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

log_id 缺失时 unwrap_or_default() 会静默返回一个默认的 LogId,可能导致条目被错误地关联到不存在的日志位置。

改进建议:缺失时返回错误而不是使用默认值。

let entries: Vec<proto::Entry> = req
.entries
.into_iter()
.filter_map(|e| e.try_into().ok())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

try_into() 失败时会静默丢弃条目,不记录日志也不报错,可能导致数据静默丢失。

改进建议:遍历时记录转换失败的条目,而不是用 filter_map 静默丢弃。

_request: Request<tonic::Streaming<AppendEntriesRequest>>,
) -> Result<Response<Self::StreamAppendStream>, Status> {
// TODO: 实现 pipeline 复制
Err(Status::unimplemented("StreamAppend not implemented yet"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

这是 Raft 流水线复制的核心 RPC,当前返回 unimplemented,会导致 openraft 无法进行流水线优化。

改进建议:实现 StreamAppend RPC 或明确标注为暂不支持的 feature。

};
Ok(TonicResponse::new(WriteResponse {
response: Some(proto_response),
log_id: log_id_to_proto(None), // TODO: 返回实际 LogId
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

客户端无法获知写入的日志位置,破坏了 Raft 客户端的标准工作流程。

改进建议:返回实际提交的 log_id,而非始终为 None

Copy link
Contributor

@AlexStocks AlexStocks left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

从 actix HTTP 切换到 tonic gRPC 的方向对,但代码里有几个比较严重的问题,详见行内评论。

.connect_lazy();
let client = RaftCoreServiceClient::new(endpoint);

let network = KiwiNetwork {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

KiwiNetwork 实现了 Clone,但底层 client 字段是 Arc<Mutex<RaftCoreServiceClient>>。如果同一个 KiwiNetwork 被 clone 后在多个地方使用,&mut self 的 RaftNetwork 方法会通过两个 clone 出来的副本同时被调用——这在 Rust 里是 data race,是 unsound 的。

RaftNetwork trait 的方法签名是 async fn append_entries(&mut self, ...) ,设计上应该每个 target node 一个独立的 KiwiNetwork 实例。clone 出来的副本如果被 shared 到多个线程,会造成并发 mutability 问题。建议:去掉 Clone derive,或者明确约束 KiwiNetwork 不能被 shared 跨线程使用。

}

// Get or create gRPC client for the target node
let addr = node.raft_addr.clone();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

如果 Endpoint::from_shared 解析地址失败(比如格式错误),fallback 到 "http://127.0.0.1:0" 会静默创建一个无效的客户端,后续 gRPC 调用会全部失败但没有任何错误提示。建议:直接返回 Err,不要 fallback。

}
OpenRaftAppendEntriesResponse::PartialSuccess(_) => {
// PartialSuccess 表示部分成功,对客户端来说算成功
Ok(Response::new(AppendEntriesResponse { success: true }))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

install_snapshot 的实现:消费完整个流然后返回 failed_precondition 错误。这个行为有误导性——客户端会以为服务不支持这个 RPC,但实际上它接收了数据只是丢弃了。如果将来实现了快照安装,这里很容易忘记删掉这个 workaround。建议:加一个 TODO 注释说明这是临时的,并加上已知问题的标记。

// OpenRaft → Proto
// 这些都应该返回 success=false,而不是 gRPC 错误
match raft_resp {
OpenRaftAppendEntriesResponse::Success => {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

stream_append 直接返回 UNIMPLEMENTED,但 openraft 在某些场景下会依赖 stream_append 做 pipeline 复制。如果这个 RPC 不工作,节点间日志复制会回退到同步方式,影响性能。建议:在 PR 描述里说明这个限制。

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

📒 Documentation Improvements or additions to documentation ✏️ Feature new feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants