Commit 1f463ab

[docs] update readme
1 parent e352740 commit 1f463ab

File tree: 6 files changed (+145, −125 lines)


README.md

Lines changed: 18 additions & 17 deletions
@@ -1,4 +1,4 @@
-<h1 align="center"> Claw-R1: Empowering OpenClaw with <br> Advanced Agentic RL. </h1>
+<h1 align="center"> Claw-R1: The Data Foundation for <br> Agentic Reinforcement Learning </h1>
 
 <p align="center">
 <a href="https://agentr1.github.io/"><img src="https://img.shields.io/badge/Project-Home-orange.svg" alt="Project Home"></a>
@@ -13,37 +13,39 @@
 
 - **[2026.03.06]** 📖 **Claw-R1 Documentation Released.** Project page and documentation are now available at [Claw-R1 Project Page](https://agentr1.github.io/) and [Claw-R1 docs](https://agentr1.github.io/Claw-R1/).
 
-- **[2026.03.03]** 🚧 **Claw-R1 Project Init.** We are actively updating the framework. Stay tuned for more features and documentation.
+- **[2026.03.03]** 🚧 **Claw-R1 Project Init.** We are actively developing the framework. Stay tuned for more features and documentation.
 
 ## Overview
 
-**Agentic RL** has become the dominant approach for training powerful LLM agents. Meanwhile, **General Agents** (e.g., OpenClaw, Claude Code, Open Code) have emerged as game-changing systems that redefine what agents can do. Yet critical gaps remain:
+The **Agentic RL** ecosystem is thriving — frameworks like [verl](https://github.com/volcengine/verl), [Agent-R1](https://github.com/0russwest0/Agent-R1), and [MiniMax Forge](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) have made remarkable progress in RL runtime and training algorithms. Meanwhile, **General Agents** (e.g., [OpenClaw](https://github.com/openclaw/openclaw), Claude Code, Open Code) are producing interaction data that is far richer and more complex than traditional ReAct trajectories.
 
-- **General Agent for Agentic RL**: Traditional Agentic RL frameworks typically rely on simple agents like ReAct. General agents (e.g., OpenClaw, Claude Code, Open Code) offer far richer capabilities—but existing RL pipelines were not designed for them.
+As agents grow more capable, a critical question emerges: **How do we systematically collect, evaluate, and curate high-quality training data from diverse agent interactions?** This is a relatively under-explored yet important direction — especially when human feedback is available as a natural quality signal.
 
-- **Agentic RL for General Agent**: Modern base models have not been fully adapted to thrive inside general agent architectures. We aim to enable models to play a larger, more effective role within these next-generation agents.
-
-**Claw-R1** is a training framework that bridges this gap. It introduces a **Middleware Layer** (Gateway Server + DataPool) as the sole bridge between the Agent Side and the Training Side. Agents—white-box or black-box—access the framework via standard HTTP. This enables three modes: white-box offline, black-box offline, and black-box online service. No framework today adequately supports this paradigm—Claw-R1 is designed to fill that void.
+**Claw-R1** provides the **data foundation** for Agentic RL. It introduces a Middleware Layer (Gateway + DataPool) between the Agent Side and the Training Side, focusing on data collection, evaluation, and curation rather than training algorithms themselves.
 
 <p align="center"><img src="./assets/framework.png" width="800px" alt="Claw-R1 Framework" /></p>
 
 ## Key Features
 
-- **Asynchronous Training & Rollout**: Decouples RL training from rollout, enabling scalable and efficient data collection and model updates.
-
-- **Agent–Training Decoupling**: Supports online-service agents where execution and training run independently. Data flows from live user requests into DataPool; the Trainer continuously fetches batches for training—no dataset required.
+- **Universal Data Collection**: White-box agents submit Steps via API; black-box agents integrate by simply pointing `base_url` to the Gateway (zero code changes); online services collect data from live user interactions in real time.
 
-- **Zero-Code Intrusion**: Black-box agents (LangChain, AutoGen, CrewAI, etc.) integrate with zero modification—just point `base_url` to the Gateway. The framework automatically collects interaction data and trains models.
+- **Data Evaluation & Curation**: Multi-dimensional reward system (rule-based / discriminative RM / generative RM), human feedback signal integration, policy version tracking for freshness-aware curation, and channel-based data partitioning.
 
+- **Flexible Data Serving**: Pluggable `TrainingBackend` to convert curated data into any training engine's native format, with GRPO-aware grouping, train/val channel isolation, and real-time monitoring.
 
 ## Get Started
 
-Explore our comprehensive documentation for setup, configuration, and advanced usage:
-
 - 📖 **[Full Documentation](https://agentr1.github.io/Claw-R1/)**
-- 🚀 [Installation Guide](docs/Installation.md)
+- 🚀 [Installation Guide](docs/getting-started/installation.md)
 - 🛠️ [Architecture Overview](https://agentr1.github.io/Claw-R1/components/)
 
+## Roadmap
+
+- [ ] **Data Quality Dashboard**: Visual monitoring of data quality metrics, reward distributions, and collection statistics.
+- [ ] **Human Feedback Pipeline**: Structured pipeline for capturing and integrating explicit and implicit human feedback signals from online agent services.
+- [ ] **Dataset Export & Versioning**: Export curated datasets with full provenance tracking for reproducibility and sharing.
+- [ ] **Extended TrainingBackend Support**: Native adapters for additional RL frameworks beyond verl.
+
 ## Contributors
 
 **Team Members**: Daoyu Wang, Jie Ouyang, Shuo Yu
@@ -52,16 +54,15 @@ Explore our comprehensive documentation for setup, configuration, and advanced u
 
 **Affiliation**: State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China
 
-
 ## Acknowledgements
 
-Claw-R1 builds upon [Agent-R1](https://github.com/0russwest0/Agent-R1). We extend our gratitude to [MiniMax Forge](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm) for their architectural insights on the Middleware design, and to [rLLM](https://github.com/rllm-org/rllm) for their pioneering work on RL framework design for language agents. We also thank [OpenClaw](https://github.com/openclaw/openclaw) for their remarkable work on personal AI assistants—the modern agent paradigm that inspires our vision. We are grateful to the broader Agentic RL community and all contributors for their support.
+We extend our gratitude to [Agent-R1](https://github.com/0russwest0/Agent-R1), [MiniMax Forge](https://www.minimax.io/news/forge-scalable-agent-rl-framework-and-algorithm), [verl](https://github.com/volcengine/verl), and [rLLM](https://github.com/rllm-org/rllm) for their pioneering work on Agentic RL training infrastructure. We also thank [OpenClaw](https://github.com/openclaw/openclaw) for their remarkable work on personal AI assistants. We are grateful to the broader Agentic RL community and all contributors for their support.
 
 ## Citation
 
 ```bibtex
 @misc{clawr1-2026,
-  title={Claw-R1: Agentic RL for Modern Agents},
+  title={Claw-R1: The Data Foundation for Agentic Reinforcement Learning},
   author={Wang, Daoyu and Ouyang, Jie and Yu, Shuo and Cheng, Mingyue and Liu, Qi},
   year={2025},
   howpublished={\url{https://github.com/AgentR1/Claw-R1}},
 }
 ```
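The "GRPO-aware grouping" mentioned under Flexible Data Serving is easy to picture with a short sketch. This is an illustration only: the `group_id` field and `group_size` parameter are assumed names, not Claw-R1's actual schema.

```python
from collections import defaultdict

def ready_groups(steps, group_size):
    """Bucket steps by rollout group and release only complete groups.

    GRPO compares rewards *within* a group of rollouts for the same prompt,
    so a batch is only useful once every member of a group has arrived.
    `group_id` and `group_size` are illustrative names, not Claw-R1's schema.
    """
    buckets = defaultdict(list)
    complete = []
    for step in steps:
        bucket = buckets[step["group_id"]]
        bucket.append(step)
        if len(bucket) == group_size:
            complete.append(list(bucket))
            bucket.clear()
    return complete

# Two prompts with group_size=2: only prompt "a" has a full group so far.
steps = [
    {"group_id": "a", "reward": 1.0},
    {"group_id": "b", "reward": 0.0},
    {"group_id": "a", "reward": 0.5},
]
print(len(ready_groups(steps, group_size=2)))  # 1
```

A group-aware queue like this is why the documentation below speaks of "batch-ready groups" rather than individual steps.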

docs/components/datapool.md

Lines changed: 13 additions & 12 deletions
@@ -1,19 +1,20 @@
 # DataPool
 
-DataPool is a **Ray Actor** that serves as the central trajectory buffer between the Agent side (Gateway) and the Training side (Trainer).
+DataPool is Claw-R1's **data management core**: a Ray Actor responsible for storing, indexing, quality-tracking, partitioning, and serving agent interaction data on demand. It is not merely a buffer between the Agent side and the Training side, but the hub of the entire data infrastructure.
 
 ## Role in the Architecture
 
 ```
-Gateway ──► DataPool.submit_step()     (async, fire-and-forget)
-Trainer ◄── DataPool.fetch_batch()     (blocking pull of batch-ready groups)
+Gateway ──► DataPool.submit_steps()    (data collection: async writes)
+Trainer ◄── DataPool.fetch_batch()     (data serving: blocking pull of ready groups)
+            DataPool.get_statistics()  (data monitoring: real-time stats)
 ```
 
-DataPool fully decouples the write rate (driven by agent request frequency) from the read rate (driven by training throughput). Neither side waits for the other.
+DataPool fully decouples the data collection rate (driven by agent request frequency) from the data consumption rate (driven by training throughput). Neither side waits for the other.
 
-## Channel System
+## Channel System (Data Partitioning)
 
-DataPool partitions data by **channel**. The default channel is `"train"`; the validation flow uses the `"val"` channel to keep its data isolated.
+DataPool partitions and manages data by **channel**. The default channel is `"train"`; the validation flow uses the `"val"` channel to keep its data isolated.
 
 ```python
 # Training data
@@ -81,18 +82,18 @@ while True:
     train_on_batch(batch)
 ```
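The truncated channel example above can be completed as a tiny in-memory sketch. The real DataPool is a Ray Actor; the method names follow the architecture diagram, but the signatures here are assumptions for illustration, not the framework's actual API.

```python
from collections import defaultdict, deque

class InMemoryDataPool:
    """Toy stand-in for Claw-R1's DataPool (the real one is a Ray Actor).

    Method names follow the architecture diagram; the signatures are
    illustrative assumptions, not the framework's actual API.
    """

    def __init__(self):
        self._channels = defaultdict(deque)  # channel name -> queued steps

    def submit_steps(self, steps, channel="train"):
        self._channels[channel].extend(steps)

    def fetch_batch(self, batch_size, channel="train"):
        queue = self._channels[channel]
        return [queue.popleft() for _ in range(min(batch_size, len(queue)))]

    def get_statistics(self):
        return {ch: len(q) for ch, q in self._channels.items()}

pool = InMemoryDataPool()
pool.submit_steps([{"prompt": "p1"}, {"prompt": "p2"}])   # default "train" channel
pool.submit_steps([{"prompt": "v1"}], channel="val")      # isolated "val" channel
print(pool.get_statistics())  # {'train': 2, 'val': 1}
batch = pool.fetch_batch(2)   # pulls only from "train"; "val" stays untouched
```

The channel argument is the whole isolation mechanism: training and validation never read from each other's queue.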

-## Capacity Management
+## Capacity Management & Backpressure
 
-When `max_queue_size` is set, DataPool drops the oldest ready groups once the queue is full, preventing unbounded memory growth when the Trainer is slow.
+When `max_queue_size` is set, DataPool automatically drops the oldest ready groups once the queue is full, preventing unbounded memory growth from data pile-up. This backpressure mechanism also keeps the data consumed by the training side as fresh as possible.
 
 ```yaml
 async_training:
   max_queue_size: null  # null = unbounded
 ```
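The drop-oldest policy maps directly onto a bounded deque. This is a simplification: the real DataPool drops whole ready groups, and the exact `max_queue_size` semantics are assumed here.

```python
from collections import deque

# maxlen gives drop-oldest semantics: appending to a full deque silently
# evicts from the head, bounding memory when the trainer falls behind.
max_queue_size = 3
queue = deque(maxlen=max_queue_size)

for i in range(5):            # five ready groups arrive, capacity is three
    queue.append(f"group-{i}")

print(list(queue))  # ['group-2', 'group-3', 'group-4']
```

Evicting from the head is what makes this backpressure freshness-preserving: whatever survives is always the newest data.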
 
-## Training Backend
+## Training Backend (Data-Serving Adapter)
 
-DataPool uses a `TrainingBackend` to convert `list[Step]` into the training engine's native format.
+DataPool converts `list[Step]` into any training engine's native format through a pluggable `TrainingBackend`, decoupling data management from the training framework.
 
 ```python
 class VerlBackend(TrainingBackend):
@@ -106,9 +107,9 @@ class VerlBackend(TrainingBackend):
     ...
 ```
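A pluggable backend boils down to a single conversion hook. Below is a minimal sketch; the `TrainingBackend` base class and its `convert` method are assumptions extrapolated from the `VerlBackend` excerpt above, not the framework's exact interface.

```python
from abc import ABC, abstractmethod

class Step(dict):
    """Toy Step record; the real schema lives in Claw-R1."""

class TrainingBackend(ABC):
    """Converts collected Steps into one engine's native batch format."""

    @abstractmethod
    def convert(self, steps: list[Step]) -> object: ...

class DictBackend(TrainingBackend):
    """Illustrative backend: a columnar dict-of-lists batch."""

    def convert(self, steps: list[Step]) -> dict:
        return {
            "prompts": [s["prompt"] for s in steps],
            "responses": [s["response"] for s in steps],
            "rewards": [s["reward"] for s in steps],
        }

batch = DictBackend().convert([
    Step(prompt="2+2?", response="4", reward=1.0),
    Step(prompt="3+3?", response="7", reward=0.0),
])
print(batch["rewards"])  # [1.0, 0.0]
```

Swapping training engines then means swapping one adapter class, which is exactly the decoupling this section describes.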
 
-## Off-policy Support
+## Off-policy Support (Data Freshness Control)
 
-The Trainer can handle historical (off-policy) data via a staleness-threshold configuration:
+Each Step records the `policy_version` under which it was generated, so DataPool and the Trainer can judge data freshness. The Trainer handles historical (off-policy) data via a staleness-threshold configuration:
 
 ```yaml
 async_training:
 ```
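The freshness gate implied by `policy_version` tracking can be sketched as a simple version filter. Field and threshold names are assumptions drawn from this section, not Claw-R1's exact configuration schema.

```python
def fresh_steps(steps, current_version, staleness_threshold):
    """Keep only steps whose policy lag is within the threshold.

    Each step records the policy_version it was generated under; a step
    remains usable while current_version - policy_version stays at or
    below staleness_threshold. Names are illustrative, not Claw-R1's keys.
    """
    return [
        s for s in steps
        if current_version - s["policy_version"] <= staleness_threshold
    ]

steps = [{"policy_version": v} for v in (7, 9, 10)]
kept = fresh_steps(steps, current_version=10, staleness_threshold=2)
print([s["policy_version"] for s in kept])  # [9, 10]; version 7 is too stale
```

Combined with the drop-oldest queue above, this gives two independent knobs on how off-policy the consumed data may become.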

docs/components/index.md

Lines changed: 27 additions & 21 deletions
@@ -1,77 +1,83 @@
 # Components
 
-Claw-R1 consists of six independently runnable components that communicate over HTTP and Ray RPC.
+Claw-R1's components are organized around the **data flow**: from collecting agent interactions, through data management and quality evaluation, to serving data to the training engine. Components communicate over HTTP and Ray RPC.
 
 <div class="grid cards" markdown>
 
-- **Gateway Server**
+- **Gateway Server** · Data Collection Entry
 
     ---
 
-    FastAPI HTTP service. The network-layer entry point for all agent LLM calls. Manages vLLM load balancing and automatically collects training data and submits it to DataPool.
+    FastAPI HTTP service. The unified entry point for all agent LLM calls; automatically collects training Steps from interactions and submits them to DataPool. Supports both white-box explicit submission and black-box automatic collection.
 
     [:octicons-arrow-right-24: Gateway Server](gateway.md)
 
-- **DataPool**
+- **DataPool** · Data Management Core
 
    ---
 
-    Ray Actor. The central trajectory buffer between the Agent side and the Training side. Supports asynchronous writes from the Gateway and batched reads by the Trainer.
+    Ray Actor. Claw-R1's data management hub: stores, indexes, partitions, and serves interaction data. Supports channel isolation, GRPO grouping, capacity backpressure, and real-time statistics monitoring.
 
    [:octicons-arrow-right-24: DataPool](datapool.md)
 
-- **Agent Flow**
+- **Reward System** · Data Quality Evaluation
 
    ---
 
-    Agent execution lifecycle management framework. Supports white-box (Python) and black-box (OpenAI API) modes.
+    `RewardLoopWorker` Ray Actor. Multi-dimensional data quality evaluation: rule-based, discriminative RM, and generative RM, plus integration of human feedback signals.
 
-    [:octicons-arrow-right-24: Agent Flow](agent-flow.md)
+    [:octicons-arrow-right-24: Reward System](reward-system.md)
 
-- **Black-box Agent**
+- **Agent Flow** · White-Box Data Collection
 
    ---
 
-    Black-box agent system. Any agent using an OpenAI-compatible API joins the training loop transparently via `base_url`.
+    Agent execution lifecycle management. White-box agents submit Steps explicitly through the Python API, with full control over the data collection process.
 
-    [:octicons-arrow-right-24: Black-box Agent](blackbox-agent.md)
+    [:octicons-arrow-right-24: Agent Flow](agent-flow.md)
 
-- **Async Training**
+- **Black-box Agent** · Black-Box Data Collection
 
    ---
 
-    `AsyncTrainer` and `AsyncRollouter` Ray Actors. A continuous, non-blocking training loop with parameter synchronization.
+    Zero-code-intrusion black-box agent integration. Any agent using an OpenAI-compatible API joins transparently via `base_url`, and the Gateway collects interaction data automatically.
 
-    [:octicons-arrow-right-24: Async Training](async-training.md)
+    [:octicons-arrow-right-24: Black-box Agent](blackbox-agent.md)
 
-- **Reward System**
+- **Async Training** · Data Consumption & Training
 
   ---
 
-    `RewardLoopWorker` Ray Actor. Computes step-level rewards from rule-based, discriminative, or generative reward models.
+    `AsyncTrainer` and `AsyncRollouter` Ray Actors. Continuously consume high-quality data from DataPool for training, with parameter synchronization.
 
-    [:octicons-arrow-right-24: Reward System](reward-system.md)
+    [:octicons-arrow-right-24: Async Training](async-training.md)
 
 </div>
 
-## Component Interaction Diagram
+## Data-Flow Panorama
 
 ```
+Data Collection Layer
 ┌─────────────────────────────────────────┐
 Black-box Agent ──►│                      │
   (base_url)       │    GATEWAY SERVER    │
                    │ (FastAPI, port 8100) │
-White-box Agent ──►│                      │
+White-box Agent ──►│ auto-collects Steps  │
   (AgentFlow)      └──────────┬───────────┘
-                              │ Ray RPC (submit_step)
+                              │ Ray RPC (submit_steps)
                               ▼
+Data Management Layer
 ┌─────────────────────────────────────────┐
 │                DATAPOOL                 │
 │               (Ray Actor)               │
-│          Channel: train / val           │
+│                                         │
+│  • storage & indexing  • channel partitioning  │
+│  • GRPO grouping       • capacity backpressure │
+│  • quality evaluation  • real-time statistics  │
 └──────────────────┬──────────────────────┘
                    │ fetch_batch()
                    ▼
+Data Consumption Layer
 ┌─────────────────────────────────────────┐
 │              ASYNC TRAINER              │
 │      (Ray Actor, Training GPU Pool)     │

docs/concepts/index.md

Lines changed: 19 additions & 18 deletions
@@ -1,59 +1,60 @@
 # Core Concepts
 
-Claw-R1's design revolves around three core concepts, which together form a closed-loop flywheel.
+Claw-R1's design revolves around three core concepts: **universal data collection**, **data middleware management**, and **data-driven continual evolution**. Together they form a data flywheel running from collection to training.
 
 <div class="grid cards" markdown>
 
-- **Base URL Integration**
+- **Base URL Integration** · Universal Data Collection
 
    ---
 
-    Zero-code-intrusion black-box agent integration. Any agent using an OpenAI-compatible API joins the training system transparently by changing only its `base_url`.
+    Zero-code-intrusion agent data collection. Any agent using an OpenAI-compatible API only needs to change its `base_url`; the Gateway then collects its interaction data automatically.
 
    [:octicons-arrow-right-24: Base URL Integration](base-url-integration.md)
 
-- **Middleware Layer**
+- **Middleware Layer** · Data Middleware
 
    ---
 
-    Gateway + DataPool middleware architecture. Fully decouples the Agent side from the Training side, supporting asynchronous data collection and training.
+    Gateway + DataPool data infrastructure. One unified answer to data collection entry, quality management, partitioned buffering, and on-demand serving.
 
    [:octicons-arrow-right-24: Middleware Layer](middleware-layer.md)
 
-- **Production Scenario**
+- **Production Scenario** · Data-Driven Evolution
 
    ---
 
-    The "deployment = training" paradigm. Agents keep collecting data and improving while serving users, with no offline retraining.
+    The "deployment = training" paradigm. Agents keep collecting interaction data while serving users; user behavior naturally becomes a data quality signal that drives continual model evolution.
 
    [:octicons-arrow-right-24: Production Scenario](production-scenario.md)
 
 </div>
 
-## Closed-Loop Flywheel
+## Data Flywheel
 
 ```
            base_url
       ┌─────────────────────────┐
-     │ Black-box Agent         │
-     │ (any framework)         │
+     │ Any Agent               │
+     │ (white-box / black-box) │
      └────────────┬────────────┘
                   │ OpenAI API
                   ▼
      ┌─────────────────────────┐
-     │ Gateway                 │ ← Middleware Layer
-     │ (auto-collects data)    │
+     │ Gateway                 │ ← data collection entry
+     │ (auto-collects Steps)   │
      └────────────┬────────────┘
                   │
                   ▼
      ┌─────────────────────────┐
-     │ DataPool                │
+     │ DataPool                │ ← data management core
+     │ (evaluate·curate·serve) │   (quality evaluation + partitioning)
      └────────────┬────────────┘
                   │
                   ▼
     ┌─────────────────────────┐
-     │ Trainer                 │ ← Production Scenario
-     │ (continuous training)   │   (deployment = training)
+     │ Trainer                 │ ← data consumption
+     │ (continuous training)   │
      └────────────┬────────────┘
                   │ weight sync
                   ▼
@@ -65,6 +66,6 @@ Claw-R1's design revolves around three core concepts, which together form a clos
 
 The three concepts in concert:
 
-1. **Base URL** lets any agent plug in at zero cost
-2. **Middleware** asynchronously collects and buffers training data
-3. **Production Scenario** lets the model keep evolving while in service
+1. **Base URL** lets any agent's interaction data be collected at zero cost
+2. **Middleware** manages data quality, partitioning, and serving
+3. **Production Scenario** folds human feedback signals naturally into the data, driving continual model evolution
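The zero-cost integration in point 1 amounts to a one-line configuration change. Here is a sketch of the idea, with an illustrative Gateway address (port 8100 follows the components diagram) and generic config keys; no Claw-R1 API is assumed.

```python
def point_at_gateway(agent_config: dict, gateway_url: str) -> dict:
    """Redirect an OpenAI-compatible agent to the Gateway.

    The agent's code is untouched: only the endpoint it talks to changes,
    so every LLM call now flows through the Gateway, which can record the
    interaction as training data. Config keys here are illustrative.
    """
    redirected = dict(agent_config)  # copy; leave the original config alone
    redirected["base_url"] = gateway_url
    return redirected

# Any OpenAI-compatible agent config, e.g. from LangChain/AutoGen/CrewAI.
config = {"base_url": "https://api.openai.com/v1", "model": "gpt-4o-mini"}
new_config = point_at_gateway(config, "http://localhost:8100/v1")
print(new_config["base_url"])  # http://localhost:8100/v1
```

Because the agent's own code never changes, any OpenAI-compatible framework can feed the data flywheel this way.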
