24 changes: 20 additions & 4 deletions .github/actions/setup-build-env/action.yml
@@ -46,6 +46,13 @@ runs:
- name: Install Zig toolchain
uses: mlugg/setup-zig@v2

- name: Install system build tools (Linux)
if: runner.os == 'Linux'
shell: bash
run: |
sudo apt-get update
sudo apt-get install -y binutils

- name: Cache dependencies
id: cache-deps
uses: actions/cache@v4
@@ -56,7 +63,18 @@ runs:
~/.cargo/registry/cache/
~/.cargo/git/db/
~/zig
key: ${{ runner.os }}-cargo-${{ hashFiles('**/Cargo.lock') }}
# Cache key includes:
# - OS: different OSes have different binaries
# - action.yml hash: invalidates cache when action changes
# - Python version: different Python versions may need different tools
# - Rust toolchain: nightly toolchain version
# - Tool versions: dioxus-cli version (update when tool versions change)
# - Cargo.lock: invalidates when dependencies change
key: ${{ runner.os }}-setup-${{ hashFiles('.github/actions/setup-build-env/action.yml') }}-python-${{ inputs.python-version }}-rust-nightly-dx-0.7.2-cargo-${{ hashFiles('**/Cargo.lock') }}
restore-keys: |
${{ runner.os }}-setup-${{ hashFiles('.github/actions/setup-build-env/action.yml') }}-python-${{ inputs.python-version }}-rust-nightly-dx-0.7.2-cargo-
${{ runner.os }}-setup-${{ hashFiles('.github/actions/setup-build-env/action.yml') }}-cargo-
${{ runner.os }}-cargo-

- name: Install cargo tools
if: inputs.install-cargo-tools == 'true'
@@ -65,9 +83,7 @@ runs:
test -e ~/.cargo/bin/cargo-zigbuild || cargo install cargo-zigbuild
test -e ~/.cargo/bin/rnr || cargo install rnr
test -e ~/.cargo/bin/cargo-nextest || cargo install cargo-nextest
test -e ~/.cargo/bin/cargo-binstall || cargo install cargo-binstall
test -e ~/.cargo/bin/dx || cargo binstall dioxus-cli@0.7.2 -y
test -e ~/.cargo/bin/trunk || cargo install trunk --locked
test -e ~/.cargo/bin/dx || cargo install dioxus-cli@0.7.2

- name: Install Python Build Dependencies
if: inputs.install-python-deps == 'true'
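The `key` / `restore-keys` fallback added to `action.yml` above can be sketched in Python. This is a minimal simulation, assuming the documented `actions/cache` behaviour (exact match on `key` first, then each restore key tried in order as a prefix, newest matching cache wins). The hash values (`act1`, `lock1`, ...), the Python version `3.12`, and the helper names are hypothetical and exist only for illustration.

```python
from typing import Optional, Sequence


def make_key(os_name: str, action_hash: str, python: str, lock_hash: str) -> str:
    """Build the primary key in the same shape as the `key:` line in the diff."""
    return (
        f"{os_name}-setup-{action_hash}-python-{python}"
        f"-rust-nightly-dx-0.7.2-cargo-{lock_hash}"
    )


def make_restore_keys(os_name: str, action_hash: str, python: str) -> list[str]:
    """Build the three fallback prefixes in the same order as `restore-keys:`."""
    return [
        f"{os_name}-setup-{action_hash}-python-{python}-rust-nightly-dx-0.7.2-cargo-",
        f"{os_name}-setup-{action_hash}-cargo-",
        f"{os_name}-cargo-",
    ]


def restore_cache(
    key: str, restore_keys: Sequence[str], saved: Sequence[str]
) -> Optional[str]:
    """Return the saved cache key that would be restored, or None on a miss.

    `saved` is ordered oldest to newest (an assumption of this sketch).
    """
    if key in saved:
        return key  # exact hit
    for prefix in (key, *restore_keys):
        matches = [k for k in saved if k.startswith(prefix)]
        if matches:
            return matches[-1]  # newest cache sharing this prefix
    return None


# Hypothetical cache contents: one old-format entry, one new-format entry.
saved = [
    "Linux-cargo-oldlock",
    make_key("Linux", "act1", "3.12", "lock1"),
]

# Cargo.lock changed, action.yml unchanged: the first restore key matches.
hit_same_action = restore_cache(
    make_key("Linux", "act1", "3.12", "lock2"),
    make_restore_keys("Linux", "act1", "3.12"),
    saved,
)

# action.yml changed: only the old-format `Linux-cargo-` prefix matches.
hit_new_action = restore_cache(
    make_key("Linux", "act2", "3.12", "lock2"),
    make_restore_keys("Linux", "act2", "3.12"),
    saved,
)
```

Under these assumptions, the second restore key (`...-setup-<hash>-cargo-`) never matches a key built by `make_key`, because `-python-...` always follows the action hash. If prefix matching works as assumed here, that entry may be worth double-checking.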
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -247,4 +247,4 @@ jobs:
- name: Run Python Tests
env:
PROBING: "1"
run: maturin develop && pytest tests
run: pytest tests
36 changes: 21 additions & 15 deletions docs/mkdocs.yml
@@ -80,22 +80,25 @@ plugins:
site_name: "Probing 文档"
nav_translations:
Home: 首页
Installation: 安装指南
Getting Started: 入门指南
Why Probing: 为什么选择 Probing
Installation: 安装
Quick Start: 快速开始
User Guide: 用户指南
SQL Analytics: SQL 分析
Memory Analysis: 内存分析
Debugging: 调试指南
Troubleshooting: 常见问题
Examples: 示例
Training Debugging: 训练调试
Memory Leak: 内存泄漏
Performance Analysis: 性能分析
Design: 设计文档
Architecture: 系统架构
Profiling: 性能分析
Distributed: 分布式
Extensibility: 扩展机制
Examples: 示例
Training Debugging: 训练调试
Memory Leak: 内存泄漏
Performance Analysis: 性能分析
Reference: 参考
API Reference: API 参考
Versions: 版本兼容性
Contributing: 贡献指南
@@ -111,29 +114,32 @@ plugins:

nav:
- Home: index.md
- Installation: installation.md
- Quick Start: quickstart.md
- Getting Started:
- Why Probing: why-probing.md
- Installation: installation.md
- Quick Start: quickstart.md
- User Guide:
- guide/index.md
- SQL Analytics: guide/sql-analytics.md
- Memory Analysis: guide/memory-analysis.md
- Debugging: guide/debugging.md
- Troubleshooting: guide/troubleshooting.md
- Examples:
- examples/index.md
- Training Debugging: examples/training-debugging.md
- Memory Leak: examples/memory-leak.md
- Performance Analysis: examples/performance-analysis.md
- Design:
- design/index.md
- Architecture: design/architecture.md
- Profiling: design/profiling.md
- Debugging: design/debugging.md
- Distributed: design/distributed.md
- Extensibility: design/extensibility.md
- Examples:
- examples/index.md
- Training Debugging: examples/training-debugging.md
- Memory Leak: examples/memory-leak.md
- Performance Analysis: examples/performance-analysis.md
- API Reference: api-reference.md
- Versions: versions.md
- Contributing: contributing.md
- Reference:
- API Reference: api-reference.md
- Versions: versions.md
- Contributing: contributing.md

extra:
generator: false
65 changes: 65 additions & 0 deletions docs/src/index.md
@@ -11,6 +11,68 @@ hide: toc

**Probing** is a dynamic performance profiler for distributed AI applications.

## 🎯 Why Probing?

### Pain Points of Traditional Profilers

| Problem | Traditional Approach | Probing Solution |
|---------|---------------------|------------------|
| **Code modification required** | Add logging, timers, decorators | ✅ Dynamic injection, zero code changes |
| **Fixed report formats** | Predefined tables and charts | ✅ SQL queries, custom analysis |
| **Service restart needed** | Must stop and restart | ✅ Runtime attachment |
| **Steep learning curve** | Different syntax per tool | ✅ Familiar SQL + Python |
| **Distributed is hard** | Analyze each node separately | ✅ Unified cross-node view |

### Core Technical Advantages

=== "🔧 Dynamic Probe Injection"

Professional-grade code injection based on ptrace:

- No source code modification required
- Supports x86_64 and aarch64 architectures
- Complete state save and restore mechanism
- Production-safe implementation

=== "📊 SQL Query Engine"

Built on Apache DataFusion:

- Standard SQL syntax, no new language to learn
- Millisecond query response
- Complex aggregations, window functions
- Plugin-based data source extension

=== "🐍 Remote REPL"

Execute Python directly in the target process:

- Inspect any variable or object
- Modify runtime state in real time
- No need to stop training jobs
- Full Python environment access

=== "🌐 Distributed Support"

Native multi-node support:

- Unified cross-node queries
- Automatic process discovery
- Communication latency analysis
- Cluster-wide performance view
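To make the "SQL queries, custom analysis" idea concrete, here is a minimal sketch of an ad-hoc aggregation over trace data. It uses Python's built-in `sqlite3` as a stand-in for the DataFusion-based engine, not Probing's actual query path. The `torch_trace` name follows the table queried in the quick-start command, but the columns (`node`, `module`, `duration_ms`) and the sample rows are hypothetical.

```python
import sqlite3

# In-memory database standing in for the profiler's query engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE torch_trace (node TEXT, module TEXT, duration_ms REAL)"
)

# Hypothetical trace rows from two nodes.
rows = [
    ("node-0", "attention", 12.0),
    ("node-0", "attention", 14.0),
    ("node-0", "mlp", 5.0),
    ("node-1", "attention", 20.0),
    ("node-1", "mlp", 6.0),
]
conn.executemany("INSERT INTO torch_trace VALUES (?, ?, ?)", rows)

# Custom analysis: call count and mean duration per module, slowest first.
summary = conn.execute(
    """
    SELECT module, COUNT(*) AS calls, AVG(duration_ms) AS avg_ms
    FROM torch_trace
    GROUP BY module
    ORDER BY avg_ms DESC
    """
).fetchall()

for module, calls, avg_ms in summary:
    print(f"{module}: {calls} calls, {avg_ms:.2f} ms on average")
```

The same `GROUP BY` query could in principle be passed to `probing -t <pid> query "..."` once the real column names are known; the schema here is only a placeholder.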

## 🔄 Comparison with Alternatives

| Feature | Probing | py-spy | Perfetto | torch.profiler |
|:--------|:-------:|:------:|:--------:|:--------------:|
| **Zero Intrusion** | ✅ | ✅ | ❌ | ❌ |
| **Dynamic Injection** | ✅ | ❌ | ❌ | ❌ |
| **SQL Queries** | ✅ | ❌ | ❌ | ❌ |
| **Remote REPL** | ✅ | ❌ | ❌ | ❌ |
| **Distributed Support** | ✅ | ❌ | ✅ | ⚠️ |
| **AI Framework Integration** | ✅ | ❌ | ⚠️ | ✅ |
| **Web UI** | ✅ | ❌ | ✅ | ✅ |

## Key Features

- **Zero Intrusion** - Attach to running processes without code changes
@@ -30,6 +92,9 @@ probing -t <pid> inject

# Query performance data
probing -t <pid> query "SELECT * FROM python.torch_trace LIMIT 10"

# Remote REPL debugging
probing -t <pid> repl
```

## Use Cases
65 changes: 65 additions & 0 deletions docs/src/index.zh.md
@@ -11,6 +11,68 @@ hide: toc

**Probing** 是一个面向分布式 AI 应用的动态性能分析器。

## 🎯 为什么选择 Probing?

### 传统 Profiler 的痛点

| 问题 | 传统方案 | Probing 方案 |
|------|----------|--------------|
| **需要代码修改** | 添加日志、计时器、装饰器 | ✅ 动态注入,零代码修改 |
| **固定报告格式** | 预设的表格和图表 | ✅ SQL 查询,自定义分析 |
| **需要重启服务** | 必须停止再启动 | ✅ 运行时附加 |
| **学习成本高** | 各工具语法不同 | ✅ 熟悉的 SQL + Python |
| **分布式困难** | 各节点独立分析 | ✅ 跨节点统一视图 |

### 核心技术优势

=== "🔧 动态探针注入"

基于 ptrace 的专业级代码注入技术:

- 无需修改目标程序源码
- 支持 x86_64 和 aarch64 架构
- 完整的状态保存与恢复机制
- 生产环境安全可用

=== "📊 SQL 查询引擎"

基于 Apache DataFusion 构建:

- 标准 SQL 语法,无需学习新语言
- 毫秒级查询响应
- 支持复杂聚合、窗口函数
- 插件式数据源扩展

=== "🐍 远程 REPL"

直接在目标进程中执行 Python:

- 检查任意变量和对象
- 实时修改运行状态
- 无需停止训练任务
- 完整的 Python 环境

=== "🌐 分布式支持"

原生支持多节点场景:

- 统一的跨节点查询
- 自动进程发现
- 通信延迟分析
- 集群级性能视图

## 🔄 竞品对比

| 特性 | Probing | py-spy | Perfetto | torch.profiler |
|:-----|:-------:|:------:|:--------:|:--------------:|
| **零侵入** | ✅ | ✅ | ❌ | ❌ |
| **动态注入** | ✅ | ❌ | ❌ | ❌ |
| **SQL 查询** | ✅ | ❌ | ❌ | ❌ |
| **远程 REPL** | ✅ | ❌ | ❌ | ❌ |
| **分布式支持** | ✅ | ❌ | ✅ | ⚠️ |
| **AI 框架集成** | ✅ | ❌ | ⚠️ | ✅ |
| **Web UI** | ✅ | ❌ | ✅ | ✅ |

## 核心特性

- **零侵入** - 无需修改代码即可附加到运行中的进程
@@ -30,6 +92,9 @@ probing -t <pid> inject

# 查询性能数据
probing -t <pid> query "SELECT * FROM python.torch_trace LIMIT 10"

# 远程 REPL 调试
probing -t <pid> repl
```

## 使用场景