🤖 Terminal Bench Agentic Data Pipeline

TL;DR: I built a multi-agent system with 20+ Claude Code instances working in parallel to generate high-quality training data for terminal-based coding tasks, producing 331+ validated datapoints for my scalable RL training project.

Why build an agentic data pipeline?

Because generating quality training data for coding agents requires creativity, validation, and scale—tasks perfectly suited for AI collaboration.

🎬 See It In Action

Watch 20+ agents working in parallel to generate training data at scale:

20x agents in parallel

20x_agents_better.mov

6x agents in parallel (easier to follow!)

6x_agents_clear_view.mov

🏗️ High-Level Architecture

The pipeline transforms Terminal Bench evaluation tasks into diverse training datapoints through three specialized agent stages:

Seed tasks from Terminal Bench → Idea Agents → Creative variations
Draft ideas → Builder Agents → Complete executable datapoints
Built datapoints → Review Agents → Production-ready training data

All agents work independently in parallel, coordinated by a central Task Manager that prevents duplication and handles failures.

📈 Pipeline Results

Production Output

331 validated datapoints generated
20+ agents working concurrently
3 specialized agent types with distinct roles
100% Docker-validated environments

Datapoint Categories Generated

Category	Count	Description
Software Engineering	97	API development, CLI tools, code architecture
System Administration	59	Server config, process management, monitoring
Security	42	Vulnerability fixes, authentication, encryption
Data Processing	37	ETL pipelines, data parsing, transformations
Debugging	28	Fix race conditions, memory leaks, logic errors
Machine Learning	17	Model training, data preprocessing, evaluation
File Operations	15	File parsing, I/O optimization, format conversion
Scientific Computing	15	Numerical methods, simulations, data analysis

Most Common Technologies

Python: Used in 196 datapoints (59%)
CLI Tools: 47 datapoints
APIs: 30 datapoints
C: 22 datapoints

🔄 Agent Roles & Workflow

Each agent type operates independently with specialized tools and clear responsibilities:

🎨 Idea Generation Agents

Input: Seed tasks from Terminal Bench
Process: Analyze core skills → Generate n×multiplier variations → Select best ideas
Output: Draft specifications in shared workspace

Key Innovation: Refinement criteria provided only AFTER brainstorming to maximize creativity

🔨 Datapoint Builder Agents

Input: Draft specifications from idea agents
Process: Build complete scenarios → Validate via software script → Iterate until passing
Output: Executable datapoints with all required components

Validation Requirements:

✅ Dockerfile builds successfully
✅ Tests fail before agent intervention
✅ All dependencies present
✅ Test weights sum to 1.0

🔍 Quality Review Agents

Input: Validated datapoints from builders
Process: Check quality standards → Edit if needed → Re-validate → Categorize
Output: Approved datapoints with metadata or rejection with reasons

📊 Task Manager System

The Task Manager enables parallel agent coordination without complex handoffs:

# Agent claims work atomically
task = tm.get_next_task("idea-agent-07", task_types=["generate_idea"])

# Process independently
draft_ideas = generate_creative_variations(task["seed_data"])

# Complete with results
tm.complete_task(task["id"], "idea-agent-07", {"drafts": draft_ideas})

Features:

Atomic task claiming (no collisions)
Automatic timeout recovery
Parent-child task tracking
Real-time status monitoring

📁 Datapoint Structure

Each training datapoint contains:

draft_001_a/
├── prompt.md       # 1-3 sentence task (e.g., "Auth times out with 100+ users. Fix it.")
├── dockerfile      # Ubuntu 24.04 (or similar) environment setup
├── tests.py        # Pytest verification functions
├── weights.json    # Test importance distribution
└── files/          # Additional resources
    ├── app.py      # Broken code to fix
    └── config.json # Configuration files

✅ Validation Pipeline

Shared validation tools ensure quality across all agents:

Docker Build: Environment must build successfully
Test Discovery: Pytest must find all test functions
Fail-First: Tests must fail in initial state
Dependency Check: All required packages present
Weight Validation: Test weights sum to exactly 1.0

🛠️ Infrastructure & Tools

Workspace Organization

agents/
├── idea_agent_workspace/      # Idea generation tools & instructions
├── dp_builder_workspace/      # Building tools & staging areas
└── review_agent_workspace/    # Review tools & quality checks

shared_workspace/              # Common filesystem for all agents
shared_tools/                  # Validation, patching, utilities
task_manager/                  # Coordination & state management

Agent-Specific Tools

Idea Agents: get_task_parameters.py, get_idea_refinement_details.py
Builder Agents: create_dp.py, add_dp_to_review.py
Review Agents: approve_datapoint.py, cancel_datapoint.py, show_categories_tags.py

Shared Tools

validate_datapoint.py - Complete validation suite
patch_dp.py - Update datapoint components
patch_additional_files.py - Manage resource files

🚀 Getting Started

# Clone repository
git clone https://github.com/Danau5tin/tbench-agentic-data-pipeline.git
cd tbench-agentic-data-pipeline

# Install dependencies
uv sync

# Initialize seed tasks
python init_seed_tasks.py <path_to_terminal_bench_tasks>

# Launch agents with Claude Code
# Idea Agent:
"See @agents/idea_agent_workspace/workflow_instructions.md - you are the idea generation agent, go!"

# Builder Agent:
"See @agents/dp_builder_workspace/workflow_instructions.md - you are the datapoint builder agent, go!"

# Review Agent:
"See @agents/review_agent_workspace/workflow_instructions.md - you are the quality review agent, go!"

💡 Key Design Decisions

Why Multiple Agent Types?

Separation of concerns: Each agent excels at one task
Parallel scaling: Multiple instances per type
Quality gates: Three-stage validation ensures high standards

Why Shared Filesystem?

Simplicity: No complex message passing
Reliability: File operations are atomic
Debugging: Easy to inspect intermediate states

Why Task Manager?

Coordination: Prevents duplicate work
Recovery: Handles agent failures gracefully
Monitoring: Real-time pipeline visibility

Built with Claude Code 🤖 - This entire multi-agent system was developed using Claude Code, demonstrating the power of AI agents building infrastructure for other AI agents.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

🤖 Terminal Bench Agentic Data Pipeline

🎬 See It In Action

20x agents in parallel

6x agents in parallel (easier to follow!)

📚 Table of Contents

🏗️ High-Level Architecture

📈 Pipeline Results

Production Output

Datapoint Categories Generated

Most Common Technologies

🔄 Agent Roles & Workflow

🎨 Idea Generation Agents

🔨 Datapoint Builder Agents

🔍 Quality Review Agents

📊 Task Manager System

📁 Datapoint Structure

✅ Validation Pipeline

🛠️ Infrastructure & Tools

Workspace Organization

Agent-Specific Tools

Shared Tools

🚀 Getting Started

💡 Key Design Decisions

Why Multiple Agent Types?

Why Shared Filesystem?

Why Task Manager?

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

🤖 Terminal Bench Agentic Data Pipeline

🎬 See It In Action

20x agents in parallel

6x agents in parallel (easier to follow!)

📚 Table of Contents

🏗️ High-Level Architecture

📈 Pipeline Results

Production Output

Datapoint Categories Generated

Most Common Technologies

🔄 Agent Roles & Workflow

🎨 Idea Generation Agents

🔨 Datapoint Builder Agents

🔍 Quality Review Agents

📊 Task Manager System

📁 Datapoint Structure

✅ Validation Pipeline

🛠️ Infrastructure & Tools

Workspace Organization

Agent-Specific Tools

Shared Tools

🚀 Getting Started

💡 Key Design Decisions

Why Multiple Agent Types?

Why Shared Filesystem?

Why Task Manager?