This is a framework for creating and evaluating AI agent tasks focused on data science problems. It provides a structured approach to:
- Define data science tasks with clear specifications
- Grade agent solutions by comparing output files to expected values
- Manage multiple task difficulties (easy, medium, hard)
- Run tasks in isolated Docker environments with proper grading
Directory structure:

.
├── src/hud_controller/ # Main framework code
│ ├── app.py # Main MCP server and entry points
│ ├── spec.py # Core specifications (ProblemSpec, Grade)
│ ├── grading_runner.py # Output validation and grading logic
│ ├── utils.py # Utility functions
│ ├── setup.py # Environment setup
│ ├── problems/ # Task definitions
│ │ └── basic.py # Problem registrations
│ └── tools/ # MCP tools for agent interaction
│ ├── base.py # Base tool definitions
│ ├── bash.py # Bash execution
│ ├── edit.py # File editing
│ ├── shell.py # Shell commands
│ └── apply_patch.py # Patch application
├── problem_templates/ # Data files for each problem template
│ └── titanic_dataset/ # Example: Titanic dataset template
│ └── Titanic-Dataset.csv
├── problems/ # Golden scripts that produce expected outputs
│ └── num_survivors.py # Example: counts Titanic survivors
├── utils/
│ └── imagectl3.py # Docker image build/push/validate tool
├── pyproject.toml # Python package configuration
├── Dockerfile # Container setup
└── README.md # This file
Problems are defined using the ProblemSpec data class:
ProblemSpec(
    id="titanic_dataset_num_survivors",  # Unique problem ID
    template="titanic_dataset",          # Folder name in problem_templates/
    golden_script="num_survivors.py",    # Script in problems/ that produces correct output
    description="""
    In this problem you will be working with the titanic dataset.
    The dataset is available at Titanic-Dataset.csv in the current directory.
    Create a file called num_survivors.txt that contains the number of survivors.
    """,
    difficulty="easy",
    required_outputs={"num_survivors.txt": "342"},  # Expected outputs (trimmed)
)

Templates (problem_templates/):
- Contain data files and any starter code
- Copied to /home/ubuntu/workspace when the Docker image is built
- Each template is a folder (e.g., titanic_dataset/)
Golden Scripts (problems/):
- Python scripts that solve the problem correctly
- Used during validation to verify expected outputs are correct
- Run from the workspace directory during grading
Tasks are graded by:
- Copying the workspace to a clean grading directory
- Copying and running the golden script
- Comparing each required output file against expected values (string match after trimming; see the sketch below)
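
As an illustration of the comparison step, here is a minimal sketch (a hypothetical helper, not the actual grading_runner.py code; all names are assumptions):

from pathlib import Path

def check_outputs(grading_dir, required_outputs):
    """Compare each required output file against its expected value after trimming."""
    results = {}
    for filename, expected in required_outputs.items():
        path = Path(grading_dir) / filename
        if not path.exists():
            results[filename] = False  # a missing output file fails the check
        else:
            # values are compared as strings after stripping whitespace
            results[filename] = path.read_text().strip() == expected.strip()
    return results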
To add a new problem, create a new folder in problem_templates/ with your data files:
problem_templates/
└── my_dataset/
├── data.csv
└── config.json
Next, create a Python script in problems/ that produces the expected outputs:
# problems/my_solution.py
#!/usr/bin/env python3
"""Golden script for my_dataset problem."""
import csv
from pathlib import Path
def main():
    # Read data from the workspace (cwd)
    with open('data.csv') as f:
        data = list(csv.DictReader(f))

    # Compute the answer
    result = len(data)

    # Write the output file
    Path('answer.txt').write_text(str(result))

if __name__ == '__main__':
    main()

Then add the problem to src/hud_controller/problems/basic.py (or create a new file):
from hud_controller.spec import ProblemSpec, PROBLEM_REGISTRY
PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_dataset_count",
        template="my_dataset",
        golden_script="my_solution.py",
        description="""
        Your task description here. Explain what the agent should do.
        The data is available in data.csv.
        Create a file called answer.txt with the number of rows.
        """,
        difficulty="easy",
        required_outputs={"answer.txt": "100"},  # Expected value after trim
    )
)

Finally, use imagectl3.py to build and validate:
# Build and validate a specific problem
uv run utils/imagectl3.py myprefix_ -bv --ids my_dataset_count
# Build and validate all problems
uv run utils/imagectl3.py myprefix_ -bv

The validation workflow:
- Copies the template to a temp directory
- Runs the golden script
- Verifies all required outputs match expected values (sketched below)
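
In outline, this workflow corresponds to something like the following sketch (a hypothetical function; imagectl3.py's actual implementation may differ):

import shutil
import subprocess
import tempfile
from pathlib import Path

def validate(template, golden_script, required_outputs):
    """Copy the template to a temp dir, run the golden script, and check the outputs."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp) / "workspace"
        shutil.copytree(Path("problem_templates") / template, workdir)
        shutil.copy(Path("problems") / golden_script, workdir)
        subprocess.run(["python", golden_script], cwd=workdir, check=True)
        return all(
            (workdir / name).read_text().strip() == value
            for name, value in required_outputs.items()
        )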
Install dependencies:

uv sync

# Build all images with prefix, validate, and generate JSON configs
uv run utils/imagectl3.py datascience_ -bvj
# Run with parallel jobs for faster builds
uv run utils/imagectl3.py datascience_ -bvj --jobs 4

Run tasks locally:

uv run hud local-claude-hud.json claude --max-steps 50
# or for OpenAI
uv run hud local-openai-hud.json openai --max-steps 50

For remote runs, push images to a registry first:
# Build, validate, generate JSON, and push
uv run utils/imagectl3.py yourusername/datascience_ -bvjp --jobs 4

Then run remotely:
uv run hud remote-claude-hud.json claude --max-steps 50

Key environment variables:
- MCP_TESTING_MODE - Enable testing tools (default: "1")
- HINTS - Hint mode: "none" or "all" (default: "none")
- PROBLEM_ID - The problem ID to run
- TEMPLATE - The template folder to use
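
For illustration, code inside the container could read these variables roughly as follows (a sketch; the framework's actual handling may differ):

import os

problem_id = os.environ.get("PROBLEM_ID")               # which problem this container runs
template = os.environ.get("TEMPLATE")                   # template folder copied into the workspace
hints = os.environ.get("HINTS", "none")                 # "none" or "all"
testing_mode = os.environ.get("MCP_TESTING_MODE", "1")  # "1" enables testing tools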
The Dockerfile accepts these build arguments:
- TEMPLATE - Which template folder to copy to /home/ubuntu/workspace
- PROBLEM_ID - The problem ID for the image
- HINTS - Whether to include hints in the prompt
You can add hints to problems:
from hud_controller.spec import ProblemSpec, HintSpec, PROBLEM_REGISTRY
PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem",
        # ... other fields ...
        hints=[
            HintSpec(
                hint_type="legit",
                text="The Survived column contains 0 or 1",
                why_legitmate="This is documented in the dataset description"
            ),
        ],
    )
)

Build with hints enabled:
uv run utils/imagectl3.py prefix_ -bv --hints all

Writing good tasks:
- Clear Descriptions: Provide detailed, unambiguous task descriptions
- Focused Scope: Each task should test one concept or skill
- Realistic Scenarios: Base tasks on real-world data science problems
- Fair Hints: If providing hints, ensure they guide without giving away the solution
Golden scripts:
- Minimal Dependencies: Use Python standard library when possible
- Clear Output: Write exactly what's expected, nothing more
- Error Handling: Handle missing files gracefully for better error messages (see the sketch after this list)
- Comments: Document what the script does for maintainability
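
For example, a golden script might fail with a readable message when its input is missing (an illustrative pattern, not something the framework requires):

import sys
from pathlib import Path

data_path = Path("data.csv")
if not data_path.exists():
    sys.exit(f"Golden script error: expected input file {data_path} was not found in the workspace")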
Templates:
- Self-Contained: Include all necessary data files
- Reasonable Size: Keep datasets small enough for quick Docker builds
- Clear Naming: Use descriptive file names
- No Secrets: Don't include expected outputs in the template
Expected outputs:
- Trimmed Comparison: Values are compared after stripping whitespace
- Exact Match: The output must exactly match the expected value
- Multiple Outputs: You can require multiple output files (see the example after this list)
- Simple Values: Keep expected outputs simple (numbers, short strings)
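
For instance, a problem could require two output files; the second file name and its value below are purely illustrative:

ProblemSpec(
    # ... other fields ...
    required_outputs={
        "num_survivors.txt": "342",   # from the Titanic example above
        "survival_rate.txt": "0.38",  # hypothetical second output
    },
)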