This is a framework for creating and evaluating AI agent tasks focused on operations research problems. It provides a structured approach to:
- Define optimization tasks with clear specifications
- Grade agent solutions by comparing output files to expected values
- Manage multiple task difficulties (easy, medium, hard)
- Run tasks in isolated Docker environments with proper grading
```
.
├── src/hud_controller/          # Main framework code
│   ├── app.py                   # Main MCP server and entry points
│   ├── spec.py                  # Core specifications (ProblemSpec, Grade)
│   ├── grading_runner.py        # Output validation and grading logic
│   ├── utils.py                 # Utility functions
│   ├── setup.py                 # Environment setup
│   ├── problems/                # Task definitions
│   │   └── basic.py             # Problem registrations
│   └── tools/                   # MCP tools for agent interaction
│       ├── base.py              # Base tool definitions
│       ├── bash.py              # Bash execution
│       ├── edit.py              # File editing
│       ├── shell.py             # Shell commands
│       └── apply_patch.py       # Patch application
├── problem_templates/           # Data files for each problem template
│   └── default/                 # Default template (empty workspace)
├── problems/                    # Golden scripts that produce expected outputs
│   ├── or_cookbook_2_1_4.py     # Production planning LP
│   └── or_dc_inventory_6week.py # Multi-period inventory optimization
├── utils/
│   └── imagectl3.py             # Docker image build/push/validate tool
├── pyproject.toml               # Python package configuration
├── Dockerfile                   # Container setup (includes Pyomo + CBC solver)
└── README.md                    # This file
```
Problems are defined using the `ProblemSpec` dataclass:
```python
ProblemSpec(
    id="cookbook_2_1_4",                   # Unique problem ID
    template="default",                    # Folder name in problem_templates/
    golden_script="or_cookbook_2_1_4.py",  # Script in problems/ that produces correct output
    description="""
    Suppose you are thinking about starting up a business to produce Product X...
    Write the optimal production quantities of product x and y in optimal_x.txt
    and optimal_y.txt respectively.
    """,
    difficulty="easy",
    required_outputs={"optimal_x.txt": "20", "optimal_y.txt": "60"},  # Expected outputs (trimmed)
)
```

Templates (`problem_templates/`):
- Contain data files and any starter code
- Copied to `/home/ubuntu/workspace` when the Docker image is built
- Each template is a folder (e.g., `default/`)
Golden Scripts (problems/):
- Python scripts that solve the optimization problem correctly using Pyomo
- Used during validation to verify expected outputs are correct
- Run from the workspace directory during grading
Tasks are graded by:
- Copying the workspace to a clean grading directory
- Copying and running the golden script
- Comparing each required output file against expected values (string match after trimming)
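The comparison step can be sketched in a few lines of Python (a rough illustration; `grade_outputs` is a hypothetical helper, not the framework's actual API):

```python
from pathlib import Path


def grade_outputs(workspace: Path, required_outputs: dict[str, str]) -> dict[str, bool]:
    """Compare each required output file to its expected value (trimmed string match)."""
    results = {}
    for filename, expected in required_outputs.items():
        path = workspace / filename
        if not path.exists():
            results[filename] = False  # a missing file fails that check
            continue
        results[filename] = path.read_text().strip() == expected.strip()
    return results
```

Note that a missing output file counts as a failure for that file rather than raising an error.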
Create a new folder in problem_templates/ with your data files:
```
problem_templates/
└── my_problem/
    ├── data.csv
    └── config.json
```
Create a Python script in problems/ that produces the expected outputs:
```python
#!/usr/bin/env python3
# problems/or_my_problem.py
"""Golden script for my optimization problem."""
from pathlib import Path

from pyomo.environ import (
    ConcreteModel, Var, Objective, Constraint,
    NonNegativeReals, minimize, SolverFactory, value,
)


def main():
    model = ConcreteModel()

    # Decision variables
    model.x = Var(domain=NonNegativeReals)
    model.y = Var(domain=NonNegativeReals)

    # Objective
    model.cost = Objective(expr=3*model.x + 2*model.y, sense=minimize)

    # Constraints
    model.c1 = Constraint(expr=model.x + model.y >= 10)
    model.c2 = Constraint(expr=2*model.x + model.y >= 15)

    # Solve
    SolverFactory('cbc').solve(model)

    # Write the output file
    Path('min_cost.txt').write_text(f"{value(model.cost):.2f}")


if __name__ == '__main__':
    main()
```

Add the problem to `src/hud_controller/problems/basic.py` (or create a new file):
```python
from hud_controller.spec import ProblemSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem_min_cost",
        template="my_problem",
        golden_script="or_my_problem.py",
        description="""
        Your task description here. Explain the optimization problem.
        Create a file called min_cost.txt with the minimum cost value.
        """,
        difficulty="easy",
        required_outputs={"min_cost.txt": "25.00"},  # Expected value after trim
    )
)
```

Use `imagectl3.py` to build and validate:
```bash
# Build and validate a specific problem
uv run utils/imagectl3.py myprefix_ -bv --ids my_problem_min_cost

# Build and validate all problems
uv run utils/imagectl3.py myprefix_ -bv
```

The validation workflow:
- Copies the template to a temp directory
- Runs the golden script
- Verifies all required outputs match expected values
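The three steps above can be approximated in plain Python (a sketch assuming golden scripts write their outputs to the current working directory; `validate_problem` is illustrative, not the real `imagectl3.py` code):

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def validate_problem(template_dir: Path, golden_script: Path,
                     required_outputs: dict[str, str]) -> bool:
    """Copy the template to a temp dir, run the golden script, check outputs."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp) / "workspace"
        shutil.copytree(template_dir, work)          # step 1: copy template
        subprocess.run([sys.executable, str(golden_script)],
                       cwd=work, check=True)         # step 2: run golden script
        return all(                                  # step 3: verify outputs
            (work / name).exists()
            and (work / name).read_text().strip() == expected
            for name, expected in required_outputs.items()
        )
```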
Install dependencies:

```bash
uv sync
```

```bash
# Build all images with prefix, validate, and generate JSON configs
uv run utils/imagectl3.py or_ -bvj

# Run with parallel jobs for faster builds
uv run utils/imagectl3.py or_ -bvj --jobs 4
```

```bash
uv run hud local-claude-hud.json claude --max-steps 50

# or for OpenAI
uv run hud local-openai-hud.json openai --max-steps 50
```

Push images to a registry first:
```bash
# Build, validate, generate JSON, and push
uv run utils/imagectl3.py yourusername/or_ -bvjp --jobs 4
```

Then run remotely:

```bash
uv run hud remote-claude-hud.json claude --max-steps 50
```

Key environment variables:
- `MCP_TESTING_MODE` - Enable testing tools (default: `"1"`)
- `HINTS` - Hint mode: `"none"` or `"all"` (default: `"none"`)
- `PROBLEM_ID` - The problem ID to run
- `TEMPLATE` - The template folder to use
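For reference, these can be read inside the container with ordinary environment lookups (a sketch; only the defaults documented above are assumed):

```python
import os

testing_mode = os.environ.get("MCP_TESTING_MODE", "1") == "1"  # default "1" (enabled)
hints = os.environ.get("HINTS", "none")                        # "none" or "all"
problem_id = os.environ.get("PROBLEM_ID")                      # no documented default
template = os.environ.get("TEMPLATE")                          # template folder name
```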
The Dockerfile accepts these build arguments:
- `TEMPLATE` - Which template folder to copy to `/home/ubuntu/workspace`
- `PROBLEM_ID` - The problem ID for the image
- `HINTS` - Whether to include hints in the prompt
You can add hints to problems:
```python
from hud_controller.spec import ProblemSpec, HintSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem",
        # ... other fields ...
        hints=[
            HintSpec(
                hint_type="legit",
                text="Use the CBC solver with Pyomo",
                why_legitmate="This is documented in the problem setup",
            ),
        ],
    )
)
```

Build with hints enabled:

```bash
uv run utils/imagectl3.py prefix_ -bv --hints all
```

- Clear Descriptions: Provide detailed, unambiguous optimization problem statements
- Focused Scope: Each task should test one concept or skill
- Realistic Scenarios: Base tasks on real-world operations research problems
- Fair Hints: If providing hints, ensure they guide without giving away the solution
- Use Pyomo: Model optimization problems with Pyomo and solve with CBC
- Clear Output: Write exactly what's expected, nothing more
- Error Handling: Handle missing files gracefully for better error messages
- Comments: Document the mathematical formulation for maintainability
- Self-Contained: Include all necessary data files
- Reasonable Size: Keep datasets small enough for quick Docker builds
- Clear Naming: Use descriptive file names
- No Secrets: Don't include expected outputs in the template
- Trimmed Comparison: Values are compared after stripping whitespace
- Exact Match: The output must exactly match the expected value
- Multiple Outputs: You can require multiple output files
- Simple Values: Keep expected outputs simple (numbers, short strings)
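For example, trailing whitespace is tolerated but any other deviation fails (plain Python illustrating the trim-then-exact-match rule):

```python
expected = "25.00"

# A trailing newline (e.g. from print()) is stripped before comparison.
assert "25.00\n".strip() == expected

# But the match is otherwise exact: "25.0" does not equal "25.00".
assert "25.0".strip() != expected
```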