Operations Research Agent Evaluation Framework

Overview

This is a framework for creating and evaluating AI agent tasks focused on operations research problems. It provides a structured approach to:

  • Define optimization tasks with clear specifications
  • Grade agent solutions by comparing output files to expected values
  • Manage multiple task difficulties (easy, medium, hard)
  • Run tasks in isolated Docker environments with proper grading

Project Structure

.
├── src/hud_controller/          # Main framework code
│   ├── app.py                   # Main MCP server and entry points
│   ├── spec.py                  # Core specifications (ProblemSpec, Grade)
│   ├── grading_runner.py        # Output validation and grading logic
│   ├── utils.py                 # Utility functions
│   ├── setup.py                 # Environment setup
│   ├── problems/                # Task definitions
│   │   └── basic.py             # Problem registrations
│   └── tools/                   # MCP tools for agent interaction
│       ├── base.py              # Base tool definitions
│       ├── bash.py              # Bash execution
│       ├── edit.py              # File editing
│       ├── shell.py             # Shell commands
│       └── apply_patch.py       # Patch application
├── problem_templates/           # Data files for each problem template
│   └── default/                 # Default template (empty workspace)
├── problems/                    # Golden scripts that produce expected outputs
│   ├── or_cookbook_2_1_4.py     # Production planning LP
│   └── or_dc_inventory_6week.py # Multi-period inventory optimization
├── utils/
│   └── imagectl3.py             # Docker image build/push/validate tool
├── pyproject.toml               # Python package configuration
├── Dockerfile                   # Container setup (includes Pyomo + CBC solver)
└── README.md                    # This file

Core Concepts

1. Problem Definition

Problems are defined using the ProblemSpec data class:

ProblemSpec(
    id="cookbook_2_1_4",                    # Unique problem ID
    template="default",                      # Folder name in problem_templates/
    golden_script="or_cookbook_2_1_4.py",   # Script in problems/ that produces correct output
    description="""
Suppose you are thinking about starting up a business to produce Product X...

Write the optimal production quantities of product x and y in optimal_x.txt 
and optimal_y.txt respectively.
    """,
    difficulty="easy",
    required_outputs={"optimal_x.txt": "20", "optimal_y.txt": "60"},  # Expected outputs (trimmed)
)

2. Templates and Golden Scripts

Templates (problem_templates/):

  • Contain data files and any starter code
  • Copied to /home/ubuntu/workspace when the Docker image is built
  • Each template is a folder (e.g., default/)

Golden Scripts (problems/):

  • Python scripts that solve the optimization problem correctly using Pyomo
  • Used during validation to verify expected outputs are correct
  • Run from the workspace directory during grading

3. Output-Based Validation

Tasks are graded by:

  1. Copying the workspace to a clean grading directory
  2. Copying and running the golden script
  3. Comparing each required output file against its expected value (exact string match after trimming whitespace; see the sketch below)
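
A minimal sketch of step 3's comparison, assuming hypothetical names (grading_dir, required_outputs); the framework's actual logic lives in src/hud_controller/grading_runner.py:

from pathlib import Path

def check_outputs(grading_dir: Path, required_outputs: dict[str, str]) -> dict[str, bool]:
    """Hypothetical helper: compare each required output file to its expected value."""
    results = {}
    for filename, expected in required_outputs.items():
        path = grading_dir / filename
        if not path.exists():
            results[filename] = False  # a missing output file fails the check
            continue
        # Trimmed, exact string comparison
        results[filename] = path.read_text().strip() == expected.strip()
    return results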

Creating New Tasks

Step 1: Create the Template

Create a new folder in problem_templates/ with your data files:

problem_templates/
└── my_problem/
    ├── data.csv
    └── config.json

Step 2: Write the Golden Script

Create a Python script in problems/ that produces the expected outputs:

#!/usr/bin/env python3
# problems/or_my_problem.py
"""Golden script for my optimization problem."""

from pathlib import Path
from pyomo.environ import (
    ConcreteModel, Var, Objective, Constraint,
    NonNegativeReals, minimize, SolverFactory, value
)

def main():
    model = ConcreteModel()
    
    # Decision variables
    model.x = Var(domain=NonNegativeReals)
    model.y = Var(domain=NonNegativeReals)
    
    # Objective
    model.cost = Objective(expr=3*model.x + 2*model.y, sense=minimize)
    
    # Constraints
    model.c1 = Constraint(expr=model.x + model.y >= 10)
    model.c2 = Constraint(expr=2*model.x + model.y >= 15)
    
    # Solve
    SolverFactory('cbc').solve(model)
    
    # Write the output file
    Path('min_cost.txt').write_text(f"{value(model.cost):.2f}")

if __name__ == '__main__':
    main()

Step 3: Register the Problem

Add the problem to src/hud_controller/problems/basic.py (or create a new file):

from hud_controller.spec import ProblemSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem_min_cost",
        template="my_problem",
        golden_script="or_my_problem.py",
        description="""
Your task description here. Explain the optimization problem.

Create a file called min_cost.txt with the minimum cost value.
        """,
        difficulty="easy",
        required_outputs={"min_cost.txt": "25.00"},  # Expected value after trim
    )
)

Step 4: Validate Your Problem

Use imagectl3.py to build and validate:

# Build and validate a specific problem
uv run utils/imagectl3.py myprefix_ -bv --ids my_problem_min_cost

# Build and validate all problems
uv run utils/imagectl3.py myprefix_ -bv

The validation workflow:

  1. Copies the template to a temp directory
  2. Runs the golden script
  3. Verifies that all required outputs match their expected values (see the sketch below)
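
Conceptually, the validation pass does something like the sketch below (a hypothetical outline; the real implementation is in utils/imagectl3.py):

import shutil
import subprocess
import tempfile
from pathlib import Path

def validate_problem(template: str, golden_script: str,
                     required_outputs: dict[str, str]) -> bool:
    """Hypothetical sketch of the build-time validation loop."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # 1. Copy the template into a scratch directory
        shutil.copytree(Path("problem_templates") / template, workdir,
                        dirs_exist_ok=True)
        # 2. Copy in the golden script and run it from that directory
        shutil.copy(Path("problems") / golden_script, workdir / golden_script)
        subprocess.run(["python", golden_script], cwd=workdir, check=True)
        # 3. Verify every required output (trimmed exact match)
        return all(
            (workdir / name).read_text().strip() == expected
            for name, expected in required_outputs.items()
        )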

Running Tasks

Setup Environment

uv sync

Build, Validate, and Generate JSON

# Build all images with prefix, validate, and generate JSON configs
uv run utils/imagectl3.py or_ -bvj

# Run with parallel jobs for faster builds
uv run utils/imagectl3.py or_ -bvj --jobs 4

Run HUD Eval Locally

uv run hud local-claude-hud.json claude --max-steps 50
# or for OpenAI
uv run hud local-openai-hud.json openai --max-steps 50

Run HUD Eval Remotely

Push images to a registry first:

# Build, validate, generate JSON, and push
uv run utils/imagectl3.py yourusername/or_ -bvjp --jobs 4

Then run remotely:

uv run hud remote-claude-hud.json claude --max-steps 50

Configuration

Environment Variables

Key environment variables (a sketch of reading them follows this list):

  • MCP_TESTING_MODE - Enable testing tools (default: "1")
  • HINTS - Hint mode: "none" or "all" (default: "none")
  • PROBLEM_ID - The problem ID to run
  • TEMPLATE - The template folder to use
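
A minimal sketch of reading these at startup (the variable names and defaults come from this README; the helper itself is hypothetical):

import os

def load_config() -> dict[str, str]:
    """Hypothetical helper: collect the framework's environment variables."""
    return {
        "testing_mode": os.environ.get("MCP_TESTING_MODE", "1"),  # enable testing tools
        "hints": os.environ.get("HINTS", "none"),                 # "none" or "all"
        "problem_id": os.environ.get("PROBLEM_ID", ""),           # which problem to run
        "template": os.environ.get("TEMPLATE", "default"),        # workspace template
    }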

Docker Build Args

The Dockerfile accepts these build arguments:

  • TEMPLATE - Which template folder to copy to /home/ubuntu/workspace
  • PROBLEM_ID - The problem ID for the image
  • HINTS - Whether to include hints in the prompt

Hints

You can add hints to problems:

from hud_controller.spec import ProblemSpec, HintSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem",
        # ... other fields ...
        hints=[
            HintSpec(
                hint_type="legit",
                text="Use the CBC solver with Pyomo",
                why_legitmate="This is documented in the problem setup"
            ),
        ],
    )
)

Build with hints enabled:

uv run utils/imagectl3.py prefix_ -bv --hints all

Best Practices

Task Design

  1. Clear Descriptions: Provide detailed, unambiguous optimization problem statements
  2. Focused Scope: Each task should test one concept or skill
  3. Realistic Scenarios: Base tasks on real-world operations research problems
  4. Fair Hints: If providing hints, ensure they guide without giving away the solution

Golden Script Design

  1. Use Pyomo: Model optimization problems with Pyomo and solve with CBC
  2. Clear Output: Write exactly what's expected, nothing more
  3. Error Handling: Handle missing files gracefully for better error messages
  4. Comments: Document the mathematical formulation for maintainability

Template Design

  1. Self-Contained: Include all necessary data files
  2. Reasonable Size: Keep datasets small enough for quick Docker builds
  3. Clear Naming: Use descriptive file names
  4. No Secrets: Don't include expected outputs in the template

Output Validation

  1. Trimmed Comparison: Values are compared after stripping leading and trailing whitespace (see the example below)
  2. Exact Match: The output must exactly match the expected value
  3. Multiple Outputs: You can require multiple output files
  4. Simple Values: Keep expected outputs simple (numbers, short strings)
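
Because the comparison is a trimmed exact match, number formatting matters; for example, an agent that writes "25.0" will not match an expected "25.00":

expected = "25.00"

"25.00\n".strip() == expected   # True:  the trailing newline is trimmed away
"25.0".strip() == expected      # False: the match is exact, "25.0" != "25.00"
f"{25.0:.2f}" == expected       # True:  format outputs explicitly (e.g. :.2f)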
