Operations Research Agent Evaluation Framework

Overview

This is a framework for creating and evaluating AI agent tasks focused on operations research problems. It provides a structured approach to:

  • Define optimization tasks with clear specifications
  • Grade agent solutions by comparing output files to expected values
  • Manage multiple task difficulties (easy, medium, hard)
  • Run tasks in isolated Docker environments with proper grading

Project Structure

.
├── src/hud_controller/          # Main framework code
│   ├── app.py                   # Main MCP server and entry points
│   ├── spec.py                  # Core specifications (ProblemSpec, Grade)
│   ├── grading_runner.py        # Output validation and grading logic
│   ├── utils.py                 # Utility functions
│   ├── setup.py                 # Environment setup
│   ├── problems/                # Task definitions
│   │   └── basic.py             # Problem registrations
│   └── tools/                   # MCP tools for agent interaction
│       ├── base.py              # Base tool definitions
│       ├── bash.py              # Bash execution
│       ├── edit.py              # File editing
│       ├── shell.py             # Shell commands
│       └── apply_patch.py       # Patch application
├── problem_templates/           # Data files for each problem template
│   └── default/                 # Default template (empty workspace)
├── problems/                    # Golden scripts that produce expected outputs
│   ├── or_cookbook_2_1_4.py     # Production planning LP
│   └── or_dc_inventory_6week.py # Multi-period inventory optimization
├── utils/
│   └── imagectl3.py             # Docker image build/push/validate tool
├── pyproject.toml               # Python package configuration
├── Dockerfile                   # Container setup (includes Pyomo + CBC solver)
└── README.md                    # This file

Core Concepts

1. Problem Definition

Problems are defined using the ProblemSpec data class:

ProblemSpec(
    id="cookbook_2_1_4",                    # Unique problem ID
    template="default",                      # Folder name in problem_templates/
    golden_script="or_cookbook_2_1_4.py",   # Script in problems/ that produces correct output
    description="""
Suppose you are thinking about starting up a business to produce Product X...

Write the optimal production quantities of product x and y in optimal_x.txt 
and optimal_y.txt respectively.
    """,
    difficulty="easy",
    required_outputs={"optimal_x.txt": "20", "optimal_y.txt": "60"},  # Expected outputs (trimmed)
)

2. Templates and Golden Scripts

Templates (problem_templates/):

  • Contain data files and any starter code
  • Copied to /home/ubuntu/workspace when the Docker image is built
  • Each template is a folder (e.g., default/)

Golden Scripts (problems/):

  • Python scripts that solve the optimization problem correctly using Pyomo
  • Used during validation to verify expected outputs are correct
  • Run from the workspace directory during grading

3. Output-Based Validation

Tasks are graded by:

  1. Copying the workspace to a clean grading directory
  2. Copying and running the golden script
  3. Comparing each required output file against its expected value (exact string match after trimming whitespace; see the sketch below)
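
A minimal sketch of step 3's comparison, assuming hypothetical names (grading_dir, required_outputs); the framework's actual logic lives in src/hud_controller/grading_runner.py:

from pathlib import Path

def check_outputs(grading_dir: Path, required_outputs: dict[str, str]) -> dict[str, bool]:
    """Hypothetical helper: compare each required output file to its expected value."""
    results = {}
    for filename, expected in required_outputs.items():
        path = grading_dir / filename
        if not path.exists():
            results[filename] = False  # a missing output file fails the check
            continue
        # Trimmed, exact string comparison
        results[filename] = path.read_text().strip() == expected.strip()
    return results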

Creating New Tasks

Step 1: Create the Template

Create a new folder in problem_templates/ with your data files:

problem_templates/
└── my_problem/
    ├── data.csv
    └── config.json

Step 2: Write the Golden Script

Create a Python script in problems/ that produces the expected outputs:

#!/usr/bin/env python3
# problems/or_my_problem.py
"""Golden script for my optimization problem."""

from pathlib import Path
from pyomo.environ import (
    ConcreteModel, Var, Objective, Constraint,
    NonNegativeReals, minimize, SolverFactory, value
)

def main():
    model = ConcreteModel()
    
    # Decision variables
    model.x = Var(domain=NonNegativeReals)
    model.y = Var(domain=NonNegativeReals)
    
    # Objective
    model.cost = Objective(expr=3*model.x + 2*model.y, sense=minimize)
    
    # Constraints
    model.c1 = Constraint(expr=model.x + model.y >= 10)
    model.c2 = Constraint(expr=2*model.x + model.y >= 15)
    
    # Solve
    SolverFactory('cbc').solve(model)
    
    # Write the output file
    Path('min_cost.txt').write_text(f"{value(model.cost):.2f}")

if __name__ == '__main__':
    main()

Step 3: Register the Problem

Add the problem to src/hud_controller/problems/basic.py (or create a new file):

from hud_controller.spec import ProblemSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem_min_cost",
        template="my_problem",
        golden_script="or_my_problem.py",
        description="""
Your task description here. Explain the optimization problem.

Create a file called min_cost.txt with the minimum cost value.
        """,
        difficulty="easy",
        required_outputs={"min_cost.txt": "25.00"},  # Expected value after trim
    )
)

Step 4: Validate Your Problem

Use imagectl3.py to build and validate:

# Build and validate a specific problem
uv run utils/imagectl3.py myprefix_ -bv --ids my_problem_min_cost

# Build and validate all problems
uv run utils/imagectl3.py myprefix_ -bv

The validation workflow:

  1. Copies the template to a temp directory
  2. Runs the golden script
  3. Verifies that all required outputs match their expected values (see the sketch below)
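
Conceptually, the validation pass does something like the sketch below (a hypothetical outline; the real implementation is in utils/imagectl3.py):

import shutil
import subprocess
import tempfile
from pathlib import Path

def validate_problem(template: str, golden_script: str,
                     required_outputs: dict[str, str]) -> bool:
    """Hypothetical sketch of the build-time validation loop."""
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # 1. Copy the template into a scratch directory
        shutil.copytree(Path("problem_templates") / template, workdir,
                        dirs_exist_ok=True)
        # 2. Copy in the golden script and run it from that directory
        shutil.copy(Path("problems") / golden_script, workdir / golden_script)
        subprocess.run(["python", golden_script], cwd=workdir, check=True)
        # 3. Verify every required output (trimmed exact match)
        return all(
            (workdir / name).read_text().strip() == expected
            for name, expected in required_outputs.items()
        )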

Running Tasks

Setup Environment

uv sync

Build, Validate, and Generate JSON

# Build all images with prefix, validate, and generate JSON configs
uv run utils/imagectl3.py or_ -bvj

# Run with parallel jobs for faster builds
uv run utils/imagectl3.py or_ -bvj --jobs 4

Run HUD Eval Locally

uv run hud local-claude-hud.json claude --max-steps 50
# or for OpenAI
uv run hud local-openai-hud.json openai --max-steps 50

Run HUD Eval Remotely

Push images to a registry first:

# Build, validate, generate JSON, and push
uv run utils/imagectl3.py yourusername/or_ -bvjp --jobs 4

Then run remotely:

uv run hud remote-claude-hud.json claude --max-steps 50

Configuration

Environment Variables

Key environment variables (a sketch of reading them follows this list):

  • MCP_TESTING_MODE - Enable testing tools (default: "1")
  • HINTS - Hint mode: "none" or "all" (default: "none")
  • PROBLEM_ID - The problem ID to run
  • TEMPLATE - The template folder to use
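
A minimal sketch of reading these at startup (the variable names and defaults come from this README; the helper itself is hypothetical):

import os

def load_config() -> dict[str, str]:
    """Hypothetical helper: collect the framework's environment variables."""
    return {
        "testing_mode": os.environ.get("MCP_TESTING_MODE", "1"),  # enable testing tools
        "hints": os.environ.get("HINTS", "none"),                 # "none" or "all"
        "problem_id": os.environ.get("PROBLEM_ID", ""),           # which problem to run
        "template": os.environ.get("TEMPLATE", "default"),        # workspace template
    }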

Docker Build Args

The Dockerfile accepts these build arguments:

  • TEMPLATE - Which template folder to copy to /home/ubuntu/workspace
  • PROBLEM_ID - The problem ID for the image
  • HINTS - Whether to include hints in the prompt

Hints

You can add hints to problems:

from hud_controller.spec import ProblemSpec, HintSpec, PROBLEM_REGISTRY

PROBLEM_REGISTRY.append(
    ProblemSpec(
        id="my_problem",
        # ... other fields ...
        hints=[
            HintSpec(
                hint_type="legit",
                text="Use the CBC solver with Pyomo",
                why_legitmate="This is documented in the problem setup"
            ),
        ],
    )
)

Build with hints enabled:

uv run utils/imagectl3.py prefix_ -bv --hints all

Best Practices

Task Design

  1. Clear Descriptions: Provide detailed, unambiguous optimization problem statements
  2. Focused Scope: Each task should test one concept or skill
  3. Realistic Scenarios: Base tasks on real-world operations research problems
  4. Fair Hints: If providing hints, ensure they guide without giving away the solution

Golden Script Design

  1. Use Pyomo: Model optimization problems with Pyomo and solve with CBC
  2. Clear Output: Write exactly what's expected, nothing more
  3. Error Handling: Handle missing files gracefully for better error messages
  4. Comments: Document the mathematical formulation for maintainability

Template Design

  1. Self-Contained: Include all necessary data files
  2. Reasonable Size: Keep datasets small enough for quick Docker builds
  3. Clear Naming: Use descriptive file names
  4. No Secrets: Don't include expected outputs in the template

Output Validation

  1. Trimmed Comparison: Values are compared after stripping leading and trailing whitespace (see the example below)
  2. Exact Match: The output must exactly match the expected value
  3. Multiple Outputs: You can require multiple output files
  4. Simple Values: Keep expected outputs simple (numbers, short strings)
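
Because the comparison is a trimmed exact match, number formatting matters; for example, an agent that writes "25.0" will not match an expected "25.00":

expected = "25.00"

"25.00\n".strip() == expected   # True:  the trailing newline is trimmed away
"25.0".strip() == expected      # False: the match is exact, "25.0" != "25.00"
f"{25.0:.2f}" == expected       # True:  format outputs explicitly (e.g. :.2f)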
