
Agent Evaluation Framework Template

Overview

This is a template framework for creating and evaluating AI agent tasks. It provides a structured approach to:

  • Define coding tasks with clear specifications
  • Grade agent solutions automatically using test-based validation
  • Manage multiple task difficulties (easy, medium, hard)
  • Run tasks in isolated environments with proper grading

Project Structure

.
├── src/hud_controller/          # Main framework code
│   ├── app.py                   # Main MCP server and entry points
│   ├── spec.py                  # Core specifications (Problem, Grade, Grader)
│   ├── graders.py               # Grading implementations
│   ├── grading_runner.py        # Test execution and grading logic
│   ├── utils.py                 # Utility functions
│   ├── setup.py                 # Environment setup
│   ├── extractors/              # Task definitions by difficulty
│   │   ├── basic_tasks.py       # Easy difficulty tasks
│   │   ├── medium_tasks.py      # Medium difficulty tasks
│   │   └── hard_tasks.py        # Hard difficulty tasks
│   └── tools/                   # MCP tools for testing
│       ├── base.py              # Base tool definitions
│       ├── bash.py              # Bash execution
│       ├── computer.py          # Computer interaction
│       ├── edit.py              # File editing
│       └── run.py               # Command running
├── pyproject.toml               # Python package configuration
├── Dockerfile                   # Container setup
└── README.md                    # This file

Core Concepts

1. Problem Definition

Problems are defined using the @problem decorator with these key fields:

@problem(
    id="unique_task_id",
    description="Detailed task description",
    hints=[],  # Optional hints for agents
    difficulty="easy",  # or "medium", "hard"
    task_type="coding",
    review_level="no-review",  # or other review levels
    base="baseline_branch",
    test="test_branch", 
    golden="golden_solution_branch",
)
def task_name(state: EnvironmentState) -> Grade:
    """Task implementation"""
    # Return grade based on test results

2. Grading System

The grading system is built from the following composable pieces; a short sketch of how they fit together appears after the list:

  • Grader: Base class for all graders
  • SubGrade: Individual grading component with score and weight
  • Grade: Final grade computed from multiple SubGrades
  • AgentPatchGrader: Tests agent solutions by applying patches and running tests
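
A minimal sketch of how these pieces might combine, assuming SubGrade takes a score and a weight and that Grade.from_subscores applies a weighted combination (check spec.py for the exact constructor signatures and import paths):

from hud_controller.spec import Grade, SubGrade  # import path assumed from the layout above

subscores = [
    SubGrade(score=1.0, weight=0.7),  # e.g. the main test suite passed
    SubGrade(score=0.0, weight=0.3),  # e.g. a secondary check failed
]
grade = Grade.from_subscores(subscores)
# Under a weighted average this would come out to 0.7 * 1.0 + 0.3 * 0.0 = 0.7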

3. Test-Based Validation

Tasks are graded by the following steps (a minimal sketch of the result-parsing step appears after the list):

  1. Copying the repository to a clean workspace
  2. Applying a test patch (adds failing tests)
  3. Applying the agent's solution patch
  4. Running specified test files
  5. Parsing JUnit XML results to determine pass/fail
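
For step 5, the pass/fail decision can be read straight from the JUnit XML report that the test runner writes (jest_results.xml in the Jest configuration shown later). A minimal, standard-library sketch; grading_runner.py contains the framework's actual parser:

import xml.etree.ElementTree as ET

def all_tests_passed(junit_xml_path: str = "jest_results.xml") -> bool:
    """Return True only if the JUnit report records no failures or errors."""
    root = ET.parse(junit_xml_path).getroot()
    # The root may be <testsuites> or a single <testsuite>; iter() covers both.
    for suite in root.iter("testsuite"):
        if int(suite.get("failures", 0)) or int(suite.get("errors", 0)):
            return False
    return True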

Creating New Tasks

Step 1: Choose Difficulty Level

Place your task in the appropriate file:

  • extractors/basic_tasks.py - Easy tasks
  • extractors/medium_tasks.py - Medium tasks
  • extractors/hard_tasks.py - Hard tasks

Step 2: Define the Task

@problem(
    id="my_task",
    description="Clear description of what needs to be implemented",
    hints=[
        HintSpec(
            hint_type="legit",  # or "leaky"
            text="Helpful hint text",
            why_legitmate="Explanation of why this hint is fair"
        )
    ],
    difficulty="easy",
    task_type="coding",
    review_level="no-review",
    base="my_task_baseline",
    test="my_task_test",
    golden="my_task_golden",
)
def my_task(state: EnvironmentState) -> Grade:
    """
    Task: Description
    
    :param state: The current state of the environment after the agent has worked
    
    Returns:
        Grade: Score (0.0 to 1.0) based on test results
    
    Grading:
        - Full score (1.0): All tests pass
        - Zero score (0.0): One or more tests fail
    """
    return Grade.from_subscores([
        AgentPatchGrader.grade(
            state=state,
            weight=1.0,
            base="my_task_baseline",
            test="my_task_test",
            golden="my_task_golden",
            jest_test_files=[
                "path/to/test/file.test.ts",
            ],
        )
    ])

Step 3: Prepare Git Branches

You need three branches in your target repository (a small sanity-check sketch follows this list):

  1. baseline - Starting state with the bug/missing feature
  2. test - Adds failing tests that verify the fix
  3. golden - Contains the correct solution (for reference)
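
The branch names must match the base, test, and golden fields in the @problem decorator. A small sanity-check sketch, using the branch names from the Step 2 example; the helper itself is only illustrative:

import subprocess

def branch_exists(repo_dir: str, branch: str) -> bool:
    """Return True if `branch` exists locally in the repository at repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "--verify", "--quiet", f"refs/heads/{branch}"],
        cwd=repo_dir,
        capture_output=True,
    )
    return result.returncode == 0

for name in ("my_task_baseline", "my_task_test", "my_task_golden"):
    assert branch_exists(".", name), f"missing branch: {name}"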

Step 4: Configure Test Files

Specify which test files should run:

  • jest_test_files - For Jest/TypeScript tests
  • playwright_test_files - For Playwright e2e tests (if supported)
  • mocha_test_files - For Mocha tests (if supported)

Running Tasks

Setup Environment

# Install dependencies
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"

Run Grading

# Run a specific problem
setup_problem <problem_id>
grade_problem <problem_id>

# Or use the main entry point
hud_eval

Grading Runner Details

The GradingRunner class handles the entire grading workflow; a hedged sketch of the control flow appears after the list:

  1. Workspace Preparation: Copies repository to isolated workspace
  2. Patch Application: Applies test patch, then agent solution
  3. Build Process: Compiles the project (with cleanup of generated files)
  4. Database Setup: Resets test database and runs migrations (if applicable)
  5. Server Management: Optionally starts server (version-dependent)
  6. Test Execution: Runs specified test files
  7. Result Collection: Parses JUnit XML results
  8. Cleanup: Stops servers and cleans up resources
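
The step ordering matters, and cleanup has to run even when an earlier step fails. The sketch below illustrates that control flow only; every method name is a hypothetical stand-in, not the actual GradingRunner API (see grading_runner.py for the real one):

# Illustrative control flow only; all method names here are hypothetical.
def run_grading_sketch(runner):
    workspace = runner.prepare_workspace()             # 1. copy repo to an isolated dir
    try:
        runner.apply_patches(workspace)                # 2. test patch, then agent patch
        runner.build(workspace)                        # 3. compile, clean generated files
        runner.setup_database(workspace)               # 4. reset test DB, run migrations
        server = runner.maybe_start_server(workspace)  # 5. version-dependent
        try:
            runner.run_tests(workspace)                # 6. run the configured test files
            return runner.collect_results(workspace)   # 7. parse JUnit XML
        finally:
            runner.stop_server(server)                 # 8. cleanup always runs
    finally:
        runner.cleanup(workspace)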

Configuration

Environment Variables

Key environment variables used by the grading system (an example of applying them follows the list):

  • MCP_TESTING_MODE - Enable testing tools (default: "1")
  • NODE_ENV - Node environment (set to "test" for testing)
  • WEBHOOK_FAILURE_TIME_WINDOW - Example task-specific config
  • WEBHOOK_FAILURE_RATE_THRESHOLD - Example task-specific config
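
These are typically applied to the test process environment rather than exported globally. An illustrative way to do that from Python; the yarn command below is only an example:

import os
import subprocess

env = os.environ.copy()
env.update({
    "MCP_TESTING_MODE": "1",  # enable testing tools
    "NODE_ENV": "test",       # run Node in test mode
})
# Run the test suite with the variables applied only to this child process.
subprocess.run(["yarn", "jest", "--ci"], env=env, check=False)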

Docker Configuration

The included Dockerfile sets up the complete environment:

  • Base system with required tools
  • Database (PostgreSQL)
  • Redis
  • Node.js/Yarn
  • VNC for GUI testing (if needed)

Testing Framework Integration

The framework currently supports Jest tests with JUnit XML output:

// jest.config.js should include:
reporters: [
  'default',
  ['jest-junit', {
    outputDirectory: '.',
    outputName: 'jest_results.xml',
  }]
]

Best Practices

Task Design

  1. Clear Descriptions: Provide detailed, unambiguous task descriptions
  2. Focused Scope: Each task should test one concept or skill
  3. Realistic Scenarios: Base tasks on real-world debugging/development scenarios
  4. Fair Hints: If providing hints, ensure they guide without giving away the solution

Test Design

  1. Comprehensive Coverage: Tests should fully validate the requirement
  2. Clear Failures: Test failures should clearly indicate what's wrong
  3. Minimal Changes: Test patches should only add tests, not modify existing code
  4. Isolation: Tests should not depend on external state

Branch Management

  1. Clean Baseline: Baseline should be stable and buildable
  2. Minimal Test Patch: Only add tests that verify the specific requirement
  3. Correct Golden: Golden solution should be minimal and idiomatic

Extending the Framework

Adding New Graders

Create a new grader by extending the Grader base class:

class CustomGrader(Grader):
    name = "CustomGrader"

    @classmethod
    def compute_score(cls, state: EnvironmentState, **kwargs) -> float:
        # Your grading logic here: derive a value in [0.0, 1.0] from `state`
        score = 0.0
        return score

Adding New Test Frameworks

Modify GradingRunner to support additional test frameworks (an illustrative sketch of the pattern follows these steps):

  1. Add test file parameter to __init__
  2. Create test execution method (similar to run_jest_tests)
  3. Ensure JUnit XML output
  4. Update run_grading to call new test method
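
A hedged illustration of that pattern: a new test method only needs to emit JUnit XML that the existing result parsing can read. The mocha_test_files attribute, the run_mocha_tests name, and the mocha-junit-reporter invocation below are all hypothetical and should be adapted to the real GradingRunner:

import subprocess

def run_mocha_tests(self, workspace: str) -> None:
    """Hypothetical example: run Mocha with a JUnit reporter so results
    land in a file the existing result parser can consume."""
    if not self.mocha_test_files:  # hypothetical attribute added in __init__
        return
    subprocess.run(
        [
            "npx", "mocha", *self.mocha_test_files,
            "--reporter", "mocha-junit-reporter",
            "--reporter-options", "mochaFile=./mocha_results.xml",
        ],
        cwd=workspace,
        check=False,
    )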

Troubleshooting

Build Failures

  • Check that baseline branch compiles successfully
  • Verify no generated files interfere (runner cleans up .js files from .ts sources)
  • Review build logs in stderr output

Test Failures

  • Verify test patch applies cleanly to baseline
  • Check that tests fail on baseline + test patch
  • Confirm tests pass on baseline + test + golden patches
  • Review JUnit XML output for specific failures

Server Issues

  • Check version detection logic if server won't start
  • Verify database migrations run successfully
  • Ensure server port (3000) is available

License

This framework template is provided for guidance purposes. Customize as needed for your specific evaluation requirements.

Support

For questions or issues:

  1. Review the example tasks in extractors/ directories
  2. Check the grading logic in grading_runner.py
  3. Examine the problem decorator in spec.py
