This is a template framework for creating and evaluating AI agent tasks. It provides a structured approach to:
- Define coding tasks with clear specifications
- Grade agent solutions automatically using test-based validation
- Manage multiple task difficulties (easy, medium, hard)
- Run tasks in isolated environments with proper grading
```
.
├── src/hud_controller/          # Main framework code
│   ├── app.py                   # Main MCP server and entry points
│   ├── spec.py                  # Core specifications (Problem, Grade, Grader)
│   ├── graders.py               # Grading implementations
│   ├── grading_runner.py        # Test execution and grading logic
│   ├── utils.py                 # Utility functions
│   ├── setup.py                 # Environment setup
│   ├── extractors/              # Task definitions by difficulty
│   │   ├── basic_tasks.py       # Easy difficulty tasks
│   │   ├── medium_tasks.py      # Medium difficulty tasks
│   │   └── hard_tasks.py        # Hard difficulty tasks
│   └── tools/                   # MCP tools for testing
│       ├── base.py              # Base tool definitions
│       ├── bash.py              # Bash execution
│       ├── computer.py          # Computer interaction
│       ├── edit.py              # File editing
│       └── run.py               # Command running
├── pyproject.toml               # Python package configuration
├── Dockerfile                   # Container setup
└── README.md                    # This file
```
Problems are defined using the @problem decorator with these key fields:
```python
@problem(
    id="unique_task_id",
    description="Detailed task description",
    hints=[],                      # Optional hints for agents
    difficulty="easy",             # or "medium", "hard"
    task_type="coding",
    review_level="no-review",      # or other review levels
    base="baseline_branch",
    test="test_branch",
    golden="golden_solution_branch",
)
def task_name(state: EnvironmentState) -> Grade:
    """Task implementation"""
    # Return grade based on test results
```

The framework uses a sophisticated grading system built from the following classes (a small illustrative sketch of how subscores combine appears after the list):
- `Grader`: Base class for all graders
- `SubGrade`: Individual grading component with score and weight
- `Grade`: Final grade computed from multiple SubGrades
- `AgentPatchGrader`: Tests agent solutions by applying patches and running tests
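The real classes live in `spec.py` and `graders.py`. Purely as an illustration, and assuming `SubGrade` exposes its score and weight as plain attributes, a weighted combination of subscores might look like this sketch:

```python
from dataclasses import dataclass

@dataclass
class SubGrade:
    score: float   # 0.0 to 1.0 for one grading component
    weight: float  # relative importance of this component

def combine(subgrades: list[SubGrade]) -> float:
    """One plausible way to reduce weighted subscores to a final 0.0-1.0 grade."""
    total_weight = sum(s.weight for s in subgrades)
    if total_weight == 0:
        return 0.0
    raw = sum(s.score * s.weight for s in subgrades) / total_weight
    return max(0.0, min(1.0, raw))
```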
Tasks are graded by the following steps; a rough sketch of the patch-and-parse portion appears after the list:
- Copying the repository to a clean workspace
- Applying a test patch (adds failing tests)
- Applying the agent's solution patch
- Running specified test files
- Parsing JUnit XML results to determine pass/fail
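As an illustration only (the actual logic lives in `grading_runner.py`), the patch-application and result-parsing steps could look like the following; the function names, paths, and error handling here are assumptions:

```python
import subprocess
import xml.etree.ElementTree as ET

def apply_patch(workspace: str, patch_file: str) -> None:
    # Apply a patch (the test patch or the agent's solution) on top of the workspace copy.
    subprocess.run(["git", "apply", patch_file], cwd=workspace, check=True)

def tests_passed(junit_xml_path: str) -> bool:
    # JUnit XML reports failures/errors as attributes on <testsuite> elements,
    # either at the root or nested under a <testsuites> root.
    root = ET.parse(junit_xml_path).getroot()
    suites = [root] if root.tag == "testsuite" else list(root)
    return all(
        int(s.get("failures", 0)) == 0 and int(s.get("errors", 0)) == 0
        for s in suites
    )
```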
Place your task in the appropriate file:
- `extractors/basic_tasks.py` - Easy tasks
- `extractors/medium_tasks.py` - Medium tasks
- `extractors/hard_tasks.py` - Hard tasks
```python
@problem(
    id="my_task",
    description="Clear description of what needs to be implemented",
    hints=[
        HintSpec(
            hint_type="legit",  # or "leaky"
            text="Helpful hint text",
            why_legitmate="Explanation of why this hint is fair",
        )
    ],
    difficulty="easy",
    task_type="coding",
    review_level="no-review",
    base="my_task_baseline",
    test="my_task_test",
    golden="my_task_golden",
)
def my_task(state: EnvironmentState) -> Grade:
    """
    Task: Description

    :param state: The current state of the environment after the agent has worked

    Returns:
        Grade: Score (0.0 to 1.0) based on test results

    Grading:
        - Full score (1.0): All tests pass
        - Zero score (0.0): Tests fail
    """
    return Grade.from_subscores([
        AgentPatchGrader.grade(
            state=state,
            weight=1.0,
            base="my_task_baseline",
            test="my_task_test",
            golden="my_task_golden",
            jest_test_files=[
                "path/to/test/file.test.ts",
            ],
        )
    ])
```

You need three branches in your target repository:
- `baseline` - Starting state with the bug/missing feature
- `test` - Adds failing tests that verify the fix
- `golden` - Contains the correct solution (for reference)
Specify which test files should run:
- `jest_test_files` - For Jest/TypeScript tests
- `playwright_test_files` - For Playwright e2e tests (if supported)
- `mocha_test_files` - For Mocha tests (if supported)
```bash
# Install dependencies
pip install -e .

# Or with development dependencies
pip install -e ".[dev]"
```

```bash
# Run a specific problem
setup_problem <problem_id>
grade_problem <problem_id>

# Or use the main entry point
hud_eval
```

The `GradingRunner` class handles the entire grading workflow (an illustrative skeleton follows the steps below):
1. Workspace Preparation: Copies repository to isolated workspace
2. Patch Application: Applies test patch, then agent solution
3. Build Process: Compiles the project (with cleanup of generated files)
4. Database Setup: Resets test database and runs migrations (if applicable)
5. Server Management: Optionally starts server (version-dependent)
6. Test Execution: Runs specified test files
7. Result Collection: Parses JUnit XML results
8. Cleanup: Stops servers and cleans up resources
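For orientation only, these steps map onto a runner skeleton roughly like the one below. Apart from `run_jest_tests` and `run_grading`, which are mentioned later in this README, every method name here is a hypothetical placeholder:

```python
class GradingRunnerSketch:
    """Illustrative outline only; the real logic lives in grading_runner.py."""

    def run_grading(self) -> bool:
        self.prepare_workspace()            # 1. copy the repo to an isolated workspace
        self.apply_patches()                # 2. test patch first, then the agent's solution
        self.build_project()                # 3. compile, cleaning up stale generated files
        self.setup_database()               # 4. reset the test DB and run migrations, if applicable
        server = self.maybe_start_server()  # 5. version-dependent
        try:
            self.run_jest_tests()           # 6. execute the specified test files
            return self.parse_results()     # 7. read JUnit XML to decide pass/fail
        finally:
            self.cleanup(server)            # 8. stop servers, remove temporary resources

    # Each step above would be a small method of its own; stubs are omitted for brevity.
```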
Key environment variables used by the grading system (a sketch of reading them with defaults follows the list):
- `MCP_TESTING_MODE` - Enable testing tools (default: "1")
- `NODE_ENV` - Node environment (set to "test" for testing)
- `WEBHOOK_FAILURE_TIME_WINDOW` - Example task-specific config
- `WEBHOOK_FAILURE_RATE_THRESHOLD` - Example task-specific config
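A hedged example of reading these with fallbacks; the only default documented above is `MCP_TESTING_MODE="1"`, so the other fallback values are placeholders:

```python
import os

# Only MCP_TESTING_MODE has a documented default ("1");
# the remaining fallbacks are illustrative assumptions.
MCP_TESTING_MODE = os.environ.get("MCP_TESTING_MODE", "1")
NODE_ENV = os.environ.get("NODE_ENV", "test")
WEBHOOK_FAILURE_TIME_WINDOW = int(os.environ.get("WEBHOOK_FAILURE_TIME_WINDOW", "60"))
WEBHOOK_FAILURE_RATE_THRESHOLD = float(os.environ.get("WEBHOOK_FAILURE_RATE_THRESHOLD", "0.5"))
```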
The included Dockerfile sets up the complete environment:
- Base system with required tools
- Database (PostgreSQL)
- Redis
- Node.js/Yarn
- VNC for GUI testing (if needed)
The framework currently supports Jest tests with JUnit XML output:
```js
// jest.config.js should include:
reporters: [
  'default',
  ['jest-junit', {
    outputDirectory: '.',
    outputName: 'jest_results.xml',
  }]
]
```

When designing tasks and tests, keep the following practices in mind:

- Clear Descriptions: Provide detailed, unambiguous task descriptions
- Focused Scope: Each task should test one concept or skill
- Realistic Scenarios: Base tasks on real-world debugging/development scenarios
- Fair Hints: If providing hints, ensure they guide without giving away the solution
- Comprehensive Coverage: Tests should fully validate the requirement
- Clear Failures: Test failures should clearly indicate what's wrong
- Minimal Changes: Test patches should only add tests, not modify existing code
- Isolation: Tests should not depend on external state
- Clean Baseline: Baseline should be stable and buildable
- Minimal Test Patch: Only add tests that verify the specific requirement
- Correct Golden: Golden solution should be minimal and idiomatic
Create a new grader by extending the Grader base class:
```python
class CustomGrader(Grader):
    name = "CustomGrader"

    @classmethod
    def compute_score(cls, state: EnvironmentState, **kwargs) -> float:
        # Your grading logic here
        return score  # 0.0 to 1.0
```

Modify `GradingRunner` to support additional test frameworks (a sketch follows the steps below):
- Add test file parameter to `__init__`
- Create test execution method (similar to `run_jest_tests`)
- Ensure JUnit XML output
- Update `run_grading` to call the new test method
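As a sketch of those steps for a hypothetical Mocha integration: `mocha-junit-reporter` is a real Mocha reporter that emits JUnit XML, but the function and parameter names below are assumptions rather than the framework's actual API:

```python
import subprocess

def run_mocha_tests(workspace: str, mocha_test_files: list[str]) -> None:
    """Hypothetical counterpart to run_jest_tests: emits JUnit XML for the existing parser."""
    subprocess.run(
        [
            "npx", "mocha",
            "--reporter", "mocha-junit-reporter",
            "--reporter-options", "mochaFile=./mocha_results.xml",
            *mocha_test_files,
        ],
        cwd=workspace,
        check=True,
    )
```

`run_grading` would then dispatch to a method like this whenever `mocha_test_files` is provided, mirroring the existing Jest path.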
- Check that baseline branch compiles successfully
- Verify no generated files interfere (runner cleans up `.js` files from `.ts` sources)
- Review build logs in stderr output
- Verify test patch applies cleanly to baseline
- Check that tests fail on baseline + test patch
- Confirm tests pass on baseline + test + golden patches
- Review JUnit XML output for specific failures
- Check version detection logic if server won't start
- Verify database migrations run successfully
- Ensure server port (3000) is available
This framework template is provided for guidance purposes. Customize as needed for your specific evaluation requirements.
For questions or issues:
- Review the example tasks in the `extractors/` directory
- Check the grading logic in `grading_runner.py`
- Examine the problem decorator in `spec.py`