Commit 45f16f5

Merge pull request #67 from hud-evals/l/mcp-additions
mcp additions
2 parents f4773cf + fdc63d6 commit 45f16f5

128 files changed: +18889 -1363 lines changed

.github/workflows/ci.yml

Lines changed: 14 additions & 1 deletion
@@ -11,7 +11,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11", "3.12", "3.13"]
+        python-version: ["3.11", "3.12", "3.13"]

     steps:
       - name: Check out code
@@ -23,7 +23,20 @@ jobs:
       - name: Install Python
         run: uv python install ${{ matrix.python-version }}

+      - name: Setup virtual display
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y xvfb
+          Xvfb :99 -screen 0 1920x1080x24 -ac &
+          sleep 3
+
+      - name: Install Playwright browsers
+        run: uv run --with=".[dev]" playwright install chromium
+
       - name: Run tests
+        env:
+          DISPLAY: :99
+          XAUTHORITY: /dev/null
         run: uv run --python ${{ matrix.python-version }} --with=".[dev]" pytest --rootdir=hud --cov --cov-report=''

   lint-ruff:

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -27,3 +27,5 @@ test.json
 TODO.md

 .coverage
+
+*.log

environments/README.md

Lines changed: 163 additions & 0 deletions
@@ -0,0 +1,163 @@
# HUD MCP Environment Requirements

Quick guide for creating HUD-compatible MCP environments.

## Required MCP Tools

### 1. `setup` Tool
```python
@mcp.tool()
async def setup(config: dict) -> dict:
    """Initialize environment from task.setup config (any format)."""
    # Handle config however your environment needs
    return {"status": "success"}  # Return format is flexible
```

### 2. `evaluate` Tool
```python
@mcp.tool()
async def evaluate(config: dict) -> dict:
    """Evaluate task completion from task.evaluate config."""
    # Your evaluation logic
    return {
        "reward": 1.0,  # Required: 0.0-1.0 score
        "done": True,  # Required: completion flag
        "info": {}  # Optional: metadata
    }
```
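
The return contract above can be checked before a result is handed back to the framework. The following is a minimal sketch, not part of HUD: `validate_eval_result` is a hypothetical helper name, and it assumes that `reward` must be a number in the range 0.0 to 1.0, `done` a boolean, and `info` an optional dict, as the comments in the `evaluate` example indicate.

```python
from typing import Any


def validate_eval_result(result: dict[str, Any]) -> dict[str, Any]:
    """Check an evaluate() result against the shape described above.

    Hypothetical helper (not a HUD API): raises ValueError on a malformed
    result and returns a normalized copy in which `info` is always present.
    """
    for key in ("reward", "done"):
        if key not in result:
            raise ValueError(f"missing required key: {key!r}")

    reward = result["reward"]
    # bool is a subclass of int, so reject it explicitly
    if isinstance(reward, bool) or not isinstance(reward, (int, float)):
        raise ValueError("reward must be a number")
    if not 0.0 <= reward <= 1.0:
        raise ValueError("reward must be between 0.0 and 1.0")

    if not isinstance(result["done"], bool):
        raise ValueError("done must be a bool")

    info = result.get("info", {})
    if not isinstance(info, dict):
        raise ValueError("info must be a dict when provided")

    return {"reward": float(reward), "done": result["done"], "info": info}
```

Calling such a check at the end of `evaluate` would surface a malformed result inside the environment rather than later in the agent loop.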

### 3. Interaction Tool(s)
Provide at least one tool that the agent can call during task execution:

```python
# Option A: Use HUD computer tool
from hud.tools import HudComputerTool
from hud.tools.helper import register_instance_tool

register_instance_tool(mcp, "computer", HudComputerTool())

# Option B: Custom API tool
@mcp.tool()
async def api_request(url: str, method: str = "GET", data: dict | None = None) -> dict:
    # Your API logic
    pass
```

**Note**: `setup` and `evaluate` are **lifecycle** MCP tools that:
- Are **automatically discovered** by the MCP client (always available to the framework)
- Are **filtered out** from the LLM conversation (not in `allowed_tools`)
- Are **called programmatically** by the agent during task execution

Only include **interaction tools** in `allowed_tools`: `computer`, `anthropic_computer`, `api_request`, etc.
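
The filtering rule in the note can be pictured as a simple partition of the discovered tool names. This is an illustrative sketch only: `LIFECYCLE_TOOLS` and `split_tools` are hypothetical names, and the actual agent may implement the filtering differently.

```python
LIFECYCLE_TOOLS = frozenset({"setup", "evaluate"})  # assumed lifecycle tool names


def split_tools(discovered: list[str], allowed_tools: list[str]) -> tuple[list[str], list[str]]:
    """Partition discovered tool names into (llm_tools, lifecycle_tools).

    llm_tools: tools listed in allowed_tools that are not lifecycle tools.
    lifecycle_tools: setup/evaluate, kept aside for programmatic calls.
    """
    allowed = set(allowed_tools)
    llm_tools = [t for t in discovered if t in allowed and t not in LIFECYCLE_TOOLS]
    lifecycle_tools = [t for t in discovered if t in LIFECYCLE_TOOLS]
    return llm_tools, lifecycle_tools


llm_tools, lifecycle_tools = split_tools(
    ["setup", "evaluate", "computer", "api_request"],
    allowed_tools=["computer"],
)
print(llm_tools)        # ['computer']
print(lifecycle_tools)  # ['setup', 'evaluate']
```

In this sketch a lifecycle tool stays hidden from the model even if it is listed in `allowed_tools` by mistake.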

## Config Flexibility

Task configs can be **any format**:
```python
task.setup = {"function": "reset", "args": {}}
task.setup = {"id": "task_123"}
task.setup = {"name": "problem_name"}
task.setup = "simple_string"
task.setup = ["step1", "step2"]
# Your environment decides what formats to support
```
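
Because the config can take any shape, the environment has to decide how to interpret each one. One possible approach, sketched below, normalizes the formats listed above into a list of `(function, args)` steps. The name `normalize_config` and the mapping rules (for example, treating `id` and `name` as a `load` step) are assumptions for illustration, not HUD behavior.

```python
from typing import Any

Step = tuple[str, dict[str, Any]]


def normalize_config(config: Any) -> list[Step]:
    """Turn a task.setup / task.evaluate config into (function, args) steps.

    Hypothetical mapping rules; adapt them to what your environment supports.
    """
    if isinstance(config, str):
        # "simple_string" -> one step with no arguments
        return [(config, {})]
    if isinstance(config, list):
        # ["step1", "step2"] -> normalize each item and keep the order
        return [step for item in config for step in normalize_config(item)]
    if isinstance(config, dict):
        if "function" in config:
            return [(config["function"], dict(config.get("args") or {}))]
        if "id" in config:
            return [("load", {"id": config["id"]})]
        if "name" in config:
            return [("load", {"name": config["name"]})]
    raise ValueError(f"unsupported config format: {config!r}")


print(normalize_config({"function": "reset", "args": {}}))  # [('reset', {})]
print(normalize_config(["step1", "step2"]))  # [('step1', {}), ('step2', {})]
```

Rejecting unknown shapes with an explicit error keeps a malformed task from silently running with an empty setup.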

## Minimal Example

```python
from fastmcp import FastMCP
from hud.tools import HudComputerTool
from hud.tools.helper import register_instance_tool

mcp = FastMCP("My Environment")

@mcp.tool()
async def setup(config: dict) -> dict:
    return {"status": "success"}

@mcp.tool()
async def evaluate(config: dict) -> dict:
    return {"reward": 1.0, "done": True, "info": {}}

@mcp.initialize()
async def init():
    register_instance_tool(mcp, "computer", HudComputerTool())

if __name__ == "__main__":
    mcp.run()
```

## Testing Your Environment

### Unified Agent Interface

```python
import asyncio
from hud.mcp_agent import ClaudeMCPAgent
from hud import Task
from mcp_use import MCPClient

async def test_environment():
    # Connect to your environment
    config = {"mcpServers": {"env": {"command": "python", "args": ["my_env.py"]}}}
    client = MCPClient.from_dict(config)

    # Create agent (only specify interaction tools)
    agent = ClaudeMCPAgent(
        client=client,
        model="claude-sonnet-4-20250514",
        allowed_tools=["computer", "api_request"]  # Interaction tools only
    )

    # Simple query
    result = await agent.run("Take a screenshot and describe what you see")
    print(f"Query result: {result}")

    # Full task with lifecycle
    task = Task(
        prompt="Complete the todo app workflow",
        setup={"function": "todo_seed", "args": {"num_items": 3}},
        evaluate={"function": "todo_completed", "args": {"expected_count": 1}}
    )

    eval_result = await agent.run(task)
    print(f"Task result: {eval_result}")
    # Returns: {"reward": 1.0, "done": True, "info": {...}}

    await client.close_all_sessions()

# Run the test
asyncio.run(test_environment())
```

### Direct MCP Tool Testing

```python
import asyncio

from mcp_use import MCPClient


async def test_tools():
    # Test individual tools
    config = {"mcpServers": {"env": {"command": "python", "args": ["my_env.py"]}}}
    client = MCPClient.from_dict(config)
    session = await client.create_session("env")

    # Test setup/evaluate tools directly
    setup_result = await session.connector.call_tool("setup", {"function": "test_setup"})
    eval_result = await session.connector.call_tool("evaluate", {"function": "test_eval"})

    await client.close_all_sessions()


# `await` is only valid inside a coroutine, so run the test through asyncio
asyncio.run(test_tools())
```

## Key Features

- **Unified Interface**: Single `agent.run()` method handles both simple queries and full task lifecycle
- 🔄 **Automatic Lifecycle**: Setup → Execute → Evaluate phases managed automatically
- 📋 **Flexible Config**: Support any setup/evaluate config format your environment needs
- 🔧 **Easy Integration**: Import HUD tools with `register_instance_tool()`
- 🛡️ **Smart Tool Filtering**: Lifecycle tools auto-discovered but hidden from LLM conversation
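
The Setup → Execute → Evaluate sequence can be sketched without any HUD dependency. Everything below is hypothetical: `FakeEnv` stands in for an MCP session, `run_task` for the lifecycle that `agent.run(task)` is described as handling, and the agent's tool-calling loop is reduced to a single callback. The `complete_item` tool is likewise invented for the example.

```python
import asyncio
from typing import Any, Awaitable, Callable


class FakeEnv:
    """In-memory stand-in for an MCP environment (hypothetical)."""

    def __init__(self) -> None:
        self.calls: list[str] = []
        self.completed = 0

    async def call_tool(self, name: str, args: dict[str, Any]) -> dict[str, Any]:
        self.calls.append(name)
        if name == "setup":
            self.completed = 0
            return {"status": "success"}
        if name == "complete_item":  # invented interaction tool
            self.completed += 1
            return {"ok": True}
        if name == "evaluate":
            ok = self.completed >= args.get("expected_count", 1)
            return {"reward": 1.0 if ok else 0.0, "done": True, "info": {}}
        raise ValueError(f"unknown tool: {name}")


async def run_task(
    env: FakeEnv,
    setup: dict[str, Any],
    evaluate: dict[str, Any],
    act: Callable[[FakeEnv], Awaitable[None]],
) -> dict[str, Any]:
    """Run the three phases in order and return the evaluation result."""
    await env.call_tool("setup", setup)              # 1. Setup
    await act(env)                                   # 2. Execute (agent loop)
    return await env.call_tool("evaluate", evaluate)  # 3. Evaluate


async def act(env: FakeEnv) -> None:
    # Stand-in for the LLM-driven loop: complete one item.
    await env.call_tool("complete_item", {})


env = FakeEnv()
result = asyncio.run(run_task(env, {"num_items": 3}, {"expected_count": 1}, act))
print(env.calls)  # ['setup', 'complete_item', 'evaluate']
print(result)     # {'reward': 1.0, 'done': True, 'info': {}}
```

A fake environment like this makes it possible to unit-test evaluation logic without launching a browser or an MCP server.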

## Examples

### Environment Examples
- [`simple_browser/`](./simple_browser/) - Computer tool + GUI automation
- [`qa_controller/`](./qa_controller/) - Text-based environment

### Usage Examples
- [`simple_task_example.py`](../examples/agents_tools/simple_task_example.py) - Complete demo with simple_browser environment

environments/novnc_ubuntu/Dockerfile

Lines changed: 0 additions & 8 deletions
This file was deleted.

environments/novnc_ubuntu/pyproject.toml

Lines changed: 0 additions & 17 deletions
This file was deleted.

environments/novnc_ubuntu/src/hud_controller/__init__.py

Lines changed: 0 additions & 7 deletions
This file was deleted.
