Commit cea28de

Merge pull request #235 from hud-evals/v5 ("V5")

2 parents: cee3107 + b35738b

File tree: 449 files changed (+17,898 / −46,468 lines)


.github/workflows/ci.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -59,4 +59,4 @@ jobs:
       uses: astral-sh/setup-uv@v5

     - name: Run pyright
-      run: uv run --with=".[rl,dev]" pyright
+      run: uv run --with=".[dev]" pyright
```

README.md

Lines changed: 70 additions & 332 deletions
Large diffs are not rendered by default.
Lines changed: 105 additions & 0 deletions (new file)
---
title: "Testing Environments"
description: "Test scenarios, tools, and environment logic locally"
icon: "flask-vial"
---

Before deploying, test locally. See [Sandboxing](/guides/sandboxing) for Docker vs no-Docker patterns.

## Local Testing

| Environment | `local_test.py` |
|-------------|-----------------|
| No Docker | `from env import env` |
| Docker | `env.connect_url("http://localhost:8765/mcp")` |

Both use the same API after setup:

```python
async with env:
    tools = env.as_tools()                              # List available tools
    result = await env.call_tool("my_tool", arg="val")  # Call a tool
```

## Testing Scenarios Directly

Scenarios are async generators. `hud.eval()` drives them automatically, but you can test the logic directly—this is exactly what runs at the start and end of `hud.eval()`:

```python
async def checkout(user_id: str, amount: int = 100):
    # Setup + prompt (first yield) — runs at hud.eval() start
    answer = yield f"Complete checkout for {user_id}, ${amount}"

    # Evaluation (second yield) — runs after agent submits
    yield 1.0 if "success" in answer.lower() else 0.0

async def test():
    gen = checkout("alice", 50)
    prompt = await anext(gen)             # What hud.eval() does at start
    reward = await gen.asend("Success!")  # What hud.eval() does after submit
    assert reward == 1.0
```

If your scenario tests pass, `hud.eval()` will behave identically.
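The same driver drops into a test runner; a minimal sketch, assuming `pytest-asyncio` is installed and the `checkout` scenario above is importable (both assumptions, not part of the docs page itself):

```python
import pytest

# Hypothetical pytest module exercising the failure path of checkout().
# Assumes pytest-asyncio supplies the asyncio marker.
@pytest.mark.asyncio
async def test_checkout_failure_path():
    gen = checkout("bob", 25)
    await anext(gen)                              # first yield: the prompt
    reward = await gen.asend("Payment declined")  # second yield: the reward
    assert reward == 0.0                          # "success" not in the answer
```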
## Mocking

`env.mock()` intercepts at the tool layer—agents only see tools:

```python
env.mock()  # All tools return fake responses
env.mock_tool("send_email", {"status": "sent"})

# Check mock state
assert env.is_mock == True
```
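One way to exercise a mocked tool end to end; a sketch, assuming `env.mock_tool()` makes the normal `call_tool()` path return the configured payload verbatim (the `to=` argument is a hypothetical tool parameter):

```python
async with env:
    env.mock_tool("send_email", {"status": "sent"})
    # Assumption: the mock echoes the configured response through call_tool.
    result = await env.call_tool("send_email", to="alice@example.com")
    assert result == {"status": "sent"}
```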
## Hot-Reload

For Docker environments, `hud dev -w path` reloads Python on save:

```bash
hud dev -w scenarios -w tools --port 8765
```

System services (postgres, VNC, browsers) persist across reloads.
## Debugging Build Failures

`hud build` runs the exact same pipeline as **New → Environment** on [hud.ai](https://hud.ai)—so if it passes locally, it'll work in production. If the build fails or the container crashes on startup, use `hud debug` to run a 5-phase compliance test:

```bash
hud debug my-env:latest
```

Output shows exactly which phase failed:

```
✓ Phase 1: Docker image exists
✓ Phase 2: MCP server responds to initialize
✗ Phase 3: Tool discovery failed
  → Error: Connection refused on port 8005
  → Hint: Backend service may not be starting
```

You can also debug a directory (builds first) or stop at a specific phase:

```bash
hud debug .                  # Build and debug current directory
hud debug . --max-phase 3    # Stop after phase 3
hud debug --config mcp.json  # Debug from config file
```

## Useful Environment Properties

```python
# Check parallelization (for running multiple evals)
env.is_parallelizable  # True if all connections are remote

# List what's connected
env.connections   # Dict of connection names → connectors
env.is_connected  # True if in async context

# Resources and prompts (beyond tools)
await env.list_resources()  # MCP resources
await env.list_prompts()    # MCP prompts
```
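As an illustration of how `is_parallelizable` might gate a sweep, a hedged sketch (not from the docs page; `run_one` is a hypothetical per-task driver):

```python
import asyncio

async def run_one(env, task):
    # Hypothetical per-task driver; replace with your agent loop.
    return await env.call_tool("my_tool", arg=task)

async def run_many(env, tasks):
    if not env.is_parallelizable:
        # Some connection is local: run sequentially to avoid contention.
        return [await run_one(env, t) for t in tasks]
    # All connections remote: safe to fan out concurrently.
    return await asyncio.gather(*(run_one(env, t) for t in tasks))
```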

docs/beta/index.mdx

Lines changed: 1 addition & 1 deletion

```diff
@@ -11,5 +11,5 @@ Beta features are experimental and may change in future releases.
 ## Available Beta Features

 <Card title="Reinforcement Fine-Tuning (RFT)" icon="brain-circuit" href="/beta/rft">
-  Fine-tune models with reinforcement learning on your HUD tasks (invite-only)
+  Fine-tune models on your HUD tasks (invite-only)
 </Card>
```

docs/build-environments/index.mdx

Lines changed: 17 additions & 4 deletions

````diff
@@ -66,9 +66,6 @@ hud eval tasks.json

 # Deploy to registry
 hud push
-
-# Train agents on your tasks
-hud rl tasks.json
 ```
@@ -83,7 +80,6 @@ hud rl tasks.json
 | Troubleshoot | `hud debug my-env:dev` |
 | Build image | `hud build` |
 | Push to registry | `hud push` |
-| RL training | `hud rl tasks.json` |

 ---
@@ -93,3 +89,20 @@
 * **CLI reference**: [CLI Overview](/reference/cli/overview)

 Have fun – and remember: *stderr for logs, stdout for MCP!*
+
+---
+
+## Available Environments
+
+Browse ready-to-use environments and templates at **[hud.ai/environments](https://hud.ai/environments)**.
+
+| Environment | Description |
+|-------------|-------------|
+| `hud-blank` | Minimal starter template |
+| `hud-browser` | Browser automation with Playwright |
+| `hud-remote-browser` | Cloud browser providers (Steel, Anchor, etc.) |
+| `hud-deepresearch` | Deep research with web search |
+| `hud-rubrics` | LLM-as-judge evaluations |
+| `coding-template` | Full coding env with VNC, Postgres, Redis |
+
+Each environment is available as a GitHub template you can fork and customize.
````
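The new template table pairs naturally with the CLI workflow this file documents; a hypothetical first session, assuming `hud init` scaffolds from the starter template (only the `hud dev` flags are confirmed elsewhere in this commit):

```bash
hud init                   # scaffold a new environment (assumed default template)
hud dev -w . --port 8765   # hot-reload while editing
hud build && hud push      # build the image, then deploy to the registry
```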

docs/build-environments/spec.mdx

Lines changed: 2 additions & 2 deletions

````diff
@@ -24,7 +24,7 @@ graph TD
 - No non‑MCP output on stdout (all logging to stderr).
 - No required file layout, framework, or endpoints.

-Recommended (for HUD RL/evals): provide tools named `setup` and `evaluate`.
+Recommended (for HUD evals): provide tools named `setup` and `evaluate`.

 ## Make it runnable remotely (mcp.hud.ai)

@@ -143,7 +143,7 @@ The same structure is used by `hud init`’s template and by programmatic tasks.
 ]
 ```

-Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud rl`, which will help convert it automatically).
+Switching this file to remote is as simple as replacing the `mcp_config` with the `hud` section shown above (or using `hud convert`, which will help convert it automatically).

 Run tasks with either the CLI or an agent:
````

docs/docs.json

Lines changed: 73 additions & 11 deletions

```diff
@@ -29,12 +29,81 @@
   "navigation": {
     "versions": [
       {
-        "version": "0.4.74",
+        "version": "0.5.0",
         "groups": [
           {
             "group": "Get Started",
             "pages": [
               "index",
+              "llm-quickstart"
+            ]
+          },
+          {
+            "group": "Essentials",
+            "pages": [
+              "quick-links/gateway",
+              "quick-links/ab-testing",
+              "quick-links/environments",
+              "quick-links/deploy"
+            ]
+          },
+          {
+            "group": "Guides",
+            "pages": [
+              "guides/integrations",
+              "guides/sandboxing",
+              "guides/best-practices",
+              "migration"
+            ]
+          },
+          {
+            "group": "Advanced",
+            "pages": [
+              "advanced/testing-environments"
+            ]
+          },
+          {
+            "group": "SDK Reference",
+            "pages": [
+              "reference/evals",
+              "reference/environments",
+              "reference/tools",
+              "reference/mcpserver",
+              "reference/agents",
+              "reference/types"
+            ]
+          },
+          {
+            "group": "CLI Reference",
+            "pages": [
+              "reference/cli/overview",
+              "reference/cli/init",
+              "reference/cli/dev",
+              "reference/cli/build",
+              "reference/cli/push",
+              "reference/cli/analyze",
+              "reference/cli/debug",
+              "reference/cli/run",
+              "reference/cli/eval",
+              "reference/cli/rft",
+              "reference/cli/misc"
+            ]
+          },
+          {
+            "group": "Community",
+            "pages": [
+              "contributing"
+            ]
+          }
+        ]
+      },
+      {
+        "version": "0.4.73",
+        "groups": [
+          {
+            "group": "Get Started",
+            "pages": [
+              "index-legacy",
               "quickstart",
               "llm-quickstart"
             ]
@@ -50,10 +119,11 @@
           {
             "group": "SDK Reference",
             "pages": [
+              "reference/eval",
               "reference/tools",
               "reference/agents",
               "reference/types",
-              "reference/environments",
+              "reference/mcpserver",
               "reference/tasks"
             ]
           },
@@ -64,17 +134,10 @@
             "build-environments/spec"
           ]
         },
-        {
-          "group": "Training (RL)",
-          "pages": [
-            "train-agents/quickstart",
-            "train-agents/tasks"
-          ]
-        },
         {
           "group": "HUD Gateway",
           "pages": [
-            "gateway/index"
+            "gateway/index-legacy"
           ]
         },
         {
@@ -103,7 +166,6 @@
             "reference/cli/debug",
             "reference/cli/run",
             "reference/cli/eval",
-            "reference/cli/rl",
             "reference/cli/rft",
             "reference/cli/misc"
           ]
```

docs/evaluate-agents/benchmarks.mdx

Lines changed: 27 additions & 3 deletions

````diff
@@ -18,7 +18,30 @@ hud eval tasks.json
 hud eval hud-evals/SheetBench-50 claude --full
 ```

-- SDK
+- SDK (Context Manager)
+
+```python
+import hud
+
+# Single task evaluation
+async with hud.eval("hud-evals/SheetBench-50:0") as ctx:
+    agent = MyAgent()
+    result = await agent.run(ctx)
+    ctx.reward = result.reward
+
+# All tasks with variants
+async with hud.eval(
+    "hud-evals/SheetBench-50:*",
+    variants={"model": ["claude-sonnet", "gpt-4o"]},
+    group=3,
+    max_concurrent=50,
+) as ctx:
+    agent = create_agent(model=ctx.variants["model"])
+    result = await agent.run(ctx)
+    ctx.reward = result.reward
+```
+
+- SDK (Batch Execution)

 ```python
 from hud.datasets import run_tasks
@@ -108,8 +131,9 @@ results = await run_tasks(

 ## See Also

-- [`hud eval`](/reference/cli/eval)
-- [`hud rl`](/reference/cli/rl)
+- [Evaluation API](/reference/eval) - SDK reference for `hud.eval()`
+- [`hud eval`](/reference/cli/eval) - CLI reference
+- [`hud rft`](/reference/cli/rft)
 - [Tasks](/reference/tasks)
 - [Agents (SDK)](/reference/agents)
````
Lines changed: 2 additions & 1 deletion

```diff
@@ -1,5 +1,5 @@
 ---
-title: "HUD Gateway"
+title: "Gateway"
 description: "Unified LLM inference service with built-in auth and credit management."
 icon: "server"
 ---
@@ -128,3 +128,4 @@ This example demonstrates:
 - Automatic token usage and latency tracking

 View your traces on the [HUD Dashboard](https://hud.ai/home).
+
```
