---
title: Adapters (Human Guide)
description: A concise guide for human readers to create a Harbor adapter for your benchmark.
---

import { Callout } from 'fumadocs-ui/components/callout';
import { File, Folder, Files } from 'fumadocs-ui/components/files';

To add a new benchmark or dataset to Harbor, you create an [adapter](https://github.com/laude-institute/harbor/tree/main/adapters) that translates the original benchmark's tasks into Harbor format.

<Callout title="Using an AI agent to build your adapter?">
AI agents should follow the spec at [Adapter AI Guideline](/docs/datasets/adapters)
instead of this page. That document contains the complete schema,
all edge cases, and machine-verifiable examples.
Do not use the tutorial below as your source of truth.
</Callout>

<Callout title="Need help or want to contribute?">
Join our [Discord](https://discord.com/invite/6xWPKhGDbA) (`#adapters-announcements`) and reach out to [Lin Shi](mailto:ls2282@cornell.edu). Check the [Adapter List](https://docs.google.com/spreadsheets/d/1mJbiASPm32DDNzEnV6eDGwpEf3FlMUe5dhkZmFjjSoo/edit?gid=0#gid=0) for available benchmarks. We cover API costs for parity experiments.
</Callout>

## Quick Start

```bash
# List available datasets
harbor dataset list

# Scaffold a new adapter interactively
harbor adapter init

# Or with arguments
harbor adapter init my-adapter --name "My Benchmark"
```

## Steps at a Glance

| # | Step | Goal |
|---|------|------|
| 1 | [Understand the benchmark](#1-understand-the-original-benchmark) | Identify instructions, environments, tests, and solutions |
| 2 | [Write the adapter code](#2-write-the-adapter-code) | Generate Harbor-format task directories |
| 3 | [Verify oracle solutions](#3-verify-oracle-solutions) | All oracle solutions pass at 100% reward |
| 4 | [Plan parity & implement agents](#4-plan-parity--implement-agents) | Coordinate with the team; set up agents on both sides |
| 5 | [Run parity experiments](#5-run-parity-experiments) | Compare Harbor vs. original benchmark scores |
| 6 | [Record parity results](#6-record-parity-results) | Save results to `parity_experiment.json` |
| 7 | [Upload results](#7-upload-results) | Push to HuggingFace parity dataset |
| 8 | [Register the dataset](#8-register-the-dataset) | Prepare dataset with `harbor init` and `dataset.toml`, submit for publishing |
| 9 | [Document & submit](#9-document--submit) | Write README, submit PR for review |

---

## 1. Understand the Original Benchmark

Before coding, study the original benchmark and identify four key components:

1. **Task Instructions** — How are tasks described? What do agents need?
2. **Environments** — What setup is required? (Docker, dependencies, file structures)
3. **Tests** — How are solutions evaluated? (unit tests, LLM-as-a-Judge, etc.)
4. **Solutions** — What are the oracle/reference solutions?

## 2. Write the Adapter Code

### 2.0 Read the README template first

The [adapter README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) doubles as a requirements checklist. Read it before writing code — it tells you what you'll need to provide.

### 2.1 Fork and branch

```bash
git clone https://github.com/{you}/harbor.git
cd harbor
git checkout -b {adapter-name}-adapter
```

### 2.2 Target task directory structure

Each generated task should look like this:

<Files>
<Folder name="<adapter-name>" defaultOpen>
<Folder name="<task-id>" defaultOpen>
<File name="task.toml" />
<File name="instruction.md" />
<Folder name="environment" defaultOpen>
<File name="Dockerfile" />
</Folder>
<Folder name="solution" defaultOpen>
<File name="solve.sh" />
</Folder>
<Folder name="tests" defaultOpen>
<File name="test.sh" />
<File name="test_*.py (optional)" />
</Folder>
</Folder>
</Folder>
</Files>

See the [hello-world example](https://github.com/laude-institute/harbor/tree/main/examples/tasks/hello-world) for a concrete reference.
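For orientation, a minimal `tests/test.sh` runs the task's checks and writes a float reward to `/logs/verifier/reward.txt`, the path Harbor's verifier reads (see the appendix). The `grep` check below is purely illustrative — substitute your benchmark's real verification logic:

```shell
#!/bin/bash
# Illustrative test.sh sketch: the grep check stands in for real tests.
mkdir -p /logs/verifier

if grep -q "hello" /app/output.txt 2>/dev/null; then
  echo 1 > /logs/verifier/reward.txt
else
  echo 0 > /logs/verifier/reward.txt
fi
```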

### 2.3 Adapter code structure

Your adapter lives in `harbor/adapters/{adapter-name}/`:

| File | Purpose |
|------|---------|
| `adapter.py` | Core logic: parse benchmark data, generate task dirs |
| `run_adapter.py` | CLI entry point (supports `--output-path`) |
| `template/` | Template files copied into each task |
| `parity_experiment.json` | Parity results (filled in later) |
| `run_{name}.yaml` | Reference config for reproducibility |
| `README.md` | Final documentation (written last) |
| `adapter_metadata.json` | Structured metadata about the adapter |

**Requirements for `run_adapter.py`:**
- Support cloning the source benchmark temporarily (with cleanup)
- Support using an already-cloned repo
- Default output to `datasets/{adapter-name}`, with `--output-path` override
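A minimal sketch of that CLI contract follows. The `BENCHMARK_REPO` URL, the adapter name in the default output path, and the `generate_tasks` body are placeholders, not real Harbor APIs — only the flags and defaults come from the requirements above:

```python
# Sketch of run_adapter.py's CLI contract (placeholder names throughout).
import argparse
import subprocess
import tempfile
from pathlib import Path

BENCHMARK_REPO = "https://github.com/example/benchmark.git"  # placeholder

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Generate Harbor tasks.")
    parser.add_argument("--benchmark-path", default=None,
                        help="Use an already-cloned benchmark repo.")
    parser.add_argument("--output-path", default="datasets/my-adapter",
                        help="Where to write generated task directories.")
    return parser.parse_args(argv)

def generate_tasks(benchmark: Path, output: Path) -> None:
    output.mkdir(parents=True, exist_ok=True)
    # ... parse benchmark data, write one task dir per task ...

def main():
    args = parse_args()
    if args.benchmark_path:
        generate_tasks(Path(args.benchmark_path), Path(args.output_path))
    else:
        # Clone into a temp dir; TemporaryDirectory handles cleanup.
        with tempfile.TemporaryDirectory() as tmp:
            subprocess.run(["git", "clone", BENCHMARK_REPO, tmp], check=True)
            generate_tasks(Path(tmp), Path(args.output_path))

if __name__ == "__main__":
    main()
```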

**Tips:**
- Minor prompt tweaks (e.g., "write files in place without asking") are fine, as long as they apply to both the original benchmark and Harbor sides.
- Adapting only a subset of tasks is acceptable if documented in the README.

## 3. Verify Oracle Solutions

Run your adapter with the oracle agent and confirm **100% reward on all tasks**.

```bash
# Single task
harbor trial start -p datasets/<adapter-name>/<task-id>

# Entire dataset
harbor run -p datasets/<adapter-name>

# With a config file (recommended for reproducibility)
harbor run -c adapters/<adapter-name>/<config>.yaml -a <agent> -m <model>
```

Once oracle passes, create a **WIP PR** titled `[WIP] Adapter: {adapter_name}` with a screenshot of the 100% pass results.

<Callout title="Broken oracles in the original benchmark?">
Don't fix them on the Harbor side. Document the affected tasks, file bugs to the upstream benchmark, and exclude those tasks if they can't be reliably verified. This keeps Harbor adapters faithful to the original.
</Callout>

## 4. Plan Parity & Implement Agents

Reach out to the team (e.g., **Lin Shi**) on [Discord](https://discord.com/invite/6xWPKhGDbA) **before** running parity experiments. They will help decide:
- Which agents and models to use
- How many runs are needed
- API key provisioning

Depending on your benchmark, you'll fall into one of three scenarios:

**Scenario A — Compatible agents exist:** The original benchmark already supports Harbor-compatible agents (OpenHands, Codex, Claude-Code, etc.). No extra work needed.

**Scenario B — LLM-based, no compatible agents:** Fork the original benchmark, implement a Harbor-compatible agent there, and document it. See the [EvoEval example](https://github.com/laude-institute/harbor/blob/main/adapters/evoeval/parity_experiment.json).

**Scenario C — Custom agents:** Implement the custom agent in Harbor under `adapters/{name}/{agent}.py` and run parity experiments with it. Also run experiments with standard agents (Codex, Claude-Code) to show the adapter generalizes.

<Callout title="Large benchmarks">
For expensive benchmarks, you can run parity on a representative subset. Discuss sampling strategy with the team first. Use `--split parity` in your adapter and register with `"version": "parity"` so users can run `-d {name}@parity`.
</Callout>

## 5. Run Parity Experiments

Run the **same agents, models, and settings** on both the original benchmark and your Harbor adapter, multiple times each. Compare average scores and standard deviations — they should be **comparable** to demonstrate equivalence.

```bash
# Harbor side
harbor run -p datasets/<adapter-name> -a <agent> -m <model>
```

## 6. Record Parity Results

Create `parity_experiment.json` in your adapter directory:

```json
[
{
"adapter_name": "<adapter-name>",
"agent": "<agent>@<version>",
"model": "<model-version>",
"date": "<date>",
"adapted_benchmark_size": "<total-tasks-converted>",
"parity_benchmark_size": "<tasks-used-for-parity>",
"number_of_runs": "<runs-per-side>",
"notes": "<any special notes>",
"original_parity_repo": "<fork-url>",
"adapter_pr": ["<pr-url>"],
"dataset_pr": ["<pr-url>"],
"parity_pr": ["<hf-pr-url>"],
"metrics": [
{
"benchmark_name": "<name>",
"metric": "<metric>",
"original": "<mean ± stderr>",
"harbor": "<mean ± stderr>",
"original_runs": ["<run1>", "<run2>", "..."],
"harbor_runs": ["<run1>", "<run2>", "..."]
}
]
}
]
```
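The `"<mean ± stderr>"` strings can be produced with a few lines of stdlib Python; the run scores below are made-up numbers for illustration:

```python
# Format per-run scores as the "mean ± stderr" string used in
# parity_experiment.json. Stdlib only.
from math import sqrt
from statistics import mean, stdev

def summarize(runs: list[float]) -> str:
    """Standard error of the mean: sample stdev / sqrt(n)."""
    se = stdev(runs) / sqrt(len(runs)) if len(runs) > 1 else 0.0
    return f"{mean(runs):.2f} ± {se:.2f}"

print(summarize([61.0, 63.5, 62.0]))  # example scores → "62.17 ± 0.73"
```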

Also include a summary table in your README:

```markdown
| Agent | Model | Metric | Runs | Dataset Size | Original | Harbor |
|-------|-------|--------|------|--------------|----------|--------|
| codex | gpt-5 | pass@1 | 5 | 2000 (100%) | X ± Y | X ± Y |
```

## 7. Upload Results

Upload parity and oracle results to the [HuggingFace parity-experiments dataset](https://huggingface.co/datasets/harborframework/parity-experiments). The [parity upload skill](https://github.com/harbor-framework/harbor/tree/main/skills/upload-parity-experiments) can automate this workflow.

```
adapters/<adapter_name>/
├── README.md
├── config.yaml
├── original_parity/
├── harbor_parity/
├── oracle/
└── results_collection/
├── result_{original/harbor}_trial1.json
└── ...
```

## 8. Register the Dataset

### 8.1 Generate dataset

```bash
git clone https://github.com/{you}/harbor-datasets.git
cd harbor/adapters/<adapter-name>
uv run run_adapter.py --output-path /path/to/harbor-datasets/datasets/<adapter-name>
```

Generate `dataset.toml`:

```bash
cd harbor-datasets/datasets/<adapter-name>
harbor init
# Select "dataset" when prompted
```

Edit the generated `dataset.toml` to fill in metadata: parity results summary, adapter author credits, and any acknowledgments.

**Version naming:** Use `"1.0"` by default. Follow the original benchmark's naming if it has versions (e.g., "verified", "lite"). Use `"parity"` for parity subsets so users can run `-d <adapter_name>@parity`.

Create a PR to `harbor-datasets`. Request `@Slimshilin` for review.

### 8.2 Test locally

Before submitting for publishing, verify with the `-p` path parameter:

```bash
harbor run -p /path/to/your/dataset
```

<Callout title="Registry testing is only available post-publish">
You cannot test against the registry (using `-d`) until the dataset has been published. Use `-p` (local path) for all pre-publish testing.
</Callout>

### 8.3 Submit for publishing

Include your tasks directory and `dataset.toml` in your adapter PR. Once approved, the Harbor team will publish the dataset to the registry.

### 8.4 Verify post-publish

Once published, verify it loads and runs correctly:

```bash
harbor run -d <organization-name>/<adapter-name>
```

## 9. Document & Submit

Fill out the [README template](https://github.com/laude-institute/harbor/blob/main/docs/adapters/templates/README.md) covering:
- Benchmark bugs discovered and how they were handled
- Special treatments (prompt tweaks, environment adjustments)
- Deviations from the original and why
- Agent implementation details
- Known limitations

Create `adapter_metadata.json` ([see format in full docs](/docs/datasets/adapters#9-document-and-submit)).

When ready, update your PR title from `[WIP]` to `[Ready for Review]` and request review from `@Slimshilin`.

---

## Appendix: Terminal-Bench Migration

If you're converting a Terminal-Bench adapter, here are the key differences:

| Aspect | Terminal-Bench | Harbor |
|--------|---------------|--------|
| Config | `task.yaml` | `task.toml` |
| Instruction | In `task.yaml` | Separate `instruction.md` |
| Dockerfile | Root level | `environment/Dockerfile` |
| Solution | `solution.sh` | `solution/solve.sh` |
| Tests | `run-tests.sh` + `tests/` | `tests/test.sh` |
| Verification | Exit code (pytest) | Reward file: `/logs/verifier/reward.txt` |
| Output dir | `tasks/` | `datasets/` |
| Registry | Dataset-level `dataset_path` | `dataset.toml` + `harbor init` publishing workflow |
| CLI | `tb run --dataset` | `harbor run -d` / `harbor run -t` / `harbor run -p` |
| Metrics | Binary pass/fail | Float rewards, multiple metrics |

**Important:** If Terminal-Bench used a tweaked metric, re-implement to support the **original** benchmark metrics — Harbor supports multiple metrics as rewards.

Migration checklist:
1. Convert `task.yaml` → `task.toml` + `instruction.md`
2. Reorganize files into `environment/`, `solution/`, `tests/` subdirs
3. Update test scripts to write rewards to `/logs/verifier/reward.txt`
4. Change output directory from `tasks/` to `datasets/`
5. Update registry format using `harbor init` and `dataset.toml`
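The file moves in step 2 can be sketched with stdlib Python. The `task.yaml` → `task.toml` + `instruction.md` split in step 1 is benchmark-specific and left out here; `migrate_task` and `MOVES` are hypothetical names, not Harbor tooling:

```python
# Sketch: move one Terminal-Bench task's files into the Harbor layout.
import shutil
from pathlib import Path

# (old path, new path) pairs, relative to the task directory
MOVES = [
    ("Dockerfile", "environment/Dockerfile"),
    ("solution.sh", "solution/solve.sh"),
    ("run-tests.sh", "tests/test.sh"),
]

def migrate_task(task_dir: str) -> None:
    task = Path(task_dir)
    for old, new in MOVES:
        src, dst = task / old, task / new
        if src.exists():
            dst.parent.mkdir(parents=True, exist_ok=True)
            shutil.move(str(src), str(dst))
```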

---

## Resources

- [Harbor docs](/docs/getting-started) — Running tasks and jobs
- [Harbor repo](https://github.com/laude-institute/harbor) — Examples and configs
- [Agent tutorial](/docs/agents) — Creating custom agents
- [Discord](https://discord.com/invite/6xWPKhGDbA) — Ask questions in `#adapters-spam`