Commit d3a7e1e

docs: update guidelines
1 parent 3d471c1 commit d3a7e1e

2 files changed

Lines changed: 176 additions & 122 deletions

README.md

Lines changed: 21 additions & 117 deletions
@@ -188,124 +188,28 @@ pnpm compare-batches
188188
pnpm batch-details <batch-id>
189189
```
190190

191-
## Creating New Suites and Scenarios
192-
193-
### Step 1: Create Suite Structure
194-
```bash
195-
# Create new suite directory
196-
mkdir -p suites/my-new-suite/prompts/my-scenario
197-
mkdir -p suites/my-new-suite/scenarios/my-scenario/repo-fixture
198-
```
199-
200-
### Step 2: Create Scenario Configuration
201-
```bash
202-
# Copy and customize the template
203-
cp docs/templates/scenario.yaml suites/my-new-suite/scenarios/my-scenario/scenario.yaml
204-
```
205-
206-
Edit `scenario.yaml` with your specific requirements:
207-
```yaml
208-
id: "my-scenario"
209-
suite: "my-new-suite"
210-
title: "My Custom Scenario"
211-
description: "Description of what this scenario tests"
212-
213-
# Define what needs to be updated
214-
targets:
215-
  required:
216-
    - name: "react"
217-
      to: "^18.0.0"
218-
    - name: "@types/react"
219-
      to: "^18.0.0"
220-
  optional:
221-
    - name: "typescript"
222-
      to: "^5.0.0"
223-
224-
# Define validation commands
225-
validation:
226-
  commands:
227-
    install: "npm install"
228-
    build: "npm run build"
229-
    test: "npm test"
230-
```
231-
232-
### Step 3: Create Repository Fixture
233-
Create a complete codebase with intentional issues:
234-
235-
```bash
236-
# Create package.json with outdated dependencies
237-
cat > suites/my-new-suite/scenarios/my-scenario/repo-fixture/package.json << 'EOF'
238-
{
239-
"name": "test-project",
240-
"version": "1.0.0",
241-
"dependencies": {
242-
"react": "^17.0.0",
243-
"@types/react": "^17.0.0"
244-
},
245-
"devDependencies": {
246-
"typescript": "^4.0.0"
247-
},
248-
"scripts": {
249-
"build": "tsc",
250-
"test": "echo 'Tests pass'"
251-
}
252-
}
253-
EOF
254-
255-
# Add source files, config files, etc.
256-
```
257-
258-
### Step 4: Create Prompts
259-
Create different difficulty tiers:
260-
261-
```bash
262-
# L0 - Minimal context
263-
echo "Update the dependencies in this project." > suites/my-new-suite/prompts/my-scenario/L0-minimal.md
264-
265-
# L1 - Basic context
266-
echo "This React project needs its dependencies updated. Please update React and related packages to their latest compatible versions while ensuring the project still builds and tests pass." > suites/my-new-suite/prompts/my-scenario/L1-basic.md
267-
268-
# L2 - Directed guidance
269-
echo "Update the dependencies in this React project:
270-
1. Update React to the latest 18.x version
271-
2. Update @types/react to match React version
272-
3. Update TypeScript to latest 5.x version
273-
4. Ensure all tests pass
274-
5. Maintain TypeScript compatibility" > suites/my-new-suite/prompts/my-scenario/L2-directed.md
275-
```
276-
277-
### Step 5: Create Oracle Answers
278-
```bash
279-
cat > suites/my-new-suite/scenarios/my-scenario/oracle-answers.json << 'EOF'
280-
{
281-
"react": "^18.0.0",
282-
"@types/react": "^18.0.0",
283-
"typescript": "^5.0.0"
284-
}
285-
EOF
286-
```
287-
288-
### Step 6: Test Your Scenario
289-
```bash
290-
# Test with specific agent and tier
291-
pnpm bench my-new-suite my-scenario L1 anthropic
292-
293-
# Test all tiers
294-
pnpm bench my-new-suite my-scenario --batch anthropic
295-
```
296-
297-
## Documentation
298-
299-
### Comprehensive Guides
300-
- **[Adding Benchmarks](docs/ADDING-BENCHMARKS.md)** - Complete benchmark creation guide
301-
- **[Adding Evaluators](docs/ADDING-EVALUATORS.md)** - Evaluator development guide
302-
- **[Quick Start](docs/QUICK-START.md)** - Fast-track onboarding
303-
- **[Contributing](docs/CONTRIBUTING.md)** - Contribution guidelines
191+
## Contributing
304192

305-
### Templates
306-
- **[Scenario Template](docs/templates/scenario.yaml)** - Annotated configuration
307-
- **[Evaluator Template](docs/templates/heuristic-evaluator.ts)** - Complete evaluator template
308-
- **[Quality Checklists](docs/BENCHMARK-CHECKLIST.md)** - Pre-submission validation
193+
We welcome contributions! Whether you want to add new benchmarks, create evaluators, or improve documentation, we have comprehensive guides to help you get started.
194+
195+
### Quick Start
196+
- **[Contributing Guide](docs/CONTRIBUTING.md)** - Complete contribution guidelines
197+
- **[Adding Benchmarks](docs/ADDING-BENCHMARKS.md)** - Step-by-step benchmark creation
198+
- **[Adding Evaluators](docs/ADDING-EVALUATORS.md)** - Evaluator development guide
199+
200+
### Propose New Benchmarks
201+
Use our GitHub issue template to propose new benchmarks:
202+
1. Go to [GitHub Issues](https://github.com/your-org/ze-benchmarks/issues)
203+
2. Click "New Issue" → "New Benchmark Proposal"
204+
3. Fill out the template with your benchmark idea
205+
4. We'll review and help you implement it!
206+
207+
### Ready to Contribute?
208+
Check out our [Contributing Guide](docs/CONTRIBUTING.md) for detailed instructions on:
209+
- Setting up your development environment
210+
- Creating new benchmarks and evaluators
211+
- Submitting pull requests
212+
- Code quality standards
309213

310214
## Environment Variables
311215

docs/CONTRIBUTING.md

Lines changed: 155 additions & 5 deletions
@@ -14,11 +14,161 @@ We are committed to providing a welcoming and inclusive environment for all cont
1414

1515
## How to Contribute
1616

17-
### 1. Adding Benchmarks
18-
- Create realistic, challenging scenarios
19-
- Follow the directory structure guidelines
20-
- Include comprehensive documentation
21-
- Test with multiple agents and tiers
17+
## Creating New Benchmarks
18+
19+
### Overview
20+
Benchmarks are the core of ze-benchmarks. They test how well AI agents can perform real-world coding tasks. Each benchmark consists of:
21+
22+
- **Suite**: A collection of related benchmarks
23+
- **Scenario**: Individual test cases within a suite
24+
- **Prompts**: Difficulty tiers (L0-L3, Lx) for each scenario
25+
- **Repository Fixture**: Real codebase with intentional issues
26+
- **Oracle Answers**: Expected outcomes for validation
27+
28+
### File Structure
29+
Every benchmark must follow this exact structure:
30+
31+
```
32+
suites/YOUR-SUITE/
33+
├── prompts/YOUR-SCENARIO/
34+
│   ├── L0-minimal.md
35+
│   ├── L1-basic.md
36+
│   ├── L2-directed.md
37+
│   ├── L3-migration.md (optional)
38+
│   └── Lx-adversarial.md (optional)
39+
└── scenarios/YOUR-SCENARIO/
40+
    ├── scenario.yaml
41+
    ├── oracle-answers.json
42+
    └── repo-fixture/
43+
        ├── package.json
44+
        ├── [source files]
45+
        └── [config files]
46+
```
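
One way to confirm that a new scenario matches the layout above is a small shell check. The sketch below is illustrative only: `check_benchmark_layout` is a hypothetical helper, not a command shipped with ze-benchmarks, and it checks only the files that the tree does not mark as optional.

```bash
# Hypothetical helper (not part of ze-benchmarks): report required files
# that are missing from a suite/scenario pair.
# Usage: check_benchmark_layout <suite-dir> <scenario-name>
check_benchmark_layout() {
  suite_dir=$1
  scenario=$2
  missing=0
  for path in \
    "prompts/$scenario/L0-minimal.md" \
    "prompts/$scenario/L1-basic.md" \
    "prompts/$scenario/L2-directed.md" \
    "scenarios/$scenario/scenario.yaml" \
    "scenarios/$scenario/oracle-answers.json" \
    "scenarios/$scenario/repo-fixture/package.json"
  do
    if [ ! -f "$suite_dir/$path" ]; then
      # Print each missing path to stderr and remember the failure.
      echo "missing: $suite_dir/$path" >&2
      missing=1
    fi
  done
  # Exit status 0 when every required file exists, 1 otherwise.
  return "$missing"
}
```

For the example used in this guide, it could be called as `check_benchmark_layout suites/my-new-suite my-scenario && echo "layout OK"`.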
47+
48+
### Step-by-Step Creation
49+
50+
#### Step 1: Create Suite Structure
51+
```bash
52+
# Create new suite directory
53+
mkdir -p suites/my-new-suite/prompts/my-scenario
54+
mkdir -p suites/my-new-suite/scenarios/my-scenario/repo-fixture
55+
```
56+
57+
#### Step 2: Create Scenario Configuration (`scenario.yaml`)
58+
```yaml
59+
id: "my-scenario"
60+
suite: "my-new-suite"
61+
title: "My Custom Scenario"
62+
description: "Description of what this scenario tests"
63+
64+
# Define what needs to be updated
65+
targets:
66+
  required:
67+
    - name: "react"
68+
      to: "^18.0.0"
69+
    - name: "@types/react"
70+
      to: "^18.0.0"
71+
  optional:
72+
    - name: "typescript"
73+
      to: "^5.0.0"
74+
75+
# Define validation commands
76+
validation:
77+
  commands:
78+
    install: "npm install"
79+
    build: "npm run build"
80+
    test: "npm test"
81+
    lint: "npm run lint"
82+
    typecheck: "tsc --noEmit"
83+
```
84+
85+
#### Step 3: Create Repository Fixture
86+
Create a complete codebase with intentional issues:
87+
88+
```bash
89+
# Create package.json with outdated dependencies
90+
cat > suites/my-new-suite/scenarios/my-scenario/repo-fixture/package.json << 'EOF'
91+
{
92+
"name": "test-project",
93+
"version": "1.0.0",
94+
"dependencies": {
95+
"react": "^17.0.0",
96+
"@types/react": "^17.0.0"
97+
},
98+
"devDependencies": {
99+
"typescript": "^4.0.0"
100+
},
101+
"scripts": {
102+
"build": "tsc",
103+
"test": "echo 'Tests pass'"
104+
}
105+
}
106+
EOF
107+
108+
# Add source files, config files, etc.
109+
```
110+
111+
#### Step 4: Create Prompts
112+
Create different difficulty tiers:
113+
114+
**L0 - Minimal context:**
115+
```bash
116+
echo "Update the dependencies in this project." > suites/my-new-suite/prompts/my-scenario/L0-minimal.md
117+
```
118+
119+
**L1 - Basic context:**
120+
```bash
121+
echo "This React project needs its dependencies updated. Please update React and related packages to their latest compatible versions while ensuring the project still builds and tests pass." > suites/my-new-suite/prompts/my-scenario/L1-basic.md
122+
```
123+
124+
**L2 - Directed guidance:**
125+
```bash
126+
echo "Update the dependencies in this React project:
127+
1. Update React to the latest 18.x version
128+
2. Update @types/react to match React version
129+
3. Update TypeScript to latest 5.x version
130+
4. Ensure all tests pass
131+
5. Maintain TypeScript compatibility" > suites/my-new-suite/prompts/my-scenario/L2-directed.md
132+
```
133+
134+
#### Step 5: Create Oracle Answers (`oracle-answers.json`)
135+
```bash
136+
cat > suites/my-new-suite/scenarios/my-scenario/oracle-answers.json << 'EOF'
137+
{
138+
"react": "^18.0.0",
139+
"@types/react": "^18.0.0",
140+
"typescript": "^5.0.0"
141+
}
142+
EOF
143+
```
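
Since the fixture is supposed to start out of date, it can help to check that no oracle target is already satisfied by the fixture's `package.json`. The following is a rough sketch under stated assumptions: `list_unchanged_targets` is a hypothetical helper (not part of ze-benchmarks), and it relies on both files keeping one `"name": "version"` pair per line with a single space after the colon, as in the examples above. It is a line-based text match, not a real JSON parser.

```bash
# Hypothetical helper (not part of ze-benchmarks): print every oracle package
# whose target version already appears verbatim in the fixture's package.json.
# Assumes one "name": "version" pair per line in both files.
# Usage: list_unchanged_targets <oracle-answers.json> <fixture package.json>
list_unchanged_targets() {
  oracle=$1
  fixture=$2
  # Turn each "name": "version" line into "name version"; skip other lines.
  sed -n 's/^[[:space:]]*"\([^"]*\)"[[:space:]]*:[[:space:]]*"\([^"]*\)".*$/\1 \2/p' "$oracle" |
    while read -r name version; do
      # -F matches the pair literally, so "^" and "." are not treated as regex.
      if grep -qF "\"$name\": \"$version\"" "$fixture"; then
        printf '%s\n' "$name"
      fi
    done
}
```

Empty output means every oracle target differs from what the fixture currently pins; any printed package name points at a dependency that an agent would not need to change.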
144+
145+
#### Step 6: Test Your Scenario
146+
```bash
147+
# Test with specific agent and tier
148+
pnpm bench my-new-suite my-scenario L1 anthropic
149+
150+
# Test all tiers
151+
pnpm bench my-new-suite my-scenario --batch anthropic
152+
```
153+
154+
### Quality Checklist
155+
Before submitting your benchmark:
156+
157+
- [ ] Repository fixture is realistic and complete
158+
- [ ] Dependencies have intentional version mismatches
159+
- [ ] Prompts are clear and appropriately detailed for each tier
160+
- [ ] Validation commands match the project setup
161+
- [ ] Oracle answers are correct
162+
- [ ] Benchmark runs successfully with different agents
163+
- [ ] All tiers provide appropriate challenge levels
164+
- [ ] Documentation is clear and complete
165+
166+
### Proposing New Benchmarks
167+
Use our GitHub issue template to propose new benchmarks:
168+
1. Go to [GitHub Issues](https://github.com/your-org/ze-benchmarks/issues)
169+
2. Click "New Issue" → "New Benchmark Proposal"
170+
3. Fill out the template with your benchmark idea
171+
4. We'll review and help you implement it!
22172

23173
### 2. Adding Evaluators
24174
- Implement the Evaluator interface correctly
