The ceiling on AI-generated code is usually the spec, not the model. If the brief is weak, better review stages only help you fail more carefully.
Teams often optimize the visible stages of AI development: more review passes, more retries, more agents. But the code quality is still bounded by what the implementation agent was asked to build in the first place.
The ClarifyGPT study showed that spec refinement improved GPT-4's Pass@1 on MBPP from 70.96% to 80.80%, a 13.87% relative lift (p = 3.2e-05). Research on progressive prompting found 96.9% task completion versus 80.5% for direct prompting, with a Cohen's d of 1.63. The pattern is consistent: stronger task framing beats extra downstream cleanup.
Treat the spec as the primary control surface.
That means:
- define what good looks like before implementation
- constrain what downstream agents can reinterpret
- attach validation targets that survive review
This repo calls that practice Spec-Driven Development. The exact phrase is newer than the underlying ideas, but the mechanics already show up in GitHub spec-kit, Anthropic Claude Code plan mode, and research systems that use structured handoffs.
Research on MetaGPT suggests its main advantage came from SOP-style structured outputs, not role-playing personas. The handoff format matters more than whether the agent is called "architect" or "senior engineer."
Use this schema:
| Field | Why It Exists |
|---|---|
| Objective | Prevents the task from drifting into adjacent work |
| Context | Gives the constraints the agent cannot infer |
| Acceptance Criteria | Defines "done" in a testable way |
| Interface Contract | Locks the boundaries that should not change accidentally |
| Anti-patterns | Names the failure modes to actively avoid |
| Review Checklist | Gives the reviewer a fixed target |
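One way to keep the schema actionable is to carry it as structured data, so tooling can reject an incomplete brief before any agent sees it. A minimal sketch in Python; the class and field names simply mirror the table and are illustrative, not a format this repo mandates:

```python
from dataclasses import dataclass


@dataclass
class StructuredHandoff:
    """One task brief handed to an implementation agent."""
    objective: str                   # the single outcome this task delivers
    context: str                     # constraints the agent cannot infer
    acceptance_criteria: list[str]   # each entry must be independently testable
    interface_contract: list[str]    # boundaries that must not change
    anti_patterns: list[str]         # failure modes to actively avoid
    review_checklist: list[str]      # fixed targets for the reviewer

    def is_reviewable(self) -> bool:
        # A brief without testable criteria or review targets is not usable yet.
        return bool(self.acceptance_criteria) and bool(self.review_checklist)
```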
For algorithm work, add Data Integrity Requirements:
- no look-ahead leakage
- point-in-time data only
- purge + embargo splits where overlap can leak signal
- realistic transaction costs and slippage
These constraints are not detail. They are the difference between a plausible backtest and a fake edge.
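To make "purge + embargo" concrete, here is a minimal sketch of a purged split for time-ordered samples, in the style of purged cross-validation from the financial ML literature; the function name, default gap sizes, and NumPy usage are illustrative choices, not requirements from this repo:

```python
import numpy as np


def purged_kfold_indices(n_samples: int, n_splits: int = 5,
                         purge: int = 10, embargo: int = 10):
    """Yield (train_idx, test_idx) pairs for time-ordered samples.

    Training stops `purge` bars before each test window and does not resume
    until `embargo` bars after it, so labels that overlap the test period
    cannot leak future information into the fit.
    """
    fold = n_samples // n_splits
    for k in range(n_splits):
        test_start = k * fold
        test_end = n_samples if k == n_splits - 1 else (k + 1) * fold
        before = np.arange(0, max(test_start - purge, 0))
        after = np.arange(min(test_end + embargo, n_samples), n_samples)
        yield np.concatenate([before, after]), np.arange(test_start, test_end)
```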
Executable validation is the non-negotiable part.
The arXiv 2603.25773 review paper argues that AI code review without executable specifications is structurally circular. In experiments with domain-opaque bugs, review without specs ranged from 0% to 100% detection. With BDD-style scenarios used as the spec, detection reached 100%.
If the agent cannot prove it met the spec, you do not have a usable spec yet.
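A minimal sketch of what "provable" can look like: the Given/When/Then scenario lives in the test docstring and fails loudly until the implementation satisfies it. `apply_discount` here is a hypothetical stand-in for whatever the spec actually targets; in a real project it would be imported from the module under test:

```python
def apply_discount(total: float, discount: float) -> float:
    """Hypothetical stand-in for the code the spec is written against."""
    return round(max(total - discount, 0.0), 2)


def test_discount_never_produces_negative_total():
    """Given a cart total smaller than the discount,
    When the discount is applied,
    Then the charge is clamped at zero instead of going negative."""
    assert apply_discount(total=5.00, discount=10.00) == 0.0


def test_partial_discount_reduces_the_total():
    """Given a cart total larger than the discount,
    When the discount is applied,
    Then the charge drops by exactly the discount amount."""
    assert apply_discount(total=20.00, discount=7.50) == 12.50
```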
Martin Fowler's 2026 guidance is useful here because it avoids process inflation:
| Task Complexity | Spec Shape |
|---|---|
| Simple | Verbal spec + 1-2 acceptance criteria |
| Medium | Roughly one page + interface definition + test plan |
| Complex | Full Structured Handoff Schema |
The trap is over-specifying easy work. For medium tasks, a full SDD workflow can cost more in overhead than it saves over plain AI-assisted coding. Use the smallest spec that still prevents ambiguity.
For complex work, run a spec challenge before implementation. That is not a separate academic technique with a single canonical name, but functionally similar practices keep appearing because they work: one agent proposes, another tries to break assumptions, a judge narrows the disagreement before execution begins.
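A rough sketch of how that challenge round can be wired, assuming `call_model` is whatever LLM client wrapper you already have; the roles, prompts, and round count are illustrative, not a canonical protocol:

```python
from typing import Callable


def challenge_spec(draft_spec: str,
                   call_model: Callable[[str, str], str],
                   rounds: int = 2) -> str:
    """Run a short challenge/judge cycle over a proposed spec before implementation."""
    spec = draft_spec
    for _ in range(rounds):
        # One agent attacks the assumptions instead of praising the draft.
        objections = call_model(
            "challenger",
            f"List the riskiest unstated assumptions in this spec:\n{spec}",
        )
        # A judging agent narrows the disagreement and rewrites the spec.
        spec = call_model(
            "judge",
            f"Spec:\n{spec}\n\nObjections:\n{objections}\n\n"
            "Rewrite the spec so valid objections are addressed and the "
            "rest are explicitly rejected, keeping the schema intact.",
        )
    return spec
```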
The cheapest bug is the one removed before code exists.
Spec-Driven Development assumes you mostly know what "correct" looks like. If you are exploring algorithm design or search space discovery, use an Evaluator Pattern instead:
LLM proposes candidate → deterministic evaluator scores it → keep best → iterate
That is the core lesson from FunSearch in Nature 2024. The evaluator guards against confabulations because the model does not get to declare its own answer correct.
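A minimal sketch of that loop, assuming you supply `propose` (an LLM call that returns a candidate program or parameter set, optionally conditioned on the current best) and `evaluate` (a deterministic scorer the model cannot talk its way around); the names and iteration budget are illustrative:

```python
from typing import Callable, Optional


def evaluator_loop(propose: Callable[[Optional[str]], str],
                   evaluate: Callable[[str], float],
                   iterations: int = 50) -> tuple[Optional[str], float]:
    """Propose -> score deterministically -> keep the best -> iterate."""
    best_candidate, best_score = None, float("-inf")
    for _ in range(iterations):
        candidate = propose(best_candidate)   # LLM conditions on the current best
        score = evaluate(candidate)           # deterministic scorer, no self-grading
        if score > best_score:
            best_candidate, best_score = candidate, score
    return best_candidate, best_score
```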
Three failure modes show up repeatedly:
- Underspecified brief: the agent gets a broad paragraph and fills in the missing requirements by guessing. Fix: add acceptance criteria and an interface contract.
- Unverifiable "done": the work sounds clear until review starts, then nobody agrees on what "done" means. Fix: write criteria that a test or scenario can actually verify.
- Process overhead: you spend more time filling templates than solving the problem. Fix: match the process to the complexity tier.
A team needed an agent to add a reconciliation endpoint to an internal finance tool.
The weak version of the brief was:
"Add an endpoint to reconcile delayed settlements."
The Structured Handoff version was:
- Objective: Add a reconciliation endpoint for delayed settlements without mutating already-booked ledger rows
- Context: Existing system retries settlement callbacks and runs daily repair jobs
- Acceptance Criteria: Duplicate callbacks are idempotent, unmatched records are quarantined, and repair jobs emit audit events
- Interface Contract: Existing settlement schema cannot change
- Anti-patterns: No silent auto-merge, no test weakening, no hidden retry loop
- Review Checklist: Verify duplicate handling, audit trail, and rollback behavior
The implementation generated from the second spec needed less review because the ambiguity had already been removed.
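To close the loop with the executable-spec point above, the acceptance criteria in that handoff map directly onto tests. This is a sketch only: the `reconciliation` module and its interface are hypothetical stand-ins, and nothing here beyond the criteria comes from the spec.

```python
from reconciliation import handle_callback, run_repair_job  # hypothetical module under test


def test_duplicate_callbacks_are_idempotent():
    first = handle_callback(callback_id="cb-001")
    second = handle_callback(callback_id="cb-001")   # exact replay of the same callback
    assert second.ledger_mutations == 0              # no already-booked rows touched
    assert first.result == second.result


def test_unmatched_records_are_quarantined():
    outcome = handle_callback(callback_id="cb-unknown")
    assert outcome.status == "quarantined"           # never silently auto-merged


def test_repair_jobs_emit_audit_events():
    events = run_repair_job()
    assert any(event.kind == "audit" for event in events)
```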
See also: Code Review, Development Pipeline, Task Envelope