Evaluates the quality and correctness of Next.js code generated by popular AI models.
- Bun - JavaScript runtime & package manager
- pnpm - Package manager (for shared dependency management)
# Clone the repository
git clone <repository-url>
cd next-evals
# Install dependencies
pnpm install
# Show help
bun cli.ts --help

Set up your API keys:
# For LLM-based evals
export BRAINTRUST_API_KEY="your-braintrust-key"
export AI_GATEWAY_API_KEY="your-ai-gateway-key"
# For Claude Code evals
export ANTHROPIC_API_KEY="your-anthropic-key"

Note: The --dry flag is recommended for testing, as it runs evaluations locally without uploading results to Braintrust.
Run evals using various LLM models (configured in lib/models.ts):
# Show help and all available options
bun cli.ts --help
# Run a specific eval (uploads results to Braintrust)
bun cli.ts --eval 001-server-component
# Run eval locally without Braintrust upload (recommended for testing)
bun cli.ts --dry --eval 001-server-component
# Run all evals in parallel
bun cli.ts --all --dry
# Run with multiple worker threads for better performance
# (useful for large eval sets, automatically manages concurrency)
bun cli.ts --all --dry --threads 4
# Run with all models (default: only first model)
bun cli.ts --dry --eval 001-server-component --all-models
# Debug mode - keep output folders for inspection
bun cli.ts --dry --debug --eval 001-server-component
# Verbose output - see detailed logs during execution
bun cli.ts --dry --verbose --eval 001-server-component
# Create a new eval from template
bun cli.ts --create --name "my-new-eval" --prompt "Create something cool"

Run evals using Claude Code (AI coding agent):
# Run a specific eval with Claude Code
bun claude-code-cli.ts --eval 001-server-component
# Or use the main CLI with --claude-code flag
bun cli.ts --eval 001-server-component --claude-code
# Run all evals with Claude Code
bun claude-code-cli.ts --all
# With custom timeout (default: 600000ms = 10 minutes)
bun claude-code-cli.ts --eval 001-server-component --timeout 900000
# With custom API key (or use ANTHROPIC_API_KEY env var)
bun claude-code-cli.ts --eval 001-server-component --api-key sk-ant-...
# Verbose output
bun claude-code-cli.ts --eval 001-server-component --verbose
# Debug mode - keep output folders
bun claude-code-cli.ts --eval 001-server-component --debug

Run Claude Code with a Next.js dev server and lifecycle hooks (e.g., for MCP server setup):
# Run with dev server and hook scripts
bun cli.ts --eval 001-server-component --claude-code \
--with-dev-server \
--pre-eval ./scripts/eval-hooks/nextjs-mcp-pre.sh \
--post-eval ./scripts/eval-hooks/nextjs-mcp-post.sh
# Customize dev server command and port
bun cli.ts --eval 001-server-component --claude-code \
--with-dev-server \
--dev-server-cmd "pnpm dev" \
--dev-server-port 3001 \
--pre-eval ./scripts/eval-hooks/nextjs-mcp-pre.sh \
--post-eval ./scripts/eval-hooks/nextjs-mcp-post.sh

Dev Server & Hook Options:
- --with-dev-server - Start Next.js dev server before eval
- --dev-server-cmd <cmd> - Command to start server (default: "npm run dev")
- --dev-server-port <port> - Port for dev server (default: 3000)
- --pre-eval <script> - Script to run after dev server starts, before Claude runs
- --post-eval <script> - Script to run after eval completes (for cleanup)
Hook scripts receive these environment variables:
- $PORT - The port the dev server is running on
- $OUTPUT_DIR - Path to the output directory where Claude is working
- $EVAL_NAME - Name of the current eval (e.g., "001-server-component")
- $EVAL_DIR - Path to the eval directory
Example hook scripts are provided in scripts/eval-hooks/:
- nextjs-mcp-pre.sh - Configures Next.js MCP server
- nextjs-mcp-post.sh - Cleans up Next.js MCP server configuration
Each eval consists of:
- Input Directory (input/): A complete Next.js app in its initial state with failing tests
- Prompt File (prompt.md): Contains the prompt text for the LLM
- Output Directory (output/): Generated during eval run, contains LLM-modified project
- Copy: Input directory is copied to output directory
- Dev Server (optional): Start Next.js dev server if --with-dev-server is enabled
- Pre-Eval Hook (optional): Run setup script if --pre-eval is provided
- Analyze: LLM reads all project files and receives the eval prompt
- Generate: LLM provides changes as a unified diff
- Apply: Diff is applied to the output directory using git
- Validate: Project is built, linted, and tested
- Post-Eval Hook (optional): Run cleanup script if --post-eval is provided
- Score: Success is measured by build/lint/test results (binary 1.0/0.0)
- Cleanup: Output directory is removed (unless --debug is set) and the dev server is stopped
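
The Score step can be pictured with a minimal sketch. It assumes that an eval passes only when build, lint, and tests all succeed, which matches the summary table's FAIL row but is not confirmed by the source; `ValidationResult` and `scoreEval` are hypothetical names used for illustration, not the project's API.

```typescript
// Hypothetical result shape; the field names are assumptions for illustration.
interface ValidationResult {
  build: boolean;
  lint: boolean;
  test: boolean;
}

// Binary score: 1.0 only when build, lint, and tests all pass, otherwise 0.0.
// Assumption: any single failing phase fails the whole eval.
function scoreEval(result: ValidationResult): number {
  return result.build && result.lint && result.test ? 1.0 : 0.0;
}

// Example usage with one passing and one failing result.
const passing = scoreEval({ build: true, lint: true, test: true });
const failing = scoreEval({ build: false, lint: true, test: true });
console.log(passing, failing);
```

A binary score keeps results comparable across models, at the cost of hiding which phase failed; the per-phase columns in the summary table carry that detail.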
Use the --dry flag to run evaluations locally without uploading results to Braintrust:
- Results are displayed in the CLI with detailed pass/fail information
- Shows timing for build, lint, and test phases
- Displays debug output for failed evaluations
- Useful for quick testing and development
When running --all, evals execute in parallel for faster results. You can control the level of parallelism:
- Single-threaded (default): bun cli.ts --all - All evals run in the main process
- Multi-threaded: bun cli.ts --all --threads 4 - Evals run in isolated worker threads
Multi-threading benefits:
- True parallelism across CPU cores
- Memory isolation between evals
- Better resource utilization
- Fault isolation (one failing eval doesn't affect others)
- Automatically limited to available CPU cores
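
As a rough illustration of the last point, the sketch below clamps a requested --threads value to the number of available cores. The function name `effectiveThreads` and the fallback to 1 for invalid input are assumptions, not the CLI's actual behavior; in a real runner the core count might come from something like Node's `os.availableParallelism()`.

```typescript
// Hypothetical helper: clamp a requested --threads value to the available cores.
// Assumption: invalid or non-positive requests fall back to a single thread.
function effectiveThreads(requested: number, availableCores: number): number {
  if (!Number.isInteger(requested) || requested < 1) {
    return 1;
  }
  // Never exceed the core count, and never go below one worker.
  return Math.min(requested, Math.max(1, availableCores));
}

// Example usage: the core count is passed in so the sketch stays self-contained.
console.log(effectiveThreads(4, 8));
console.log(effectiveThreads(16, 8));
```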
Results are displayed in a summary table:
| Eval | Status | Build | Lint | Tests |
|--------------------------|--------|-------|-------|-------|
| 001-server-component | ✅ PASS | ✅ | ✅ | ✅ |
| 002-client-component | ❌ FAIL | ❌ | ✅ | ✅ |
Use the --debug flag to preserve output folders for inspection:
bun cli.ts --dry --debug --eval 001-server-component

This keeps the output-dry/ folders after completion, allowing you to:
- Inspect the AI-generated changes
- Debug build/lint/test failures
- Understand why an eval failed
When you create a new eval with --create, the CLI will:
- Create a new numbered directory under evals/
- Copy the template/ directory to create the input/ state
- Create a prompt.md file with your prompt text
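
How the "new numbered directory" might be chosen can be sketched as follows. This assumes a three-digit, zero-padded prefix (as in 001-server-component) and that the next number is one above the highest existing prefix; `nextEvalDirName` is a hypothetical helper, not the CLI's own function.

```typescript
// Hypothetical helper: derive the next eval directory name from existing ones.
// Assumptions: directories look like "NNN-name" and numbering continues from
// the highest existing prefix, zero-padded to three digits.
function nextEvalDirName(existing: string[], name: string): string {
  let highest = 0;
  for (const dir of existing) {
    const match = /^(\d+)-/.exec(dir);
    if (match) {
      highest = Math.max(highest, Number.parseInt(match[1] ?? "0", 10));
    }
  }
  return `${String(highest + 1).padStart(3, "0")}-${name}`;
}

// Example usage with the two evals shown in the summary table.
console.log(
  nextEvalDirName(["001-server-component", "002-client-component"], "my-new-eval"),
);
```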
The currently configured models are listed in lib/models.ts.
To add or modify models, edit the MODELS array in that file.
To add your own custom model providers, see Custom Model Provider Integration for detailed instructions on how to configure and integrate your models.
This project uses a shared dependency system where all evals use the same node_modules for consistency and easier version management.
- Shared dependencies: All 50+ evals share evals/node_modules/
- Template-based: template/package.json is the source of truth for all package versions
- Centralized updates: One script syncs versions across all evals
# Update dependencies across all evals
# 1. Edit template/package.json with new versions
# 2. Run:
bun scripts/sync-templates.ts

The sync script will:
- Copy template/package.json to all eval input/ directories
- Copy template/next.config.ts to all eval input/ directories
- Compare evals/package.json with the template
- If different: remove evals/node_modules, update, and run pnpm install
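
The compare-then-reinstall decision could look roughly like the sketch below. It assumes the comparison covers only the `dependencies` and `devDependencies` maps; the actual script may compare the whole file. `PackageJson`, `sameDeps`, and `needsReinstall` are hypothetical names for illustration.

```typescript
// Hypothetical, minimal package.json shape used only for this sketch.
type DependencyMap = Record<string, string>;

interface PackageJson {
  dependencies?: DependencyMap;
  devDependencies?: DependencyMap;
}

// Two dependency maps match when they have the same keys and the same versions.
function sameDeps(a: DependencyMap = {}, b: DependencyMap = {}): boolean {
  const aKeys = Object.keys(a);
  const bKeys = Object.keys(b);
  return aKeys.length === bKeys.length && aKeys.every((key) => a[key] === b[key]);
}

// Assumption: a reinstall is needed whenever either dependency map differs.
function needsReinstall(template: PackageJson, current: PackageJson): boolean {
  return (
    !sameDeps(template.dependencies, current.dependencies) ||
    !sameDeps(template.devDependencies, current.devDependencies)
  );
}

// Example usage: a version bump in the template triggers a reinstall.
const templatePkg: PackageJson = { dependencies: { next: "15.5.4", react: "19.1.0" } };
const evalsPkg: PackageJson = { dependencies: { next: "15.4.0", react: "19.1.0" } };
console.log(needsReinstall(templatePkg, evalsPkg));
```

Skipping the reinstall when nothing changed keeps repeated sync runs cheap, since removing and reinstalling a shared node_modules is the slow part.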
All dependencies are defined in template/package.json:
{
"dependencies": {
"next": "15.5.4",
"react": "19.1.0",
"react-dom": "19.1.0"
}
}

To update a dependency across all evals:
- Edit the version in template/package.json:
  { "dependencies": { "next": "15.4.0" } }
- Run the sync script:
  bun scripts/sync-templates.ts
- Done! All 50+ evals now use Next.js 15.4.0
The script automatically:
- Copies the updated package.json to all evals
- Detects the version change
- Removes old node_modules
- Installs fresh dependencies
For shared dependencies (used by multiple evals):
- Add to template/package.json:
  { "dependencies": { "my-new-package": "^1.0.0" } }
- Run the sync script:
  bun scripts/sync-templates.ts
Note: template/package.json contains the superset of all dependencies. Even if only one eval needs a package (like @ai-sdk/react), it's included in the template for simplicity.
The sync script also updates next.config.ts across all evals:
- Edit the template:
  # Edit this file with your desired config
  vim template/next.config.ts
- Apply to all evals:
  bun scripts/sync-templates.ts
# Update dependencies or configs across all evals
# 1. Edit template/package.json or template/next.config.ts
# 2. Run:
bun scripts/sync-templates.ts
# The script will:
# - Copy templates to all eval input/ directories
# - Detect changes and reinstall dependencies if needed
# - Show progress and summary

The template/ directory contains a basic Next.js project that serves as the starting point for all new evals. It includes:
- Next.js 15 with App Router
- TypeScript configuration
- ESLint and testing setup with Vitest
- A failing test that evals should fix
- All dependencies needed across all evals (superset)
The eval system supports lifecycle hooks that run before and after evaluations. This is useful for:
- Setting up MCP servers for Claude
- Starting additional services
- Configuring development environments
- Cleanup tasks
Hook scripts receive environment variables with context about the eval:
#!/bin/bash
# Example pre-eval hook
echo "Setting up for $EVAL_NAME"
echo "Dev server on port: $PORT"
echo "Working directory: $OUTPUT_DIR"
# Your setup logic here

Available environment variables:
- $PORT - Dev server port (if --with-dev-server is used)
- $OUTPUT_DIR - Absolute path to output directory
- $EVAL_NAME - Eval identifier (e.g., "001-server-component")
- $EVAL_DIR - Absolute path to eval directory
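
To make the contract concrete, here is a minimal sketch of how a runner might assemble these variables before spawning a hook script. `HookContext` and `buildHookEnv` are hypothetical names, and omitting PORT when no dev server is running is an assumption based on the "(if --with-dev-server is used)" note rather than confirmed behavior.

```typescript
// Hypothetical context describing the eval a hook is being run for.
interface HookContext {
  evalName: string;
  evalDir: string;
  outputDir: string;
  port?: number; // only set when a dev server was started
}

// Build the environment variables a hook script would receive.
// Assumption: PORT is left out entirely when no dev server is running.
function buildHookEnv(ctx: HookContext): Record<string, string> {
  const env: Record<string, string> = {
    EVAL_NAME: ctx.evalName,
    EVAL_DIR: ctx.evalDir,
    OUTPUT_DIR: ctx.outputDir,
  };
  if (ctx.port !== undefined) {
    env["PORT"] = String(ctx.port);
  }
  return env;
}

// Example usage with illustrative paths (not taken from the repository).
console.log(
  buildHookEnv({
    evalName: "001-server-component",
    evalDir: "/tmp/evals/001-server-component",
    outputDir: "/tmp/evals/001-server-component/output",
    port: 3000,
  }),
);
```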
The included example scripts show how to configure Claude's MCP server:
Pre-eval (scripts/eval-hooks/nextjs-mcp-pre.sh):
#!/bin/bash
echo "🔧 Setting up Next.js MCP server for $EVAL_NAME"
claude mcp add -t http nextjs-dev-$EVAL_NAME http://localhost:$PORT/_next/mcp
echo "✅ MCP server configured"

Post-eval (scripts/eval-hooks/nextjs-mcp-post.sh):
#!/bin/bash
echo "🧹 Cleaning up Next.js MCP server for $EVAL_NAME"
claude mcp remove nextjs-dev-$EVAL_NAME
echo "✅ MCP server removed"

- Create your script in scripts/eval-hooks/:
  touch scripts/eval-hooks/my-custom-pre.sh
  chmod +x scripts/eval-hooks/my-custom-pre.sh
- Add your setup logic using the environment variables
- Use in evals:
  bun cli.ts --eval 001-server-component --claude-code \
  --pre-eval ./scripts/eval-hooks/my-custom-pre.sh