This repository hosts public evaluation suites used by Clerk to test how LLMs perform at writing Clerk code (primarily in Next.js). If an AI contributor is asked to "create a new eval suite for the Waitlist feature", it should add a new folder under `src/evals/` with a `PROMPT.md` and `graders.ts`, then register it in `src/index.ts`.
Install Bun >=1.3.0, then gather the required API keys. See `.env.example`:

```sh
cp .env.example .env
```

Run the eval suite (might take about 50s):

```sh
bun i
bun start
```

For detailed, copy-pastable steps see `docs/ADDING_EVALS.md`. In short:
- Create `src/evals/your-eval/` with `PROMPT.md` and `graders.ts`.
- Implement graders that return booleans using `defineGraders(...)` and shared judges in `@/src/graders/catalog`.
- Append an entry to the `evaluations` array in `src/index.ts` with `framework`, `category`, and `path` (e.g., `evals/waitlist`); a sketch of such an entry follows this list.
- Run `bun run start:eval src/evals/your-eval` (optionally `--debug`).
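To make the registration step concrete, an entry might look like the sketch below. Only `framework`, `category`, and `path` are documented here; the surrounding array shape and the chosen category are illustrative assumptions, so mirror an existing entry in `src/index.ts` rather than copying this verbatim.

```ts
// src/index.ts (sketch): existing entries are the source of truth; this only
// illustrates the three documented fields.
export const evaluations = [
  // ...existing entries
  {
    framework: 'Next.js',
    category: 'Waitlist', // illustrative; use whatever category fits your eval
    path: 'evals/waitlist',
  },
]
```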
<details>
<summary>Example scores</summary>

```json
[
  {
    "model": "gpt-5-chat-latest",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.6666666666666666,
    "updatedAt": "2025-10-15T17:51:27.901Z"
  },
  {
    "model": "gpt-4o",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.3333333333333333,
    "updatedAt": "2025-10-15T17:51:30.871Z"
  },
  {
    "model": "claude-sonnet-4-0",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:51:56.370Z"
  },
  {
    "model": "claude-sonnet-4-5",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.8333333333333334,
    "updatedAt": "2025-10-15T17:52:03.349Z"
  },
  {
    "model": "v0-1.5-md",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 1,
    "updatedAt": "2025-10-15T17:52:06.700Z"
  },
  {
    "model": "claude-opus-4-0",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:52:06.898Z"
  },
  {
    "model": "gpt-5",
    "framework": "Next.js",
    "category": "Fundamentals",
    "value": 0.5,
    "updatedAt": "2025-10-15T17:52:07.038Z"
  }
]
```

</details>

### Debugging
```sh
# Run a single evaluation
bun run start:eval evals/apiroutes

# Run in debug mode
bun run start --debug

# Run a single evaluation in debug mode
bun run start:eval evals/apiroutes --debug
```

This project is broken up into a few core pieces:
- `src/index.ts`: the main entrypoint of the project. Evaluations, models, reporters, and the runner are registered and executed here.
- `/evals`: folders that contain a prompt and grading expectations. Runners currently assume that eval folders contain two files: `graders.ts` and `PROMPT.md`.
- `/runners`: the primary logic responsible for loading evaluations, calling provider LLMs, and outputting scores.
- `/reporters`: the primary logic responsible for sending scores somewhere (stdout, a file, etc.).
A runner takes a simple object as an argument (an example object is shown further below).
It will resolve the provider and model to the respective SDK.
It will load the designated evaluation, generate LLM text from the prompt, and pass the result to graders.
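The flow above can be sketched roughly as follows. This is not the actual runner implementation: it assumes the Vercel AI SDK's `generateText`, and `resolveModel` / `loadEvaluation` are hypothetical stand-ins for whatever the real runner does internally.

```ts
import { generateText, type LanguageModel } from 'ai'

// Hypothetical helpers: stand-ins for the real runner's internals.
declare function resolveModel(provider: string, model: string): LanguageModel
declare function loadEvaluation(evalPath: string): Promise<{
  prompt: string
  graders: Record<string, (output: string) => boolean | Promise<boolean>>
}>

interface RunnerInput {
  provider: string
  model: string
  evalPath: string
}

export async function runEvaluation({ provider, model, evalPath }: RunnerInput): Promise<number> {
  // Resolve the provider/model pair to the respective SDK model.
  const sdkModel = resolveModel(provider, model)

  // Load PROMPT.md and graders.ts from the evaluation folder.
  const { prompt, graders } = await loadEvaluation(evalPath)

  // Generate the model's answer and hand it to each grader.
  const { text } = await generateText({ model: sdkModel, prompt })
  const results = await Promise.all(Object.values(graders).map((grade) => grade(text)))

  // The score is the fraction of graders that passed (0..1).
  return results.filter(Boolean).length / results.length
}
```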
At the moment, evaluations are simply folders that contain:
- `PROMPT.md`: the instruction we're evaluating the model's output against (a hypothetical example follows this list)
- `graders.ts`: a module containing grader functions which return `true`/`false`, signalling whether the model's output passed or failed. This is essentially our acceptance criteria.
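For instance, a `PROMPT.md` for the waitlist example from the introduction might read something like the following. This is a made-up prompt for illustration, not one from the repository.

```md
Add a waitlist to an existing Next.js App Router application using Clerk.
Include the middleware setup and a page that renders Clerk's `<Waitlist />` component.
```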
Shared grader primitives live in `src/graders/index.ts`. Use them to declare new checks with a consistent, terse shape:
```ts
import { contains, defineGraders, judge } from '@/src/graders'
import { llmChecks } from '@/src/graders/catalog'

export const graders = defineGraders({
  references_middleware: contains('middleware.ts'),
  package_json: llmChecks.packageJsonClerkVersion,
  custom_flow_description: judge(
    'Does the answer walk through protecting a Next.js API route with Clerk auth() and explain the response states?',
  ),
})
```

- `contains` / `containsAny`: case-insensitive substring checks by default
- `matches`: regex checks
- `judge`: thin wrappers around the LLM-as-judge scorer. Shared prompts live in `src/graders/catalog.ts`; add new reusable prompts there.
- `defineGraders`: preserves type inference for the exported `graders` record.
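The example above doesn't use `containsAny` or `matches`; a grader using them might look like the sketch below. The exact signatures are assumptions (check `src/graders/index.ts`), and the check names and patterns are made up for illustration.

```ts
import { containsAny, defineGraders, matches } from '@/src/graders'

export const graders = defineGraders({
  // Assumed: containsAny accepts a list of acceptable substrings.
  mentions_clerk_provider: containsAny(['<ClerkProvider>', 'ClerkProvider']),
  // Assumed: matches accepts a regular expression.
  imports_clerk_nextjs: matches(/from ['"]@clerk\/nextjs/),
})
```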
For a given model and evaluation, we'll retrieve a score from 0 to 1: the fraction of grader functions that passed. For example, an output that passes four of six graders scores about 0.67.
At the moment, we employ two minimal reporters.
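As a loose illustration of what a reporter does, here is a minimal stdout reporter. The `Score` and `Reporter` shapes are assumptions modeled on the example scores shown earlier, not the repository's actual interfaces.

```ts
// Assumed shapes, modeled on the example scores JSON above.
interface Score {
  model: string
  framework: string
  category: string
  value: number
  updatedAt: string
}

interface Reporter {
  report(scores: Score[]): void | Promise<void>
}

// One of the simplest possible reporters: print the scores to stdout.
export const stdoutReporter: Reporter = {
  report(scores) {
    console.log(JSON.stringify(scores, null, 2))
  },
}
```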
For the notable interfaces, see /interfaces.

{ "provider": "openai", "model": "gpt-5", "evalPath": "/absolute/path/to/clerk-evals/src/evals/basic-nextjs }