NextJS Evals Review #21

sean-kuzco · 2025-11-21T21:33:33Z

Summary

Review and create solutions for all of the NextJS evals.

For the first 20, I completed them manually using the Codex CLI, with Claude Code for follow-up. For the final 30, I used Claude Code with a system prompt.

All the eval solutions pass all checks (lint, build and test).

Details

This PR contains:

- Code changes to some fragile test files, to make them more robust.
- Markdown documents detailing how the solutions were implemented.

Takeaways

Out of the box, this eval approach has a few aspects which make it hard to score well:

- Some prompts are ambiguous.
- Some tests had fragile or hard-coded assumptions.
- The AI "agent" has to one-shot solutions, with a limited system prompt.

Claude Code, with minimal supporting instructions, was able to get all of these evals to a 100% passing rate with minimal effort.

I expect a SOTA coding agent (Codex, Claude Code, etc.) with a robust system prompt, access to search NextJS documentation for the NextJS version it's working with, and a tool-use loop (e.g. it can run lint/build/test to evaluate code changes) would perform extremely well on the task of writing NextJS code.
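To make the "tool-use loop" idea concrete, here is a minimal sketch of what such a loop might look like. It is not the harness used for this PR: the names `runToolLoop`, `CheckRunner`, and `Reviser` are hypothetical, and the check runner and the agent's revise step are injected as plain functions so the control flow can be read (and tested) without a real project or model.

```typescript
// Hypothetical sketch of a lint/build/test feedback loop. None of these
// names come from the PR; they exist only for illustration.
type CheckName = "lint" | "build" | "test";

interface CheckResult {
  name: CheckName;
  passed: boolean;
  output: string; // captured tool output, fed back to the agent on failure
}

// Runs one check. A real runner might shell out to the project's scripts,
// for example with Node's child_process.spawnSync (an assumption, not shown).
type CheckRunner = (name: CheckName) => CheckResult;

// Asks the agent to revise the code, given the failures from the last round.
type Reviser = (failures: CheckResult[]) => void;

interface LoopOutcome {
  passed: boolean;
  attempts: number;
  lastFailures: CheckResult[];
}

const CHECKS: CheckName[] = ["lint", "build", "test"];

function runToolLoop(
  runCheck: CheckRunner,
  revise: Reviser,
  maxAttempts = 3,
): LoopOutcome {
  let lastFailures: CheckResult[] = [];

  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const results: CheckResult[] = [];
    for (const name of CHECKS) {
      const result = runCheck(name);
      results.push(result);
      // Stop at the first failing check; later checks usually depend on it.
      if (!result.passed) break;
    }

    lastFailures = results.filter((r) => !r.passed);
    if (lastFailures.length === 0) {
      return { passed: true, attempts: attempt, lastFailures };
    }

    // Only ask for a revision if there is another attempt left to verify it.
    if (attempt < maxAttempts) revise(lastFailures);
  }

  return { passed: false, attempts: maxAttempts, lastFailures };
}
```

The contrast with the one-shot setup described above is the `revise` call: the agent sees the failing check's output and gets another attempt, up to `maxAttempts`.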

I think crafting an eval that such an agent could not pass would be exceedingly hard.

sean-kuzco added 7 commits November 21, 2025 13:33

Initial manual eval review (95bb3b0)
Remove script (0fbfa3a)
Add solutions for remaining evals (c6d051a)
Rename script (d9fac36)
Remove temporary file (4b37611)
Remove temporary file (52b83aa)
Update gitignore (14336d1)

sean-kuzco marked this pull request as ready for review November 21, 2025 22:26

Fix formatting (17c535a)
