NextJS Evals Review #21
Open
Summary
Review and create solutions for all of the NextJS evals.
For the first 20, I completed them manually using the Codex CLI, with Claude Code for follow-up. For the final 30, I completed them using Claude Code with a system prompt.
All the eval solutions pass all checks (lint, build and test).
Details
This PR contains:
Takeaways
Out of the box, this eval approach has a few aspects which make it hard to score well:
Claude Code, with minimal supporting instructions, was able to get all of these evals to a 100% passing rate with minimal effort.
I expect a SOTA coding agent (Codex, Claude Code, etc.) with a robust system prompt, access to search NextJS documentation for the NextJS version it's working with, and a tool-use loop (e.g. it can run lint/build/test to evaluate code changes) would perform extremely well on the task of writing NextJS code.
I think crafting an eval that such an agent could not pass would be exceedingly hard.
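As a rough illustration of the lint/build/test feedback step mentioned above, here is a minimal TypeScript sketch. It is not the harness used in this PR: the `npm run lint`, `npm run build`, and `npm run test` script names are assumptions about how an eval project might be set up, and `Check`, `runChecks`, `makeSpawnRunner`, and `allPassed` are hypothetical names introduced only for this example.

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical types for this sketch; not part of the evals repo.
type Check = { name: string; command: string; args: string[] };
type CheckResult = { name: string; passed: boolean };
type Runner = (check: Check) => boolean;

// Assumed npm script names; adjust to whatever the project actually defines.
const CHECKS: Check[] = [
  { name: "lint", command: "npm", args: ["run", "lint"] },
  { name: "build", command: "npm", args: ["run", "build"] },
  { name: "test", command: "npm", args: ["run", "test"] },
];

// Builds a runner that executes each check in `cwd` and treats exit code 0 as a pass.
function makeSpawnRunner(cwd: string): Runner {
  return (check) =>
    spawnSync(check.command, check.args, { cwd, stdio: "inherit" }).status === 0;
}

// Runs every check (no early exit) so the agent sees all failures at once.
function runChecks(checks: Check[], run: Runner): CheckResult[] {
  return checks.map((check) => ({ name: check.name, passed: run(check) }));
}

// True only when every check passed.
function allPassed(results: CheckResult[]): boolean {
  return results.every((result) => result.passed);
}
```

An agent loop could call `runChecks(CHECKS, makeSpawnRunner(projectDir))` after each code change, where `projectDir` is the eval's working directory, and keep iterating until `allPassed` returns true. Passing the runner in as a parameter keeps the pass/fail logic testable without spawning real processes.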