@sean-kuzco sean-kuzco commented Nov 21, 2025

Summary

Review and create solutions for all of the Next.js evals.

For the first 20, I completed them manually using the Codex CLI with a Claude Code follow-up. For the final 30, I completed them with a system prompt and Claude Code.

All of the eval solutions pass all checks (lint, build, and test).

Details

This PR contains:

  • Code diffs for fragile test files that were changed to be more robust.
  • Markdown documents detailing how the solutions were implemented.

Takeaways

Out of the box, this eval approach has a few aspects which make it hard to score well:

  • Some prompts are ambiguous.
  • Some tests have fragile, hard-coded assumptions.
  • The AI "agent" has to one-shot solutions, with a limited system prompt.

Claude Code, given only minimal supporting instructions, was able to bring all of these evals to a 100% pass rate with little effort.

I expect a SOTA coding agent (Codex, Claude Code, etc.) with a robust system prompt, access to search the Next.js documentation for the Next.js version it's working with, and a tool-use loop (e.g. it can run lint/build/test to evaluate its code changes) would perform extremely well on the task of writing Next.js code.
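The verification half of such a tool-use loop can be sketched as follows (a hypothetical helper, not code from this PR; the `npm` commands in the comment are assumed project scripts):

```typescript
import { spawnSync } from "node:child_process";

// Hypothetical sketch of the verify step in an agent's tool-use loop:
// run each check in order, stop at the first failure, and report which
// command failed so the agent can feed that back into its next edit.
function runChecks(commands: string[][]): { ok: boolean; failed?: string } {
  for (const [cmd, ...args] of commands) {
    const result = spawnSync(cmd, args, { encoding: "utf8" });
    // A non-zero (or null) exit status means this check failed.
    if (result.status !== 0) {
      return { ok: false, failed: [cmd, ...args].join(" ") };
    }
  }
  return { ok: true };
}

// After each code change, the agent would run something like:
// runChecks([["npm", "run", "lint"], ["npm", "run", "build"], ["npm", "test"]]);
```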

I think crafting an eval that such an agent could not pass would be exceedingly hard.

@sean-kuzco sean-kuzco marked this pull request as ready for review November 21, 2025 22:26