Summary
The leaderboard has a column labeled "Model" but it displays agent system names, not model names.
Details
Both Anthropic and OpenAI documentation explain that SWE-bench evaluates complete agent systems, not just models:
From Anthropic (source):
"SWE-bench doesn't just evaluate the AI model in isolation, but rather an entire 'agent' system. In this context, an 'agent' refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt. The performance of an agent on SWE-bench can vary significantly based on this scaffolding, even when using the same underlying AI model."
From OpenAI (source):
"Models operate within scaffolds: frameworks providing specialized tools, structured prompts, and controlled environments that enable models to edit files and navigate directories"
Current leaderboard
The "Model" column shows entries like:
- TRAE ← Agent system name, not a model
- OpenHands ← Agent system name, not a model
- JoyCode ← Same here
- Refact.ai Agent ← Same here
- Warp ← Same here
- OpenHands + GPT-5 ← Agent + model
Only some entries include the actual model name, making it unclear what the standard is.
Why this matters
- Misleading header: Users think they're comparing models, but they're comparing agent systems (scaffolding + model)
- Missing information: Users can't tell which model powers each agent without digging into metadata files
- Inconsistent standards: The checklist requires model disclosure (lines 77-78), but the leaderboard display doesn't reflect it
Examples
| Display Name | What's actually being evaluated | Model used |
|---|---|---|
| TRAE | TRAE agent system | 4 models: Claude 4 Sonnet/Opus/3.7 + Gemini 2.5 |
| OpenHands | OpenHands agent system | Claude 3.7 (only mentioned in README) |
| OpenHands + GPT-5 | OpenHands agent system | GPT-5 ✅ |
Proposed solutions
Option 1 (simplest): Rename column header
"Model" → "Agent System" or "Agent"
Option 2: Add agent column
| Agent | Model(s) | ... |
Option 3: Require model in display name
Format: <Agent> + <Model(s)>
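A minimal sketch of how Option 3 could work, assuming each submission exposes an agent name and a list of model names. The function `display_name` and its parameters are hypothetical and do not come from the leaderboard's actual code or metadata schema:

```python
def display_name(agent: str, models: list[str]) -> str:
    """Build a leaderboard label of the form '<Agent> + <Model(s)>'.

    `agent` and `models` are hypothetical inputs; the real submission
    metadata may name or structure these fields differently.
    """
    if not models:
        # No model disclosed: fall back to the agent name alone.
        return agent
    return f"{agent} + {', '.join(models)}"


print(display_name("OpenHands", ["GPT-5"]))
```

Falling back to the bare agent name keeps existing entries valid, though a stricter variant could reject submissions that list no model.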
References
- Anthropic on scaffolding: https://www.anthropic.com/engineering/swe-bench-sonnet
- OpenAI on scaffolds: https://openai.com/index/introducing-swe-bench-verified/
- Checklist requirement: checklist.md lines 77-78, from the Experiments Repo.