[Leaderboard] Column "Model" displays Agent Systems #40

@Chesars

Summary

The leaderboard has a column labeled "Model" but it displays agent system names, not model names.

Details

Both Anthropic and OpenAI documentation explain that SWE-bench evaluates complete agent systems, not just models:

From Anthropic (source):

"SWE-bench doesn't just evaluate the AI model in isolation, but rather an entire 'agent' system. In this context, an 'agent' refers to the combination of an AI model and the software scaffolding around it. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt. The performance of an agent on SWE-bench can vary significantly based on this scaffolding, even when using the same underlying AI model."

From OpenAI (source):

"Models operate within scaffolds: frameworks providing specialized tools, structured prompts, and controlled environments that enable models to edit files and navigate directories"

Current leaderboard

The "Model" column shows entries like:

  • TRAE ← Agent system name, not a model
  • OpenHands ← Agent system name, not a model
  • JoyCode ← Same here
  • Refact.ai Agent ← Same here
  • Warp ← Same here
  • OpenHands + GPT-5 ← Agent + model

Only some entries include the actual model name, making it unclear what the standard is.

Why this matters

  1. Misleading header: Users think they're comparing models, but they're comparing agent systems (scaffolding + model)

  2. Missing information: Users can't tell which model powers each agent without diving into metadata files

  3. Inconsistent standards: The submission checklist requires model disclosure (lines 77-78), but the leaderboard display does not show it

Examples

| Display Name | What's actually being evaluated | Model used |
| --- | --- | --- |
| TRAE | TRAE agent system | 4 models: Claude 4 Sonnet/Opus/3.7 + Gemini 2.5 |
| OpenHands | OpenHands agent system | Claude 3.7 (only mentioned in README) |
| OpenHands + GPT-5 | OpenHands agent system | GPT-5 ✅ |

Proposed solutions

Option 1 (simplest): Rename column header
"Model" → "Agent System" or "Agent"

Option 2: Add agent column
| Agent | Model(s) | ... |
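
A rough sketch of what Option 2 could look like on the data side, in Python. This is not the leaderboard's actual code: the `Entry` class, its `agent` and `models` fields, and the `render_rows` helper are all hypothetical names made up for illustration. The idea is only that each submission carries the agent name and the model list as separate fields, and that a missing model is shown explicitly instead of being hidden.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Entry:
    # Hypothetical fields, not the leaderboard's real schema.
    agent: str                                        # agent system / scaffold name
    models: List[str] = field(default_factory=list)   # underlying model name(s)


def render_rows(entries: List[Entry]) -> str:
    """Render entries as a table with separate Agent and Model(s) columns."""
    lines = ["| Agent | Model(s) |", "| --- | --- |"]
    for e in entries:
        # Make a missing model visible rather than silently leaving it blank.
        models = ", ".join(e.models) if e.models else "(not disclosed)"
        lines.append(f"| {e.agent} | {models} |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_rows([Entry("OpenHands", ["GPT-5"]), Entry("Warp")]))
```

With separate fields, the "Model(s)" column can be sorted or filtered independently of the agent name.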

Option 3: Require model in display name
Format: <Agent> + <Model(s)>
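
If Option 3 were adopted, the format could be checked automatically at submission time. Below is a minimal Python sketch under stated assumptions: `parse_display_name` and `DisplayName` are hypothetical names, the ` + ` separator is taken from the format above, and splitting multiple models on commas is my own assumption, since the issue does not define how several models would be listed.

```python
from typing import NamedTuple, Optional, Tuple


class DisplayName(NamedTuple):
    agent: str
    models: Tuple[str, ...]


def parse_display_name(name: str) -> Optional[DisplayName]:
    """Split '<Agent> + <Model(s)>' into parts; return None if no model is given."""
    agent, sep, models = name.partition(" + ")
    if not sep or not agent.strip() or not models.strip():
        return None  # e.g. "TRAE": agent name only, model missing
    # Assumption: multiple models are comma-separated.
    parsed = tuple(m.strip() for m in models.split(",") if m.strip())
    return DisplayName(agent.strip(), parsed)


if __name__ == "__main__":
    for candidate in ["OpenHands + GPT-5", "TRAE"]:
        print(candidate, "->", parse_display_name(candidate))
```

A check like this would accept "OpenHands + GPT-5" and flag entries such as "TRAE" that name only the agent.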
