[Leaderboard] Column "Model" displays Agent Systems #40


## Summary
  The leaderboard has a column labeled "**Model**" but it displays **agent system names**, not model names.

  ## Details

  Both Anthropic and OpenAI documentation explain that SWE-bench evaluates complete agent systems, not just models:

  **From Anthropic** ([source](https://www.anthropic.com/engineering/swe-bench-sonnet)):
  > "SWE-bench doesn't just evaluate the AI model in isolation, but rather an entire 'agent' system. In this context, an 'agent' refers to the **combination of an AI model and the software scaffolding around it**. This scaffolding is responsible for generating the prompts that go into the model, parsing the model's output to take action, and managing the interaction loop where the result of the model's previous action is incorporated into its next prompt. **The performance of an agent on SWE-bench can vary significantly based on this scaffolding, even when using the same underlying AI model.**"

  **From OpenAI** ([source](https://openai.com/index/introducing-swe-bench-verified/)):
  > "Models operate within scaffolds: frameworks providing specialized tools, structured prompts, and controlled environments that enable models to edit files and navigate directories"

  ## Current leaderboard

  The "Model" column shows entries like:
  - `TRAE` ← Agent system name, not a model
  - `OpenHands` ← Agent system name, not a model
  - `JoyCode` ← Agent system name, not a model
  - `Refact.ai Agent` ← Agent system name, not a model
  - `Warp` ← Agent system name, not a model
  - `OpenHands + GPT-5` ← Agent system name plus model name

  **Only some entries include the actual model name**, making it unclear what the standard is.
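To make the inconsistency concrete, here is a minimal sketch that flags display names with no model in them. It assumes a name discloses its model only when it follows the `Agent + Model` pattern seen in `OpenHands + GPT-5`. The helper `split_display_name` is hypothetical and not part of the leaderboard code.

```python
from typing import Optional, Tuple


def split_display_name(name: str) -> Tuple[str, Optional[str]]:
    """Split 'Agent + Model' into (agent, model).

    Assumption: ' + ' is the separator between agent and model.
    The model is None when the name carries no model part.
    """
    agent, _, model = name.partition(" + ")
    return agent.strip(), (model.strip() or None)


# Display names as listed in this issue.
entries = [
    "TRAE",
    "OpenHands",
    "JoyCode",
    "Refact.ai Agent",
    "Warp",
    "OpenHands + GPT-5",
]

# Entries whose display name does not say which model was used.
missing_model = [n for n in entries if split_display_name(n)[1] is None]
print(missing_model)
```

Under that assumption, five of the six names listed above disclose no model.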

  ## Why this matters

  1. **Misleading header**: Users think they're comparing models, but they're comparing agent systems (scaffolding + model)

  2. **Missing information**: Users can't tell which model powers each agent without diving into metadata files

  3. **Inconsistent standards**: The checklist requires model disclosure (lines 77-78), but the leaderboard display does not reflect it

  ## Examples

  | Display Name | What's actually being evaluated | Model used |
  |--------------|-------------------------------|------------|
  | `TRAE` | TRAE agent system | 4 models: Claude 4 Sonnet/Opus/3.7 + Gemini 2.5 |
  | `OpenHands` | OpenHands agent system | Claude 3.7 (only mentioned in README) |
  | `OpenHands + GPT-5` | OpenHands agent system | GPT-5 ✅ |

  ## Proposed solutions

  **Option 1 (simplest):** Rename the column header: `"Model"` → `"Agent System"` or `"Agent"`

  **Option 2:** Add a separate agent column: `| Agent | Model(s) | ... |`

  **Option 3:** Require the model in the display name, using the format `<Agent> + <Model(s)>`
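As a rough illustration of Options 2 and 3, the sketch below stores the agent and its models separately and derives both a two-column row and an `<Agent> + <Model(s)>` display name from the same record. The class `LeaderboardEntry`, its field names, and the `", "` separator for multiple models are all assumptions made for this example. None of them come from the existing leaderboard code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LeaderboardEntry:
    """Hypothetical record keeping the agent and its model(s) apart."""

    agent: str
    models: List[str] = field(default_factory=list)

    def display_name(self) -> str:
        # Option 3: '<Agent> + <Model(s)>'; reject entries with no model.
        if not self.models:
            raise ValueError(f"{self.agent}: at least one model must be disclosed")
        return f"{self.agent} + {', '.join(self.models)}"

    def as_row(self) -> str:
        # Option 2: separate 'Agent' and 'Model(s)' columns.
        return f"| {self.agent} | {', '.join(self.models)} |"


entry = LeaderboardEntry(agent="OpenHands", models=["GPT-5"])
print(entry.display_name())  # OpenHands + GPT-5
print(entry.as_row())        # | OpenHands | GPT-5 |
```

Keeping the model list as structured data means either display option can be generated from it. A submission with no disclosed model fails early instead of showing up under a bare agent name.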

  ## References

  - Anthropic on scaffolding: https://www.anthropic.com/engineering/swe-bench-sonnet
  - OpenAI on scaffolds: https://openai.com/index/introducing-swe-bench-verified/
  - Checklist requirement: `checklist.md`, lines 77-78, in the experiments repo
