
Conversation

@SumanthRH
Contributor

What does this PR do?

An improved version of #487. In summary, this PR adds a minimal evaluation script, swebench_eval, to mini-swe-agent for evaluating SWEBench trajectories.

Why is this needed?

The current way to evaluate trajectories for the SWEBench task is to save the generated trajectories and run them with the run_evaluation script in the SWEBench repo: https://github.com/SWE-bench/SWE-bench

There are a few reasons to have evaluation functionality in mini-swe-agent itself:

  1. Lack of flexible evaluation implementations: While the SWEBench repo is great, it is currently inflexible, supporting only the Docker backend. Remote evaluation with sb-cli is a good step toward making evaluation more accessible, but it is ultimately not the best solution, given that
    (a) users want the flexibility to scale out evaluations on their own infra as needed, which is especially desirable for RL, and
    (b) it can be hard to maintain (from [SWEBench] Add support for SWEBench evaluation #487 (comment)).

  2. Lack of minimal evaluation implementations: Similar to how Mini-SWE-Agent has provided a great reference for implementing a minimal AI agent, it is desirable to have a standalone, minimal evaluation implementation that future research can build on and adapt to other tasks.

  3. Unified generation + evaluation for RL: For RL training, it is desirable to have a single framework that can support generation as well as evaluation.

Based on the above reasons, and given Mini-SWE-Agent's flexibility with different backends like Docker, Apptainer, Bubblewrap, etc., we think it makes sense to have evaluation functionality in Mini-SWE-Agent. The primary goal is to let users easily evaluate SWEBench trajectories on their own infra. The implementation is also kept minimal to enable future research and easy experimentation.

Implementation

This PR adds a swebench-eval CLI to mini-extra:

mini-extra swebench-eval --help

Assumptions

We assume the following:

  1. Users have previously generated trajectories for a SWEBench-like task with mini-extra and have an output directory with a preds.json file.
  2. Users have prepared their dataset to include an eval_script column that contains the evaluation script to run for that instance. For reference: https://huggingface.co/datasets/sumanthrh/swe-bench_verified
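
For illustration, here is a minimal sketch of how these assumptions could be checked before running evaluation. It assumes the Hugging Face datasets library and the dataset/split from the example below; the exact schema of preds.json is whatever mini-extra wrote during generation, so this only checks that the file exists and counts its entries.

import json
from pathlib import Path

from datasets import load_dataset

output_dir = Path("/path/to/output/dir")  # directory produced by the generation run
preds = json.loads((output_dir / "preds.json").read_text())
print(f"Found predictions for {len(preds)} instances")

# Assumption 2: the evaluation dataset must carry an eval_script column.
ds = load_dataset("SumanthRH/SWE-Bench_Verified", split="test")
assert "eval_script" in ds.column_names, "dataset is missing the eval_script column"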

Overview

swebench-eval performs the following for each instance:

  • Load the trajectory from the preds.json file.
  • Initialize the environment for the given instance using the given backend.
  • Apply the model's git patch to the working directory in the environment.
  • Run the evaluation script for the instance. If the script runs successfully, the instance is considered resolved; otherwise it is unresolved.
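
To make the control flow concrete, here is a hedged pseudocode sketch of the per-instance loop. The helper names (get_environment, apply_patch) and the model_patch field are hypothetical and only mirror the steps listed above; they are not the actual functions in this PR.

# Hypothetical sketch of the per-instance evaluation loop described above.
def evaluate_instance(instance: dict, pred: dict, environment_class: str) -> bool:
    env = get_environment(environment_class, instance)  # e.g. docker / singularity backend
    try:
        apply_patch(env, pred["model_patch"])            # apply the model's git patch
        result = env.execute(instance["eval_script"])    # run the instance's eval script
        return result.returncode == 0                    # exit code 0 => resolved
    finally:
        env.cleanup()

resolved = {
    iid: evaluate_instance(inst, preds[iid], "singularity")
    for iid, inst in instances.items()
    if iid in preds
}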

Example usage

uv run --isolated --extra full -- mini-extra swebench-eval --model openai/gpt-5-nano --dataset SumanthRH/SWE-Bench_Verified --split test --workers 1 --output /path/to/output/dir --environment-class singularity --config '/path/to/swebench_gpt5.yaml'
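
Since a stated motivation is unified generation + evaluation for RL, the command can also be invoked programmatically. Below is a small sketch that shells out to swebench-eval with the same flags as the command above; the paths and flag values are placeholders, and the wrapper function itself is purely illustrative.

import subprocess

def run_swebench_eval(output_dir: str, config: str, workers: int = 1) -> None:
    # Mirrors the CLI invocation shown above; adjust flags for your setup.
    cmd = [
        "mini-extra", "swebench-eval",
        "--model", "openai/gpt-5-nano",
        "--dataset", "SumanthRH/SWE-Bench_Verified",
        "--split", "test",
        "--workers", str(workers),
        "--output", output_dir,
        "--environment-class", "singularity",
        "--config", config,
    ]
    subprocess.run(cmd, check=True)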

Design proposed by @pcmoritz

@SumanthRH marked this pull request as ready for review September 3, 2025 00:00

 - repo: https://github.com/crate-ci/typos
-  rev: v1
+  rev: v1.35.6
I currently observe pre-commit failures locally, probably due to issues with the latest typos package (1.35.7), so this pins the rev.
