
Conversation

@SumanthRH
Contributor

What does this PR do?

An improved version of #487. In summary, this PR adds a minimal evaluation script, swebench_eval, to mini-swe-agent for evaluating SWEBench trajectories.

Why is this needed?

The current way to evaluate trajectories for the SWEBench task is to save the generated trajectories and run them with the run_evaluation script in the SWEBench repo: https://github.com/SWE-bench/SWE-bench

There are a few reasons to have evaluation functionality in mini-swe-agent itself:

  1. Lack of flexible evaluation implementations: While the SWEBench repo is great, it is currently inflexible, supporting only the Docker backend. Remote evaluation with sb-cli is a good step toward making evaluation more accessible, but it is ultimately not the best solution, given that
    (a) users want the flexibility to scale out evaluations on their own infra as needed, which is especially desirable for RL, and
    (b) it can be hard to maintain (from [SWEBench] Add support for SWEBench evaluation #487 (comment)).

  2. Lack of minimal evaluation implementations: Similar to how Mini-SWE-Agent has provided a great reference for implementing a minimal AI agent, it is desirable to have a standalone, minimal evaluation implementation that future research can build on and adapt to other tasks.

  3. Unified generation + evaluation for RL: For RL training, it is desirable to have a single framework that can support generation as well as evaluation.

Based on the above reasons, and given Mini-SWE-Agent's flexibility with different backends like Docker, Apptainer, Bubblewrap, etc., we think it makes sense to have evaluation functionality in Mini-SWE-Agent. The primary goal is to let users easily evaluate SWEBench trajectories on their own infra. The implementation is also kept minimal to enable future research and easy experimentation.

Implementation

This PR adds a swebench-eval CLI to mini-extra:

mini-extra swebench-eval --help

Assumptions

We assume the following:

  1. Users have previously generated trajectories for a SWEBench-like task with mini-extra and have an output directory with a preds.json file.
  2. Users have prepared their dataset to include an eval_script column that contains the evaluation script to run for that instance. For reference: https://huggingface.co/datasets/sumanthrh/swe-bench_verified
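
For illustration, here is a minimal sketch of how these assumptions could be checked before running evaluation. It assumes the Hugging Face datasets library and the dataset/split from the example below; the exact schema of preds.json is whatever mini-extra wrote during generation, so this only checks that the file exists and counts its entries.

import json
from pathlib import Path

from datasets import load_dataset

output_dir = Path("/path/to/output/dir")  # directory produced by the generation run
preds = json.loads((output_dir / "preds.json").read_text())
print(f"Found predictions for {len(preds)} instances")

# Assumption 2: the evaluation dataset must carry an eval_script column.
ds = load_dataset("SumanthRH/SWE-Bench_Verified", split="test")
assert "eval_script" in ds.column_names, "dataset is missing the eval_script column"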

Overview

swebench-eval performs the following for each instance:

  • Load the trajectory from the preds.json file.
  • Initialize the environment for the given instance using the given backend.
  • Apply the model's git patch to the working directory in the environment.
  • Run the evaluation script for the instance. If the script runs successfully, the instance is considered resolved; otherwise it is unresolved.
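
To make the control flow concrete, here is a hedged pseudocode sketch of the per-instance loop. The helper names (get_environment, apply_patch) and the model_patch field are hypothetical and only mirror the steps listed above; they are not the actual functions in this PR.

# Hypothetical sketch of the per-instance evaluation loop described above.
def evaluate_instance(instance: dict, pred: dict, environment_class: str) -> bool:
    env = get_environment(environment_class, instance)  # e.g. docker / singularity backend
    try:
        apply_patch(env, pred["model_patch"])            # apply the model's git patch
        result = env.execute(instance["eval_script"])    # run the instance's eval script
        return result.returncode == 0                    # exit code 0 => resolved
    finally:
        env.cleanup()

resolved = {
    iid: evaluate_instance(inst, preds[iid], "singularity")
    for iid, inst in instances.items()
    if iid in preds
}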

Example usage

uv run --isolated --extra full -- mini-extra swebench-eval --model openai/gpt-5-nano --dataset SumanthRH/SWE-Bench_Verified --split test --workers 1 --output /path/to/output/dir --environment-class singularity --config '/path/to/swebench_gpt5.yaml'
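
Since a stated motivation is unified generation + evaluation for RL, the command can also be invoked programmatically. Below is a small sketch that shells out to swebench-eval with the same flags as the command above; the paths and flag values are placeholders, and the wrapper function itself is purely illustrative.

import subprocess

def run_swebench_eval(output_dir: str, config: str, workers: int = 1) -> None:
    # Mirrors the CLI invocation shown above; adjust flags for your setup.
    cmd = [
        "mini-extra", "swebench-eval",
        "--model", "openai/gpt-5-nano",
        "--dataset", "SumanthRH/SWE-Bench_Verified",
        "--split", "test",
        "--workers", str(workers),
        "--output", output_dir,
        "--environment-class", "singularity",
        "--config", config,
    ]
    subprocess.run(cmd, check=True)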

Design proposed by @pcmoritz

@SumanthRH marked this pull request as ready for review September 3, 2025 00:00

 - repo: https://github.com/crate-ci/typos
-  rev: v1
+  rev: v1.35.6
I currently observe pre-commit failures locally, probably due to issues with the latest typos package (1.35.7), so this pins the rev.
