[SWEBench] Add evaluation support for SWEBench #501
Open
SumanthRH wants to merge 16 commits into SWE-agent:main from SumanthRH:minimal-eval-support
Signed-off-by: SumanthRH <[email protected]>
pcmoritz reviewed Sep 2, 2025
SumanthRH commented Sep 3, 2025
  - repo: https://github.com/crate-ci/typos
-   rev: v1
+   rev: v1.35.6
I currently observe pre-commit failures locally, probably due to issues with the latest release of the typos package, 1.35.7.
What does this PR do?
An improved version of #487. In summary, this PR adds a minimal evaluation script `swebench_eval` to `mini-swe-agent` to evaluate trajectories for SWEBench.

Why is this needed?
The current way of evaluating trajectories for the SWEBench task is to save the generated trajectories and run them with the `run_evaluation` script in the SWEBench repo: https://github.com/SWE-bench/SWE-bench

There are a couple of reasons to have evaluation functionality in `mini-swe-agent` itself:

- Lack of flexible evaluation implementations: While the SWEBench repo is great, it is inflexible at the moment, with support for only the Docker backend. Remote evaluation support with `sb-cli` is a good step in making evaluation more accessible, but it is ultimately not the best solution given that
  (a) users want the flexibility to scale out evaluations on their own infra as needed. This is especially desirable for RL.
  (b) it can be hard to maintain (from "[SWEBench] Add support for SWEBench evaluation" #487 (comment))
- Lack of minimal evaluation implementations: Similar to how Mini-SWE-Agent has provided a great reference for implementing a minimal AI agent, it is desirable to have a good standalone minimal evaluation implementation for future research and adaptability to other tasks.
- Unified generation + evaluation for RL: For RL training, it is desirable to have a single framework that can support generation as well as evaluation.

Based on the above reasons, and given Mini-SWE-Agent's flexibility with different backends like Docker, Apptainer, BubbleWrap, etc., we think it would be great to have evaluation functionality in Mini-SWE-Agent. The primary goal is to let users easily evaluate SWEBench trajectories on their own infra. The implementation is also kept minimal to enable future research and easy experimentation.

Implementation
This PR adds a `swebench-eval` CLI to `mini-extra`.

Assumptions
We assume the following:
- [...] `mini-extra` previously, and have an output directory with a `preds.json` file
- [...] `eval_script` column that contains the evaluation script to run for that instance. For reference: https://huggingface.co/datasets/sumanthrh/swe-bench_verified

Overview
`swebench-eval` performs the following for each instance:
- [...] `preds.json` file.

Example usage
Design proposed by @pcmoritz
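
For readers who want a feel for the per-instance flow described in this PR, here is a minimal sketch. It is not the code in this PR. It assumes that `preds.json` maps each instance ID to a dict with a `model_patch` field (the usual SWE-bench predictions layout), that each instance has an `eval_script` string, and that an instance counts as resolved when the script exits with code 0. The names `load_predictions`, `evaluate`, and the `Runner` callable are hypothetical; `Runner` stands in for whichever backend (Docker, Apptainer, ...) applies the patch and runs the script.

```python
import json
from pathlib import Path
from typing import Callable

# Hypothetical backend hook: receives (model_patch, eval_script) and returns
# the exit code of the evaluation script after the patch has been applied.
Runner = Callable[[str, str], int]


def load_predictions(path: Path) -> dict[str, dict]:
    """Load preds.json, assumed to map instance_id -> prediction dict."""
    return json.loads(Path(path).read_text())


def evaluate(
    preds: dict[str, dict],
    eval_scripts: dict[str, str],
    run: Runner,
) -> dict[str, bool]:
    """Return instance_id -> resolved flag for every prediction."""
    results: dict[str, bool] = {}
    for instance_id, pred in preds.items():
        patch = pred.get("model_patch")
        script = eval_scripts.get(instance_id)
        # An empty patch or a missing eval script cannot be evaluated,
        # so the instance is counted as unresolved.
        if not patch or script is None:
            results[instance_id] = False
            continue
        # Assumption: exit code 0 from the eval script means "resolved".
        results[instance_id] = run(patch, script) == 0
    return results
```

Passing the backend in as a plain callable keeps the loop independent of where the script actually runs, which is the kind of flexibility the motivation section asks for.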