[SWEBench] Add evaluation support for SWEBench #501
Open: SumanthRH wants to merge 16 commits into `SWE-agent:main` from `SumanthRH:minimal-eval-support`.
Commits (16, all by SumanthRH):

- `54a3d84` initial commit
- `e3bf877` init
- `0927bd1` x
- `2b8c37b` swe
- `5603a1e` x
- `f504ade` x
- `9d0b05d` x
- `bb9c0bd` x
- `7ede16e` x
- `3985bde` WIP minimal eval
- `78b8eb5` x
- `8c156d6` update
- `56d0ba0` more cleanup
- `bb1cbb5` use patch in line
- `3d6fe48` remove separate yaml
- `7730532` x
New documentation file (+83 lines):

# SWE-bench Evaluation

!!! abstract "Overview"

    * Mini-SWE-Agent provides a native script for evaluating trajectories on SWE-bench.
    * The evaluation script is designed to be minimalistic and is meant to be used in batch mode.

## Usage

!!! tip "Quickstart"

    ```bash
    mini-extra swebench-eval --help
    # or
    python src/minisweagent/run/extra/swebench_eval.py --help
    # Example:
    mini-extra swebench-eval \
        --model claude-sonnet-4-20250514 \
        --split test \
        --workers 4 \
        --dataset SumanthRH/SWE-Bench_Verified
    ```
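The `--output` directory (see the flags below) must contain a `preds.json` keyed by instance ID. A minimal sketch of the expected shape, inferred from how `swebench_eval.py` reads the file (the instance ID and patch here are illustrative examples, not real predictions):

```python
import json
from pathlib import Path

# Minimal preds.json: instance IDs mapping to records with a `model_patch`
# (shape inferred from how swebench_eval.py consumes the file).
preds = {
    "astropy__astropy-12907": {
        "instance_id": "astropy__astropy-12907",
        "model_name_or_path": "claude-sonnet-4-20250514",
        "model_patch": "diff --git a/f.py b/f.py\n...",
    }
}
Path("preds.json").write_text(json.dumps(preds, indent=2))
```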
Basic flags:

- `-o`, `--output` - Output directory containing generated trajectories in a `preds.json` file. This is the output from `mini-extra swebench` or `mini-extra swebench-single`.
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)

Data selection flags:

- `--dataset` - SWE-bench dataset to use, or a path to a dataset. In addition to the standard SWE-bench dataset columns, this should contain an `eval_script` column with the evaluation script for that instance. (default: `SumanthRH/SWE-Bench_Verified`)
- `--split` - Dataset split (default: `dev`)
- `--slice` - Slice specification (e.g., `'0:5'` for the first 5 instances)
- `--filter` - Filter instance IDs by regex
- `--shuffle` - Shuffle instances (default: `False`)
- `--redo-existing` - Redo existing instances (default: `False`)
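As an illustration of the `--slice` format, a hypothetical helper along these lines (the actual parsing lives in `filter_instances` in the run script and may differ):

```python
def apply_slice(instances: list, slice_spec: str) -> list:
    """Apply a 'start:stop' slice spec such as '0:5' (hypothetical sketch)."""
    if not slice_spec:
        return instances
    start, _, stop = slice_spec.partition(":")
    return instances[int(start) if start else None : int(stop) if stop else None]

print(apply_slice(list(range(10)), "0:5"))  # → [0, 1, 2, 3, 4]
```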
Advanced flags:

- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)

## Design

`swebench-eval` performs the following steps for each instance:

* Load the trajectory from the `preds.json` file.
* Initialize the environment for the given instance using the given backend.
* Apply the model's git patch to the working directory in the environment.
* Run the evaluation script for the instance. If the script exits successfully, the instance is considered resolved; otherwise it is marked unresolved.
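The per-instance flow above can be sketched roughly as follows. This is a simplified sketch of what `swebench_eval.py` does: `env.execute` mirrors the real environment interface, while the error handling and file bookkeeping of the real script are omitted.

```python
import uuid

def evaluate_one(env, model_patch: str, eval_script: str) -> dict:
    """Simplified sketch of the per-instance evaluation flow."""
    result = {"resolved": False, "eval_error": None}
    # Apply the model's patch via a heredoc so `git apply` reads it from stdin
    delim = f"PATCH_{uuid.uuid4().hex}"
    obs = env.execute(f"git apply <<'{delim}'\n{model_patch}\n{delim}")
    if obs["returncode"] != 0:
        result["eval_error"] = obs["output"]
        return result
    # Run the instance's evaluation script; exit code 0 means resolved
    obs = env.execute(f"bash <<'EOF'\n{eval_script}\nEOF")
    result["resolved"] = obs["returncode"] == 0
    result["eval_error"] = None if result["resolved"] else obs["output"]
    return result
```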
!!! tip "Preparing the evaluation script"

    `swebench-eval` only checks whether the final return code from running the evaluation script is 0. Ideally, your evaluation script should be written so that this final return code indicates the success or failure of the evaluation. See [SumanthRH/SWE-Bench_Verified](https://huggingface.co/datasets/SumanthRH/SWE-bench_Verified) for reference.

!!! tip "Size of the git patch"

    `swebench-eval` applies the model's git patch by providing it as an in-line argument to the `git apply` command. This means that the size of the git patch is limited by the OS limit `ARG_MAX`.
    On modern systems this is typically ~1 MB, which is pretty generous.
    For simplicity, we assume that patches larger than `ARG_MAX` are meant to fail.
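To check this limit on your own machine, a POSIX-only sketch (the script itself does not perform this check, and the helper name and margin are illustrative):

```python
import os

def patch_fits_arg_max(model_patch: str, margin: int = 64 * 1024) -> bool:
    """Return True if the patch is comfortably below the OS ARG_MAX limit.

    POSIX-only sketch; `margin` leaves headroom for the rest of the
    `git apply` command line and the environment block.
    """
    arg_max = os.sysconf("SC_ARG_MAX")
    return len(model_patch.encode()) + margin < arg_max

print(patch_fits_arg_max("diff --git a/f b/f\n"))
```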
## Implementation

??? note "Default config"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/config/extra/swebench.yaml)

    ```yaml
    --8<-- "src/minisweagent/config/extra/swebench.yaml"
    ```

??? note "`swebench_eval.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/extra/swebench_eval.py)
    - [API reference](../reference/run/swebench_eval.md)

    ```python
    --8<-- "src/minisweagent/run/extra/swebench_eval.py"
    ```

{% include-markdown "../_footer.md" %}
New file (+5 lines):

# SWE-bench

::: minisweagent.run.extra.swebench_eval

{% include-markdown "../../_footer.md" %}
`src/minisweagent/run/extra/swebench_eval.py` (new file, +185 lines):

```python
#!/usr/bin/env python3

"""Evaluate mini-SWE-agent trajectories for SWE-bench."""
# Read this first: https://mini-swe-agent.com/latest/usage/swebench/ (usage docs)

import concurrent.futures
import json
import threading
import time
import traceback
import uuid
from pathlib import Path
from typing import Any

import typer
import yaml
from datasets import load_dataset
from rich.live import Live

from minisweagent.config import builtin_config_dir, get_config_path
from minisweagent.models import get_model
from minisweagent.run.extra.swebench import filter_instances, get_sb_environment
from minisweagent.run.extra.utils.batch_progress import RunBatchProgressManager
from minisweagent.utils.log import add_file_handler, logger

_HELP_TEXT = """Evaluate mini-SWE-agent trajectories for SWE-bench."""

app = typer.Typer(rich_markup_mode="rich", add_completion=False)


_OUTPUT_FILE_LOCK = threading.Lock()


def update_evals_file(output_path: Path, instance_id: str, model_name: str, model_patch: str, eval_report: dict):
    """Update the output JSON file with results from a single instance."""
    with _OUTPUT_FILE_LOCK:
        output_data = {}
        if output_path.exists():
            output_data = json.loads(output_path.read_text())
        output_data[instance_id] = {
            "model_name_or_path": model_name,
            "instance_id": instance_id,
            "model_patch": model_patch,
            "eval_report": eval_report,
        }
        output_path.write_text(json.dumps(output_data, indent=2))


def remove_from_evals_file(output_path: Path, instance_id: str):
    """Remove an instance from the predictions file."""
    if not output_path.exists():
        return
    with _OUTPUT_FILE_LOCK:
        output_data = json.loads(output_path.read_text())
        if instance_id in output_data:
            del output_data[instance_id]
        output_path.write_text(json.dumps(output_data, indent=2))


def evaluate_instance(
    instance: dict,
    output_dir: Path,
    instance_result: dict[str, Any],
    config: dict,
) -> None:
    """Process a single SWE-bench instance."""
    instance_id = instance["instance_id"]
    output_path = output_dir / "evals.json"

    # Avoid inconsistent state if something here fails and there are leftover files from a previous run
    remove_from_evals_file(output_path, instance_id)
    model = get_model(config=config.get("model", {}))

    ret = {"instance_id": instance_id, "resolved": False, "eval_error": None}

    env = None
    try:
        env = get_sb_environment(config, instance)
    except Exception as e:
        ret["eval_error"] = f"Env creation failed with {e}"
        logger.info(f"Starting environment failed with exception: {e}\n{traceback.format_exc()}")
        update_evals_file(output_path, instance_id, model.config.model_name, instance_result["model_patch"], ret)
        return

    # Apply the git patch.
    # NOTE (sumanthrh): This applies the patch in-line, so the maximum patch size is limited by the OS limit `ARG_MAX`.
    # On modern systems this is typically ~1 MB, which is pretty generous.
    # For simplicity, we assume that patches larger than `ARG_MAX` are meant to fail.
    delimiter = f"PATCH_{uuid.uuid4().hex}"  # unlikely to collide with symbols in the patch
    command = f"git apply <<'{delimiter}'\n{instance_result['model_patch']}\n{delimiter}"

    obs = env.execute(command)

    if obs["returncode"] != 0:
        ret["eval_error"] = obs["output"]
    else:
        # Run the eval script in-line
        eval_script = instance["eval_script"]
        eval_cmd = f"bash <<'EOF'\n{eval_script}\nEOF"
        obs = env.execute(eval_cmd)
        # Use the return code to decide resolved status
        ret["resolved"] = obs["returncode"] == 0
        ret["eval_error"] = obs["output"] if not ret["resolved"] else None
    update_evals_file(output_path, instance_id, model.config.model_name, instance_result["model_patch"], ret)


# fmt: off
@app.command(help=_HELP_TEXT)
def main(
    dataset: str = typer.Option("SumanthRH/SWE-Bench_Verified", "--dataset", help="Path to the SWE-bench dataset to use. Should include an `eval_script` column specifying the evaluation script for each instance", rich_help_panel="Data selection"),
    split: str = typer.Option("dev", "--split", help="Dataset split", rich_help_panel="Data selection"),
    slice_spec: str = typer.Option("", "--slice", help="Slice specification (e.g., '0:5' for first 5 instances)", rich_help_panel="Data selection"),
    filter_spec: str = typer.Option("", "--filter", help="Filter instance IDs by regex", rich_help_panel="Data selection"),
    shuffle: bool = typer.Option(False, "--shuffle", help="Shuffle instances", rich_help_panel="Data selection"),
    output: str = typer.Option("", "-o", "--output", help="Output directory. Should contain a preds.json file with model predictions", rich_help_panel="Basic"),
    workers: int = typer.Option(1, "-w", "--workers", help="Number of worker threads for parallel processing", rich_help_panel="Basic"),
    model: str | None = typer.Option(None, "-m", "--model", help="Model to use", rich_help_panel="Basic"),
    model_class: str | None = typer.Option(None, "--model-class", help="Model class to use (e.g., 'anthropic' or 'minisweagent.models.anthropic.AnthropicModel')", rich_help_panel="Advanced"),
    redo_existing: bool = typer.Option(False, "--redo-existing", help="Redo existing instances", rich_help_panel="Data selection"),
    config_spec: Path = typer.Option(builtin_config_dir / "extra" / "swebench.yaml", "-c", "--config", help="Path to a config file", rich_help_panel="Basic"),
    environment_class: str | None = typer.Option(None, "--environment-class", help="Environment type to use. Recommended are docker or singularity", rich_help_panel="Advanced"),
) -> None:
    # fmt: on
    output_path = Path(output)

    predictions_file = output_path / "preds.json"
    if not predictions_file.exists():
        raise FileNotFoundError(f"Expected a `preds.json` file in output directory {output_path}")
    logger.info(f"Results will be saved to {output_path}")
    add_file_handler(output_path / "minisweagent_eval.log")

    logger.info(f"Loading dataset {dataset}, split {split}...")
    instances = list(load_dataset(dataset, split=split))

    instances = filter_instances(instances, filter_spec=filter_spec, slice_spec=slice_spec, shuffle=shuffle)
    if not redo_existing and (output_path / "evals.json").exists():
        existing_instances = list(json.loads((output_path / "evals.json").read_text()).keys())
        logger.info(f"Skipping {len(existing_instances)} existing instances")
        instances = [instance for instance in instances if instance["instance_id"] not in existing_instances]
    logger.info(f"Running on {len(instances)} instances...")

    with open(output_path / "preds.json") as f:
        predictions = json.load(f)

    config = yaml.safe_load(get_config_path(config_spec).read_text())
    if environment_class is not None:
        config.setdefault("environment", {})["environment_class"] = environment_class
    if model is not None:
        config.setdefault("model", {})["model_name"] = model
    if model_class is not None:
        config.setdefault("model", {})["model_class"] = model_class

    progress_manager = RunBatchProgressManager(len(instances), output_path / f"exit_statuses_eval_{time.time()}.yaml")

    def process_futures(futures: dict[concurrent.futures.Future, str]):
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except concurrent.futures.CancelledError:
                pass
            except Exception as e:
                instance_id = futures[future]
                logger.error(f"Error in future for instance {instance_id}: {e}", exc_info=True)
                progress_manager.on_uncaught_exception(instance_id, e)

    with Live(progress_manager.render_group, refresh_per_second=4):
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            futures = {
                executor.submit(
                    evaluate_instance, instance, output_path, predictions[instance["instance_id"]], config
                ): instance["instance_id"]
                for instance in instances
            }
            try:
                process_futures(futures)
            except KeyboardInterrupt:
                logger.info("Cancelling all pending jobs. Press ^C again to exit immediately.")
                for future in futures:
                    if not future.running() and not future.done():
                        future.cancel()
                process_futures(futures)


if __name__ == "__main__":
    app()
```
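After a run completes, `evals.json` maps instance IDs to records containing an `eval_report`. A small sketch for summarizing the resolved rate from that file (a hypothetical helper, not part of this PR):

```python
import json
from pathlib import Path

def resolved_rate(evals_path: Path) -> float:
    """Fraction of evaluated instances whose eval_report marks them resolved."""
    evals = json.loads(evals_path.read_text())
    if not evals:
        return 0.0
    resolved = sum(1 for rec in evals.values() if rec["eval_report"]["resolved"])
    return resolved / len(evals)
```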
I currently observe pre-commit failures locally, probably due to issues with the latest `typos` package (1.35.7).