[SWEBench] Add evaluation support for SWEBench #501
Open: SumanthRH wants to merge 16 commits into `SWE-agent:main` from `SumanthRH:minimal-eval-support`.
Commits (16, all by SumanthRH):

- `54a3d84` initial commit
- `e3bf877` init
- `0927bd1` x
- `2b8c37b` swe
- `5603a1e` x
- `f504ade` x
- `9d0b05d` x
- `bb9c0bd` x
- `7ede16e` x
- `3985bde` WIP minimal eval
- `78b8eb5` x
- `8c156d6` update
- `56d0ba0` more cleanup
- `bb1cbb5` use patch in line
- `3d6fe48` remove separate yaml
- `7730532` x
New documentation file (+83 lines):

# SWE-bench Evaluation

!!! abstract "Overview"

    * Mini-SWE-Agent provides a native script for evaluating trajectories on SWE-bench.
    * The evaluation script is designed to be minimalistic and is meant to be used in batch mode.

## Usage

!!! tip "Quickstart"

    ```bash
    mini-extra swebench-eval --help
    # or
    python src/minisweagent/run/extra/swebench_eval.py --help
    # Example:
    mini-extra swebench-eval \
        --model claude-sonnet-4-20250514 \
        --split test \
        --workers 4 \
        --dataset SumanthRH/SWE-Bench_Verified
    ```
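The `--output` directory (see the flags below) must contain a `preds.json` keyed by instance ID. A minimal sketch of the expected shape, inferred from how `swebench_eval.py` reads the file (the instance ID and patch here are illustrative examples, not real predictions):

```python
import json
from pathlib import Path

# Minimal preds.json: instance IDs mapping to records with a `model_patch`
# (shape inferred from how swebench_eval.py consumes the file).
preds = {
    "astropy__astropy-12907": {
        "instance_id": "astropy__astropy-12907",
        "model_name_or_path": "claude-sonnet-4-20250514",
        "model_patch": "diff --git a/f.py b/f.py\n...",
    }
}
Path("preds.json").write_text(json.dumps(preds, indent=2))
```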
Basic flags:

- `-o`, `--output` - Output directory containing generated trajectories in a `preds.json` file. This is the output from `mini-extra swebench` or `mini-extra swebench-single`.
- `-m`, `--model` - Model to use
- `-c`, `--config` - Path to a config file (default: `swebench.yaml` in the `config` directory)
- `-w`, `--workers` - Number of worker threads for parallel processing (default: `1`)

Data selection flags:

- `--dataset` - SWE-bench dataset to use, or a path to a dataset. In addition to the standard SWE-bench dataset columns, this should contain an `eval_script` column with the evaluation script for that instance. (default: `SumanthRH/SWE-Bench_Verified`)
- `--split` - Dataset split (default: `dev`)
- `--slice` - Slice specification (e.g., `'0:5'` for the first 5 instances)
- `--filter` - Filter instance IDs by regex
- `--shuffle` - Shuffle instances (default: `False`)
- `--redo-existing` - Redo existing instances (default: `False`)
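As an illustration of the `--slice` format, a hypothetical helper along these lines (the actual parsing lives in `filter_instances` in the run script and may differ):

```python
def apply_slice(instances: list, slice_spec: str) -> list:
    """Apply a 'start:stop' slice spec such as '0:5' (hypothetical sketch)."""
    if not slice_spec:
        return instances
    start, _, stop = slice_spec.partition(":")
    return instances[int(start) if start else None : int(stop) if stop else None]

print(apply_slice(list(range(10)), "0:5"))  # → [0, 1, 2, 3, 4]
```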
Advanced flags:

- `--environment-class` - Environment type to use (recommended: `docker` or `singularity`)

## Design

`swebench-eval` performs the following steps for each instance:

* Load the trajectory from the `preds.json` file.
* Initialize the environment for the given instance using the given backend.
* Apply the model's git patch to the working directory in the environment.
* Run the evaluation script for the instance. If the script exits successfully, the instance is considered resolved; otherwise it is marked unresolved.
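The per-instance flow above can be sketched roughly as follows. This is a simplified sketch of what `swebench_eval.py` does: `env.execute` mirrors the real environment interface, while the error handling and file bookkeeping of the real script are omitted.

```python
import uuid

def evaluate_one(env, model_patch: str, eval_script: str) -> dict:
    """Simplified sketch of the per-instance evaluation flow."""
    result = {"resolved": False, "eval_error": None}
    # Apply the model's patch via a heredoc so `git apply` reads it from stdin
    delim = f"PATCH_{uuid.uuid4().hex}"
    obs = env.execute(f"git apply <<'{delim}'\n{model_patch}\n{delim}")
    if obs["returncode"] != 0:
        result["eval_error"] = obs["output"]
        return result
    # Run the instance's evaluation script; exit code 0 means resolved
    obs = env.execute(f"bash <<'EOF'\n{eval_script}\nEOF")
    result["resolved"] = obs["returncode"] == 0
    result["eval_error"] = None if result["resolved"] else obs["output"]
    return result
```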
!!! tip "Preparing the evaluation script"

    `swebench-eval` only checks whether the final return code from running the evaluation script is 0. Ideally, your evaluation script should be written so that this final return code indicates the success or failure of the evaluation. See [SumanthRH/SWE-Bench_Verified](https://huggingface.co/datasets/SumanthRH/SWE-bench_Verified) for reference.

!!! tip "Size of the git patch"

    `swebench-eval` applies the model's git patch by providing it as an in-line argument to the `git apply` command. This means that the size of the git patch is limited by the OS limit `ARG_MAX`.
    On modern systems this is typically ~1 MB, which is pretty generous.
    For simplicity, we assume that patches larger than `ARG_MAX` are meant to fail.
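To check this limit on your own machine, a POSIX-only sketch (the script itself does not perform this check, and the helper name and margin are illustrative):

```python
import os

def patch_fits_arg_max(model_patch: str, margin: int = 64 * 1024) -> bool:
    """Return True if the patch is comfortably below the OS ARG_MAX limit.

    POSIX-only sketch; `margin` leaves headroom for the rest of the
    `git apply` command line and the environment block.
    """
    arg_max = os.sysconf("SC_ARG_MAX")
    return len(model_patch.encode()) + margin < arg_max

print(patch_fits_arg_max("diff --git a/f b/f\n"))
```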
## Implementation

??? note "Default config"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/config/extra/swebench.yaml)

    ```yaml
    --8<-- "src/minisweagent/config/extra/swebench.yaml"
    ```

??? note "`swebench_eval.py` run script"

    - [Read on GitHub](https://github.com/swe-agent/mini-swe-agent/blob/main/src/minisweagent/run/extra/swebench_eval.py)
    - [API reference](../reference/run/swebench_eval.md)

    ```python
    --8<-- "src/minisweagent/run/extra/swebench_eval.py"
    ```

{% include-markdown "../_footer.md" %}
New file (+5 lines):

# SWE-bench

::: minisweagent.run.extra.swebench_eval

{% include-markdown "../../_footer.md" %}
`src/minisweagent/run/extra/swebench_eval.py` (new file, +185 lines):

```python
#!/usr/bin/env python3

"""Evaluate mini-SWE-agent trajectories for SWE-bench."""
# Read this first: https://mini-swe-agent.com/latest/usage/swebench/ (usage docs)

import concurrent.futures
import json
import threading
import time
import traceback
import uuid
from pathlib import Path
from typing import Any

import typer
import yaml
from datasets import load_dataset
from rich.live import Live

from minisweagent.config import builtin_config_dir, get_config_path
from minisweagent.models import get_model
from minisweagent.run.extra.swebench import filter_instances, get_sb_environment
from minisweagent.run.extra.utils.batch_progress import RunBatchProgressManager
from minisweagent.utils.log import add_file_handler, logger

_HELP_TEXT = """Evaluate mini-SWE-agent trajectories for SWE-bench."""

app = typer.Typer(rich_markup_mode="rich", add_completion=False)


_OUTPUT_FILE_LOCK = threading.Lock()


def update_evals_file(output_path: Path, instance_id: str, model_name: str, model_patch: str, eval_report: dict):
    """Update the output JSON file with results from a single instance."""
    with _OUTPUT_FILE_LOCK:
        output_data = {}
        if output_path.exists():
            output_data = json.loads(output_path.read_text())
        output_data[instance_id] = {
            "model_name_or_path": model_name,
            "instance_id": instance_id,
            "model_patch": model_patch,
            "eval_report": eval_report,
        }
        output_path.write_text(json.dumps(output_data, indent=2))


def remove_from_evals_file(output_path: Path, instance_id: str):
    """Remove an instance from the predictions file."""
    if not output_path.exists():
        return
    with _OUTPUT_FILE_LOCK:
        output_data = json.loads(output_path.read_text())
        if instance_id in output_data:
            del output_data[instance_id]
        output_path.write_text(json.dumps(output_data, indent=2))


def evaluate_instance(
    instance: dict,
    output_dir: Path,
    instance_result: dict[str, Any],
    config: dict,
) -> None:
    """Process a single SWE-bench instance."""
    instance_id = instance["instance_id"]
    output_path = output_dir / "evals.json"

    # Avoid inconsistent state if something here fails and there are leftover files from a previous run
    remove_from_evals_file(output_path, instance_id)
    model = get_model(config=config.get("model", {}))

    ret = {"instance_id": instance_id, "resolved": False, "eval_error": None}

    env = None
    try:
        env = get_sb_environment(config, instance)
    except Exception as e:
        ret["eval_error"] = f"Env creation failed with {e}"
        logger.info(f"Starting environment failed with exception: {e}\n{traceback.format_exc()}")
        update_evals_file(output_path, instance_id, model.config.model_name, instance_result["model_patch"], ret)
        return

    # Apply the git patch.
    # NOTE (sumanthrh): This applies the patch in-line, so the maximum patch size is limited by the OS limit `ARG_MAX`.
    # On modern systems this is typically ~1 MB, which is pretty generous.
    # For simplicity, we assume that patches larger than `ARG_MAX` are meant to fail.
    delimiter = f"PATCH_{uuid.uuid4().hex}"  # unlikely to collide with symbols in the patch
    command = f"git apply <<'{delimiter}'\n{instance_result['model_patch']}\n{delimiter}"

    obs = env.execute(command)

    if obs["returncode"] != 0:
        ret["eval_error"] = obs["output"]
    else:
        # Run the eval script in-line
        eval_script = instance["eval_script"]
        eval_cmd = f"bash <<'EOF'\n{eval_script}\nEOF"
        obs = env.execute(eval_cmd)
        # Use the return code to decide resolved status
        ret["resolved"] = obs["returncode"] == 0
        ret["eval_error"] = obs["output"] if not ret["resolved"] else None
    update_evals_file(output_path, instance_id, model.config.model_name, instance_result["model_patch"], ret)


# fmt: off
@app.command(help=_HELP_TEXT)
def main(
    dataset: str = typer.Option("SumanthRH/SWE-Bench_Verified", "--dataset", help="Path to the SWE-bench dataset to use. Should include an `eval_script` column specifying the evaluation script for each instance", rich_help_panel="Data selection"),
    split: str = typer.Option("dev", "--split", help="Dataset split", rich_help_panel="Data selection"),
    slice_spec: str = typer.Option("", "--slice", help="Slice specification (e.g., '0:5' for first 5 instances)", rich_help_panel="Data selection"),
    filter_spec: str = typer.Option("", "--filter", help="Filter instance IDs by regex", rich_help_panel="Data selection"),
    shuffle: bool = typer.Option(False, "--shuffle", help="Shuffle instances", rich_help_panel="Data selection"),
    output: str = typer.Option("", "-o", "--output", help="Output directory. Should contain a preds.json file with model predictions", rich_help_panel="Basic"),
    workers: int = typer.Option(1, "-w", "--workers", help="Number of worker threads for parallel processing", rich_help_panel="Basic"),
    model: str | None = typer.Option(None, "-m", "--model", help="Model to use", rich_help_panel="Basic"),
    model_class: str | None = typer.Option(None, "--model-class", help="Model class to use (e.g., 'anthropic' or 'minisweagent.models.anthropic.AnthropicModel')", rich_help_panel="Advanced"),
    redo_existing: bool = typer.Option(False, "--redo-existing", help="Redo existing instances", rich_help_panel="Data selection"),
    config_spec: Path = typer.Option(builtin_config_dir / "extra" / "swebench.yaml", "-c", "--config", help="Path to a config file", rich_help_panel="Basic"),
    environment_class: str | None = typer.Option(None, "--environment-class", help="Environment type to use. Recommended are docker or singularity", rich_help_panel="Advanced"),
) -> None:
    # fmt: on
    output_path = Path(output)

    predictions_file = output_path / "preds.json"
    if not predictions_file.exists():
        raise FileNotFoundError(f"Expected a `preds.json` file in output directory {output_path}")
    logger.info(f"Results will be saved to {output_path}")
    add_file_handler(output_path / "minisweagent_eval.log")

    logger.info(f"Loading dataset {dataset}, split {split}...")
    instances = list(load_dataset(dataset, split=split))

    instances = filter_instances(instances, filter_spec=filter_spec, slice_spec=slice_spec, shuffle=shuffle)
    if not redo_existing and (output_path / "evals.json").exists():
        existing_instances = list(json.loads((output_path / "evals.json").read_text()).keys())
        logger.info(f"Skipping {len(existing_instances)} existing instances")
        instances = [instance for instance in instances if instance["instance_id"] not in existing_instances]
    logger.info(f"Running on {len(instances)} instances...")

    with open(output_path / "preds.json") as f:
        predictions = json.load(f)

    config = yaml.safe_load(get_config_path(config_spec).read_text())
    if environment_class is not None:
        config.setdefault("environment", {})["environment_class"] = environment_class
    if model is not None:
        config.setdefault("model", {})["model_name"] = model
    if model_class is not None:
        config.setdefault("model", {})["model_class"] = model_class

    progress_manager = RunBatchProgressManager(len(instances), output_path / f"exit_statuses_eval_{time.time()}.yaml")

    def process_futures(futures: dict[concurrent.futures.Future, str]):
        for future in concurrent.futures.as_completed(futures):
            try:
                future.result()
            except concurrent.futures.CancelledError:
                pass
            except Exception as e:
                instance_id = futures[future]
                logger.error(f"Error in future for instance {instance_id}: {e}", exc_info=True)
                progress_manager.on_uncaught_exception(instance_id, e)

    with Live(progress_manager.render_group, refresh_per_second=4):
        with concurrent.futures.ThreadPoolExecutor(max_workers=workers) as executor:
            futures = {
                executor.submit(
                    evaluate_instance, instance, output_path, predictions[instance["instance_id"]], config
                ): instance["instance_id"]
                for instance in instances
            }
            try:
                process_futures(futures)
            except KeyboardInterrupt:
                logger.info("Cancelling all pending jobs. Press ^C again to exit immediately.")
                for future in futures:
                    if not future.running() and not future.done():
                        future.cancel()
                process_futures(futures)


if __name__ == "__main__":
    app()
```
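After a run completes, `evals.json` maps instance IDs to records containing an `eval_report`. A small sketch for summarizing the resolved rate from that file (a hypothetical helper, not part of this PR):

```python
import json
from pathlib import Path

def resolved_rate(evals_path: Path) -> float:
    """Fraction of evaluated instances whose eval_report marks them resolved."""
    evals = json.loads(evals_path.read_text())
    if not evals:
        return 0.0
    resolved = sum(1 for rec in evals.values() if rec["eval_report"]["resolved"])
    return resolved / len(evals)
```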
I currently observe pre-commit failures locally, probably due to issues with the latest `typos` package (1.35.7).