add swe bench / swe extra setup scripts #424
base: master
Conversation
👍 Looks good to me! Reviewed everything up to 49cc0d7 in 1 minute and 26 seconds
More details
- Looked at 771 lines of code in 5 files
- Skipped 0 files when reviewing.
- Skipped posting 8 drafted comments based on config settings.
1. gptme/eval/swe_extra/swe_bench_test_spec.py:128
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
While using shell=True can indeed be a security risk, this code is part of a test harness that already needs to execute arbitrary shell commands to run tests. The scripts being run are constructed from trusted sources (git commands, test frameworks, etc). Converting these to use argument lists would be complex and wouldn't meaningfully improve security since the code needs to execute shell commands anyway. The comment is technically correct but not practically useful in this context.
I might be underestimating the security risk. Even in a test environment, command injection could potentially be dangerous if the input data is not fully trusted.
The code is specifically designed to run arbitrary test commands in a controlled environment. The security risk from shell=True is minimal compared to the inherent risk of running tests, and making this change would add complexity without meaningful benefit.
While technically correct, this comment suggests a change that would add complexity without meaningful security benefits in this specific context. The comment should be removed.
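For illustration, a minimal sketch of the argument-list form the drafted comments refer to; the command and path below are hypothetical, not taken from swe_bench_test_spec.py:

import subprocess

repo_dir = "/tmp/example-repo"  # hypothetical path, for illustration only

# shell=True hands the whole string to /bin/sh, so anything untrusted in it
# can inject extra commands via shell metacharacters.
subprocess.run(f"git -C {repo_dir} status", shell=True, check=True)

# The argument-list form passes each argument directly to the program,
# with no shell in between to interpret metacharacters.
subprocess.run(["git", "-C", repo_dir, "status"], check=True)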
2. gptme/eval/swe_extra/swe_bench_test_spec.py:134
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Marked as duplicate.
3. gptme/eval/swe_extra/swe_bench_test_spec.py:140
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Marked as duplicate.
4. gptme/eval/swe_extra/swe_bench_extra_data.py:115
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
5. gptme/eval/swe_extra/swe_bench_extra_data.py:116
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
6. gptme/eval/swe_extra/swe_bench_extra_data.py:117
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
7. gptme/eval/swe_extra/swe_bench_extra_data.py:118
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
8. gptme/eval/swe_extra/swe_bench_test_spec.py:37
- Draft comment:
Using os.path.join for URLs can lead to incorrect URL formation. Consider using string concatenation or urllib.parse.urljoin for URLs.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
While technically correct that os.path.join isn't ideal for URLs (especially on Windows where it uses backslashes), in this specific case it's likely working fine since:
- The URLs are GitHub raw URLs which use forward slashes
- The code is clearly working as evidenced by the requests.get() calls
- The paths being joined are simple segments without special characters
- The code is part of a test harness, not production code
The comment raises a valid technical point - os.path.join can cause issues with URLs, especially on Windows systems. This could potentially break in some environments.
While technically correct, the current implementation appears to be working fine for its purpose in a test harness. The benefit of changing it doesn't outweigh the cost given the low risk in this specific context.
Delete the comment. While technically correct, the current implementation is working fine for its purpose and the suggested change would add complexity without significant benefit in this context.
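For reference, a small sketch of the difference this drafted comment is about; the base URL and file name are illustrative, not the ones used in swe_bench_test_spec.py:

import os.path
from urllib.parse import urljoin

base = "https://raw.githubusercontent.com/owner/repo/main"

# os.path.join uses the OS path separator, so on Windows it can put
# backslashes inside the URL.
print(os.path.join(base, "requirements.txt"))

# urljoin always joins with "/"; note the base needs a trailing slash,
# otherwise the last path segment is replaced rather than extended.
print(urljoin(base + "/", "requirements.txt"))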
Workflow ID: wflow_XyI8KJdp6my1PswB
Looks nice, want to get this semi-working, then merged in some shape (doesn't have to be perfect), and then I'll refine it from there in a new PR.
Could you provide the missing things? :)
from pathlib import Path
import glob
import os
from gptme.eval.agents.swebench import SWEBenchAgent |
Can I see this plz? :)
It was quite janky lol
import json
import logging
import os
from pathlib import Path
import time
import uuid

from gptme.cli import get_name
from gptme.dirs import get_logs_dir
from gptme.eval.agents.fix import Fix
from gptme.eval.agents.reproduce import Reproduce
from gptme.eval.agents.understand import Understand
from gptme.eval.swe_extra.swe_bench_test_spec import instance_to_trajectory_info, make_test_spec
from gptme.llm import set_stop
from gptme.logmanager import LogManager, SWEBenchInfo
from gptme.message import print_msg
from gptme.tools import execute_msg, init_tools
from gptme.tools.read import reset_file_read_cache, save_file_read_cache
from swebench.harness.constants import SWEbenchInstance

logger = logging.getLogger(__name__)


class SWEBenchAgent:
    stages = ["understand", "reproduce", "fix"]

    def act(
        self,
        model: str,
        instance: SWEbenchInstance,
        repo_dir: str,
        log_dir: str,
        resume: bool = False,
        start_stage: str = "understand",
        **kwargs
    ):
        # Initialize or load trajectory info
        trajectory_info = instance_to_trajectory_info(
            instance,
            model,
            repo_dir=repo_dir,
            log_dir=log_dir if resume else None
        )
        if not resume:
            trajectory_info.save_to_log_dir(log_dir)

        # Understand
        if self.stages.index(start_stage) <= self.stages.index("understand"):
            Understand().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("understand", {}))
            set_stop(["</planning>"])

        # Reproduce
        if self.stages.index(start_stage) <= self.stages.index("reproduce"):
            Reproduce().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("reproduce", {}))
            set_stop(["</planning>"])

        # Fix
        if self.stages.index(start_stage) <= self.stages.index("fix"):
            Fix().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("fix", {}))

        # reset_file_read_cache()  # maybe remove
        return trajectory_info.artifacts

    def get_resume_stage(self, log_dir: str) -> str:
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        if not understand_manager.log.messages:
            return "understand"
        elif not reproduce_manager.log.messages:
            return "reproduce"
        elif not fix_manager.log.messages:
            return "fix"
        return "understand"

    def replay(self, log_dir: str):
        logger.info(f"Replaying from log directory: {log_dir}")
        info = SWEBenchInfo.load_from_log_dir(log_dir)
        os.chdir(info.repo_dir)
        init_tools()
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        for msg in understand_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        files = {}
        save_file_read_cache(ignore_files=["understanding.md", "read_cache.json"])
        read_file_json = Path(info.repo_dir) / "read_cache.json"
        with open(read_file_json, "r") as f:
            files.update({"read_cache.json": json.load(f)})
        info.artifacts.update(files)
        info.save_to_log_dir(log_dir)
        for msg in reproduce_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        for msg in fix_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)

    def evaluate_instance(
        self,
        instance: SWEbenchInstance,
        model: str = "openrouter/qwen/qwen-2.5-coder-32b-instruct",
        resume_dir: Path | None = None,
        **kwargs
    ):
        instance_id = instance["instance_id"]
        problem_statement = instance["problem_statement"]
        info = SWEBenchInfo.load_from_log_dir(resume_dir) if resume_dir else None
        if resume_dir and not info:
            raise ValueError(f"No info found in {resume_dir}")
        test_spec = make_test_spec(instance, info.repo_dir if info else None)
        logger.info(f"Evaluating instance: {instance_id}")
        logger.debug(f"Problem statement: {problem_statement}")
        if resume_dir:
            log_dir = resume_dir
            logger.info(f"Resuming from log directory: {log_dir}")
            test_spec.reset_repo()
            self.replay(log_dir)
            repo_dir = info.repo_dir
        else:
            _id = uuid.uuid4().hex[:8]
            name = get_name(f"gptme-evals-{model.replace('/', '--')}-{_id}")
            log_dir = get_logs_dir() / name
            repo_dir = test_spec.setup_repo()
        start_time = time.time()
        try:
            logger.info(f"Executing agent for instance {instance_id}")
            logger.info(f"Setting up repo for instance {instance_id}")
            logger.info(f"Finished setting up repo for instance {instance_id} {repo_dir}")
            SWEBenchAgent().act(
                model=model,
                instance=instance,
                repo_dir=repo_dir,
                log_dir=log_dir,
                resume=bool(resume_dir),
                start_stage=self.get_resume_stage(log_dir) if resume_dir else "understand",
                **kwargs
            )
            gen_time = time.time() - start_time
            logger.info(
                f"Agent execution completed for instance {instance_id} in {gen_time:.2f} seconds"
            )
            passed = test_spec.eval_repo()
            logger.info(f"Evaluation completed for instance {instance_id}. Passed: {passed}")
        except Exception as e:
            import traceback
            logger.error(f"Error during agent execution for instance {instance_id}: {e}\n{''.join(traceback.format_tb(e.__traceback__))}")
from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_instance_by_id, load_top_50_easiest_task_instances
from gptme.dirs import get_logs_dir
from gptme.logmanager import LogManager, SWEBenchInfo |
And SWEBenchInfo?
from dataclasses import dataclass, field
from datetime import datetime
from os import PathLike
from pathlib import Path
from typing import Optional
import json


@dataclass(frozen=True)
class SWEBenchInfo:
    instance_id: str
    model_name: str
    target: bool  # Changed from int to bool to match dataset format
    exit_status: str | None = None
    generated_patch: str | None = None
    eval_logs: str | None = None
    artifacts: dict[str, str] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.now)
    repo_dir: str | None = None

    def to_dict(self) -> dict:
        """Convert to dictionary format matching SWE-bench dataset."""
        return {
            "instance_id": self.instance_id,
            "model_name": self.model_name,
            "target": self.target,
            "exit_status": self.exit_status,
            "generated_patch": self.generated_patch,
            "eval_logs": self.eval_logs,
            "timestamp": self.timestamp.isoformat(),
            "artifacts": self.artifacts,
            "repo_dir": self.repo_dir,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "SWEBenchInfo":
        """Create from dictionary, handling optional fields."""
        # Convert timestamp string to datetime if present
        if "timestamp" in d:
            d = d.copy()  # Make a copy to avoid modifying input
            d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)

    @classmethod
    def load_from_log_dir(cls, log_dir: PathLike) -> Optional["SWEBenchInfo"]:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        if not swe_bench_info_file.exists():
            return None
        with swe_bench_info_file.open() as f:
            return cls.from_dict(json.load(f))

    def save_to_log_dir(self, log_dir: PathLike) -> None:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        swe_bench_info_file.parent.mkdir(parents=True, exist_ok=True)
        json.dump(self.to_dict(), swe_bench_info_file.open("w"), indent=2)
I was using the branch system to save each step in the understand, reproduce, fix cycle separately so I could restore from checkpoints.
@bjsi Nice! Was the branch system a mess or did it make sense to you/work well?
@ErikBjare I didn't do extensive testing, but it seemed to work pretty well. I didn't notice any bugs.
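For context, a minimal sketch of the per-stage branch checkpointing described here, reusing the LogManager.load call shown in the snippet above; the log directory below is hypothetical:

from gptme.logmanager import LogManager

log_dir = "/tmp/gptme-evals-example"  # hypothetical log directory

# Each stage of the understand -> reproduce -> fix cycle is kept on its own
# branch, so an interrupted run can resume at the first branch with no messages.
for stage in ["understand", "reproduce", "fix"]:
    manager = LogManager.load(log_dir, lock=False, create=True, branch=stage)
    if not manager.log.messages:
        print(f"resume from stage: {stage}")
        break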
I extracted all of the changes from my branch and dumped them in this folder. It will still require a bit of work to integrate into the main branch, but the challenging part of installing requirements and deps should be taken care of.
Important

Add SWE-bench evaluation setup with scripts for running, data handling, and test specifications in gptme/eval/swe_extra.

- run_swe_extra.py to execute SWE-bench evaluations with options to resume from previous runs and clear branches.
- swe_bench_extra_data.py for loading and analyzing SWE-bench task instances and trajectories.
- swe_bench_constants.py for requirements paths, test frameworks, and version installations.
- swe_bench_test_spec.py to create and manage test specifications for SWE-bench instances.

This description was created by Ellipsis for 49cc0d7. It will automatically update as commits are pushed.
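As a rough usage sketch of how these pieces fit together; the exact signature of load_top_50_easiest_task_instances is not shown in this PR, so the call below is an assumption based on the imports quoted above:

from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_top_50_easiest_task_instances

# Assumed to return a list of SWEbenchInstance-style dicts, easiest first.
instances = load_top_50_easiest_task_instances()

# evaluate_instance() sets up the repo, runs the understand/reproduce/fix
# stages, and evaluates the resulting patch (see SWEBenchAgent above).
SWEBenchAgent().evaluate_instance(
    instances[0],
    model="openrouter/qwen/qwen-2.5-coder-32b-instruct",
)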