
add swe bench / swe extra setup scripts #424

Open
bjsi wants to merge 1 commit into master

Conversation

@bjsi (Contributor) commented on Jan 26, 2025

I extracted all of the changes from my branch and dumped them in this folder. It will still require a bit of work to integrate into the main branch, but the challenging part of installing requirements and dependencies should be taken care of.


Important

Add SWE-bench evaluation setup with scripts for running, data handling, and test specifications in gptme/eval/swe_extra.

  • Behavior:
    • Adds run_swe_extra.py to execute SWE-bench evaluations with options to resume from previous runs and clear branches.
    • Introduces swe_bench_extra_data.py for loading and analyzing SWE-bench task instances and trajectories.
    • Implements caching for the top 50 easiest task instances in swe_bench_extra_data.py (a sketch of this pattern follows the list).
  • Constants and Configurations:
    • Defines placeholders and default mappings in swe_bench_constants.py for requirements paths, test frameworks, and version installations.
  • Test Specifications:
    • Adds swe_bench_test_spec.py to create and manage test specifications for SWE-bench instances.
    • Provides functions to generate setup and evaluation scripts for repositories in swe_bench_test_spec.py.
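
As a rough illustration of the caching behavior mentioned above, here is a minimal sketch of a disk-backed "top 50 easiest" loader. The ranking criterion, cache path, and function parameters are assumptions for illustration only; the actual implementation in swe_bench_extra_data.py is not shown in this thread and may differ.

import json
from pathlib import Path

# Assumed cache location; the real module may cache elsewhere.
CACHE_FILE = Path("~/.cache/gptme/swe_bench_extra_top50.json").expanduser()

def load_top_50_easiest_task_instances(load_all_instances, difficulty_key):
    """Return the 50 lowest-difficulty instances, caching the result on disk.

    load_all_instances and difficulty_key are hypothetical parameters: a callable
    returning all task instances and a key function that ranks them by difficulty.
    """
    if CACHE_FILE.exists():
        return json.loads(CACHE_FILE.read_text())
    instances = sorted(load_all_instances(), key=difficulty_key)[:50]
    CACHE_FILE.parent.mkdir(parents=True, exist_ok=True)
    CACHE_FILE.write_text(json.dumps(instances))
    return instances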

This description was created by Ellipsis for 49cc0d7. It will automatically update as commits are pushed.

@ellipsis-dev bot (Contributor) left a comment

👍 Looks good to me! Reviewed everything up to 49cc0d7 in 1 minute and 26 seconds

More details
  • Looked at 771 lines of code in 5 files
  • Skipped 0 files when reviewing.
  • Skipped posting 8 drafted comments based on config settings.
1. gptme/eval/swe_extra/swe_bench_test_spec.py:128
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While using shell=True can indeed be a security risk, this code is part of a test harness that already needs to execute arbitrary shell commands to run tests. The scripts being run are constructed from trusted sources (git commands, test frameworks, etc). Converting these to use argument lists would be complex and wouldn't meaningfully improve security since the code needs to execute shell commands anyway. The comment is technically correct but not practically useful in this context.
    I might be underestimating the security risk. Even in a test environment, command injection could potentially be dangerous if the input data is not fully trusted.
    The code is specifically designed to run arbitrary test commands in a controlled environment. The security risk from shell=True is minimal compared to the inherent risk of running tests, and making this change would add complexity without meaningful benefit.
    While technically correct, this comment suggests a change that would add complexity without meaningful security benefits in this specific context. The comment should be removed.
2. gptme/eval/swe_extra/swe_bench_test_spec.py:134
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
3. gptme/eval/swe_extra/swe_bench_test_spec.py:140
  • Draft comment:
    Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoid shell=True if possible.
  • Reason this comment was not posted:
    Marked as duplicate.
4. gptme/eval/swe_extra/swe_bench_extra_data.py:115
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
5. gptme/eval/swe_extra/swe_bench_extra_data.py:116
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
6. gptme/eval/swe_extra/swe_bench_extra_data.py:117
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
7. gptme/eval/swe_extra/swe_bench_extra_data.py:118
  • Draft comment:
    Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
  • Reason this comment was not posted:
    Confidence changes required: 50%
    The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
8. gptme/eval/swe_extra/swe_bench_test_spec.py:37
  • Draft comment:
    Using os.path.join for URLs can lead to incorrect URL formation. Consider using string concatenation or urllib.parse.urljoin for URLs.
  • Reason this comment was not posted:
    Decided after close inspection that this draft comment was likely wrong and/or not actionable:
    While technically correct that os.path.join isn't ideal for URLs (especially on Windows where it uses backslashes), in this specific case it's likely working fine since:
  1. The URLs are GitHub raw URLs which use forward slashes
  2. The code is clearly working as evidenced by the requests.get() calls
  3. The paths being joined are simple segments without special characters
  4. The code is part of a test harness, not production code
    The comment raises a valid technical point - os.path.join can cause issues with URLs, especially on Windows systems. This could potentially break in some environments.
    While technically correct, the current implementation appears to be working fine for its purpose in a test harness. The benefit of changing it doesn't outweigh the cost given the low risk in this specific context.
    Delete the comment. While technically correct, the current implementation is working fine for its purpose and the suggested change would add complexity without significant benefit in this context.
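
For reference, a minimal sketch of the alternatives these drafted comments point to, assuming nothing beyond the standard library and pandas (the URL and column names are placeholders, not code from this PR): an argument-list call to subprocess.run instead of shell=True, urllib.parse.urljoin instead of os.path.join for raw GitHub URLs, and a length check before assigning a new DataFrame column.

import subprocess
from urllib.parse import urljoin

import pandas as pd

# Argument list instead of shell=True: no shell re-parses the command string.
subprocess.run(["git", "--version"], check=True)

# urljoin keeps forward slashes on every platform, unlike os.path.join.
base = "https://raw.githubusercontent.com/owner/repo/main/"  # placeholder URL
url = urljoin(base, "requirements.txt")

# When assigning a list as a new column, its length must match the row count.
df = pd.DataFrame({"instance_id": ["a", "b", "c"]})
difficulties = [1, 2, 3]
assert len(difficulties) == len(df)
df.loc[:, "difficulty"] = difficulties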

Workflow ID: wflow_XyI8KJdp6my1PswB


You can customize Ellipsis with 👍 / 👎 feedback, review rules, user-specific overrides, quiet mode, and more.

@ErikBjare (Owner) left a comment

Looks nice, want to get this semi-working, then merged in some shape (doesn't have to be perfect), and then I'll refine it from there in a new PR.

Could you provide the missing things? :)

from pathlib import Path
import glob
import os
from gptme.eval.agents.swebench import SWEBenchAgent
@ErikBjare (Owner):

Can I see this plz? :)

@bjsi (Contributor, Author) replied:

It was quite janky lol

import json
import logging
import os
from pathlib import Path
import time
import uuid

from gptme.cli import get_name
from gptme.dirs import get_logs_dir
from gptme.eval.agents.fix import Fix
from gptme.eval.agents.reproduce import Reproduce
from gptme.eval.agents.understand import Understand
from gptme.eval.swe_extra.swe_bench_test_spec import instance_to_trajectory_info, make_test_spec
from gptme.llm import set_stop
from gptme.logmanager import LogManager, SWEBenchInfo 
from gptme.message import print_msg
from gptme.tools import execute_msg, init_tools
from gptme.tools.read import reset_file_read_cache, save_file_read_cache
from swebench.harness.constants import SWEbenchInstance

logger = logging.getLogger(__name__)

class SWEBenchAgent:
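    """Agent that runs the understand, reproduce, fix cycle for a single SWE-bench instance."""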
    stages = ["understand", "reproduce", "fix"]
    def act(
        self,
        model: str,
        instance: SWEbenchInstance,
        repo_dir: str,
        log_dir: str,
        resume: bool = False,
        start_stage: str = "understand",
        **kwargs
    ):
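        """Run the stages from start_stage onward and return the collected artifacts."""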

        # Initialize or load trajectory info
        trajectory_info = instance_to_trajectory_info(
            instance, 
            model, 
            repo_dir=repo_dir,
            log_dir=log_dir if resume else None
        )
        
        if not resume:
            trajectory_info.save_to_log_dir(log_dir)
            
        # Understand
        if self.stages.index(start_stage) <= self.stages.index("understand"): 
            Understand().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("understand", {}))

        set_stop(["</planning>"])

        # Reproduce
        if self.stages.index(start_stage) <= self.stages.index("reproduce"):
            Reproduce().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("reproduce", {}))

        set_stop(["</planning>"])
        # Fix
        if self.stages.index(start_stage) <= self.stages.index("fix"):
            Fix().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("fix", {}))
            
        # reset_file_read_cache() # maybe remove
        return trajectory_info.artifacts
    
    def get_resume_stage(self, log_dir: str) -> str:
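        """Return the first stage whose log branch has no messages yet (defaults to 'understand')."""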
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        if not understand_manager.log.messages:
            return "understand"
        elif not reproduce_manager.log.messages:
            return "reproduce"
        elif not fix_manager.log.messages:
            return "fix"
        return "understand"
    
    def replay(self, log_dir: str):
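        """Re-execute the assistant messages saved on the understand/reproduce/fix branches to restore repo state."""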
        logger.info(f"Replaying from log directory: {log_dir}")
        info = SWEBenchInfo.load_from_log_dir(log_dir)
        os.chdir(info.repo_dir)
        init_tools()
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        for msg in understand_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        files = {}
        save_file_read_cache(ignore_files=["understanding.md", "read_cache.json"])
        read_file_json = Path(info.repo_dir) / "read_cache.json"
        with open(read_file_json, "r") as f:
            files.update({"read_cache.json": json.load(f)})
        info.artifacts.update(files)
        info.save_to_log_dir(log_dir)
        for msg in reproduce_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        for msg in fix_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)

    def evaluate_instance(
        self,
        instance: SWEbenchInstance,
        model: str = "openrouter/qwen/qwen-2.5-coder-32b-instruct",
        resume_dir: Path | None = None,
        **kwargs
    ):
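        """Set up (or resume) the repo for an instance, run the agent, then evaluate the result."""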
        instance_id = instance["instance_id"]
        problem_statement = instance["problem_statement"]
        info = SWEBenchInfo.load_from_log_dir(resume_dir) if resume_dir else None
        if resume_dir and not info:
            raise ValueError(f"No info found in {resume_dir}")

        test_spec = make_test_spec(instance, info.repo_dir if info else None)

        logger.info(f"Evaluating instance: {instance_id}")
        logger.debug(f"Problem statement: {problem_statement}")

        if resume_dir:
            log_dir = resume_dir
            logger.info(f"Resuming from log directory: {log_dir}")
            test_spec.reset_repo()
            self.replay(log_dir)
            repo_dir = info.repo_dir
        else:
            _id = uuid.uuid4().hex[:8]
            name = get_name(f"gptme-evals-{model.replace('/', '--')}-{_id}")
            log_dir = get_logs_dir() / name
            repo_dir = test_spec.setup_repo()

        start_time = time.time()
        try:
            logger.info(f"Executing agent for instance {instance_id}")
            logger.info(f"Setting up repo for instance {instance_id}")
            logger.info(f"Finished setting up repo for instance {instance_id} {repo_dir}")
            
            SWEBenchAgent().act(
                model=model, 
                instance=instance, 
                repo_dir=repo_dir, 
                log_dir=log_dir,
                resume=bool(resume_dir),
                start_stage=self.get_resume_stage(log_dir) if resume_dir else "understand",
                **kwargs
            )
            
            gen_time = time.time() - start_time
            logger.info(
                f"Agent execution completed for instance {instance_id} in {gen_time:.2f} seconds"
            )
            passed = test_spec.eval_repo()
            logger.info(f"Evaluation completed for instance {instance_id}. Passed: {passed}")
        except Exception as e:
            import traceback
            logger.error(f"Error during agent execution for instance {instance_id}: {e}\n{''.join(traceback.format_tb(e.__traceback__))}")

from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_instance_by_id, load_top_50_easiest_task_instances
from gptme.dirs import get_logs_dir
from gptme.logmanager import LogManager, SWEBenchInfo
@ErikBjare (Owner):

And SWEBenchInfo?

@bjsi (Contributor, Author) replied:

import json
from dataclasses import dataclass, field
from datetime import datetime
from os import PathLike
from pathlib import Path
from typing import Optional

@dataclass(frozen=True)
class SWEBenchInfo:
    instance_id: str
    model_name: str
    target: bool  # Changed from int to bool to match dataset format
    exit_status: str | None = None
    generated_patch: str | None = None
    eval_logs: str | None = None
    artifacts: dict[str, str] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.now)
    repo_dir: str | None = None

    def to_dict(self) -> dict:
        """Convert to dictionary format matching SWE-bench dataset."""
        return {
            "instance_id": self.instance_id,
            "model_name": self.model_name,
            "target": self.target,
            "exit_status": self.exit_status,
            "generated_patch": self.generated_patch,
            "eval_logs": self.eval_logs,
            "timestamp": self.timestamp.isoformat(),
            "artifacts": self.artifacts,
            "repo_dir": self.repo_dir,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "SWEBenchInfo":
        """Create from dictionary, handling optional fields."""
        # Convert timestamp string to datetime if present
        if "timestamp" in d:
            d = d.copy()  # Make a copy to avoid modifying input
            d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)
    
    @classmethod
    def load_from_log_dir(cls, log_dir: PathLike) -> Optional["SWEBenchInfo"]:
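        """Load run info from swe_bench_info.json in log_dir, or return None if the file is missing."""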
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        if not swe_bench_info_file.exists():
            return None
        with swe_bench_info_file.open() as f:
            return cls.from_dict(json.load(f))
    
    def save_to_log_dir(self, log_dir: PathLike) -> None:
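        """Write run info to swe_bench_info.json in log_dir, creating parent directories as needed."""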
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        swe_bench_info_file.parent.mkdir(parents=True, exist_ok=True)
        with swe_bench_info_file.open("w") as f:
            json.dump(self.to_dict(), f, indent=2)

@bjsi (Contributor, Author) commented on Jan 29, 2025

I was using the branch system to save each step in the understand, reproduce, fix cycle separately, so I could restore from checkpoints.

@ErikBjare (Owner) commented:

@bjsi Nice! Was the branch system a mess or did it make sense to you/work well?

@bjsi (Contributor, Author) commented on Jan 29, 2025

@ErikBjare I didn't do extensive testing, but it seemed to work pretty well. I didn't notice any bugs.
