add swe bench / swe extra setup scripts #424
base: master
Conversation
👍 Looks good to me! Reviewed everything up to 49cc0d7 in 1 minute and 26 seconds
More details
- Looked at 771 lines of code in 5 files
- Skipped 0 files when reviewing.
- Skipped posting 8 drafted comments based on config settings.
1. gptme/eval/swe_extra/swe_bench_test_spec.py:128
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
While using shell=True can indeed be a security risk, this code is part of a test harness that already needs to execute arbitrary shell commands to run tests. The scripts being run are constructed from trusted sources (git commands, test frameworks, etc). Converting these to use argument lists would be complex and wouldn't meaningfully improve security since the code needs to execute shell commands anyway. The comment is technically correct but not practically useful in this context.
I might be underestimating the security risk. Even in a test environment, command injection could potentially be dangerous if the input data is not fully trusted.
The code is specifically designed to run arbitrary test commands in a controlled environment. The security risk from shell=True is minimal compared to the inherent risk of running tests, and making this change would add complexity without meaningful benefit.
While technically correct, this comment suggests a change that would add complexity without meaningful security benefits in this specific context. The comment should be removed.
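For illustration, a minimal sketch of the argument-list form the drafted comments refer to; the command and path below are hypothetical, not taken from swe_bench_test_spec.py:

import subprocess

repo_dir = "/tmp/example-repo"  # hypothetical path, for illustration only

# shell=True hands the whole string to /bin/sh, so anything untrusted in it
# can inject extra commands via shell metacharacters.
subprocess.run(f"git -C {repo_dir} status", shell=True, check=True)

# The argument-list form passes each argument directly to the program,
# with no shell in between to interpret metacharacters.
subprocess.run(["git", "-C", repo_dir, "status"], check=True)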
2. gptme/eval/swe_extra/swe_bench_test_spec.py:134
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Marked as duplicate.
3. gptme/eval/swe_extra/swe_bench_test_spec.py:140
- Draft comment:
Using subprocess.run with shell=True can be a security risk. Consider using a list of arguments and avoiding shell=True if possible.
- Reason this comment was not posted:
Marked as duplicate.
4. gptme/eval/swe_extra/swe_bench_extra_data.py:115
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
5. gptme/eval/swe_extra/swe_bench_extra_data.py:116
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
6. gptme/eval/swe_extra/swe_bench_extra_data.py:117
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
7. gptme/eval/swe_extra/swe_bench_extra_data.py:118
- Draft comment:
Ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
- Reason this comment was not posted:
Confidence changes required: 50%
The code uses pandas.DataFrame.loc to assign new columns, which is correct. However, it is important to ensure that the lengths of the lists being assigned match the number of rows in the DataFrame to avoid potential errors.
8. gptme/eval/swe_extra/swe_bench_test_spec.py:37
- Draft comment:
Using os.path.join for URLs can lead to incorrect URL formation. Consider using string concatenation or urllib.parse.urljoin for URLs.
- Reason this comment was not posted:
Decided after close inspection that this draft comment was likely wrong and/or not actionable:
While technically correct that os.path.join isn't ideal for URLs (especially on Windows where it uses backslashes), in this specific case it's likely working fine since:
- The URLs are GitHub raw URLs which use forward slashes
- The code is clearly working as evidenced by the requests.get() calls
- The paths being joined are simple segments without special characters
- The code is part of a test harness, not production code
The comment raises a valid technical point - os.path.join can cause issues with URLs, especially on Windows systems. This could potentially break in some environments.
While technically correct, the current implementation appears to be working fine for its purpose in a test harness. The benefit of changing it doesn't outweigh the cost given the low risk in this specific context.
Delete the comment. While technically correct, the current implementation is working fine for its purpose and the suggested change would add complexity without significant benefit in this context.
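For reference, a small sketch of the difference this drafted comment is about; the base URL and file name are illustrative, not the ones used in swe_bench_test_spec.py:

import os.path
from urllib.parse import urljoin

base = "https://raw.githubusercontent.com/owner/repo/main"

# os.path.join uses the OS path separator, so on Windows it can put
# backslashes inside the URL.
print(os.path.join(base, "requirements.txt"))

# urljoin always joins with "/"; note the base needs a trailing slash,
# otherwise the last path segment is replaced rather than extended.
print(urljoin(base + "/", "requirements.txt"))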
Workflow ID: wflow_XyI8KJdp6my1PswB
Looks nice, want to get this semi-working, then merged in some shape (doesn't have to be perfect), and then I'll refine it from there in a new PR.
Could you provide the missing things? :)
from pathlib import Path
import glob
import os
from gptme.eval.agents.swebench import SWEBenchAgent |
Can I see this plz? :)
It was quite janky lol
import json
import logging
import os
from pathlib import Path
import time
import uuid

from gptme.cli import get_name
from gptme.dirs import get_logs_dir
from gptme.eval.agents.fix import Fix
from gptme.eval.agents.reproduce import Reproduce
from gptme.eval.agents.understand import Understand
from gptme.eval.swe_extra.swe_bench_test_spec import instance_to_trajectory_info, make_test_spec
from gptme.llm import set_stop
from gptme.logmanager import LogManager, SWEBenchInfo
from gptme.message import print_msg
from gptme.tools import execute_msg, init_tools
from gptme.tools.read import reset_file_read_cache, save_file_read_cache
from swebench.harness.constants import SWEbenchInstance

logger = logging.getLogger(__name__)


class SWEBenchAgent:
    stages = ["understand", "reproduce", "fix"]

    def act(
        self,
        model: str,
        instance: SWEbenchInstance,
        repo_dir: str,
        log_dir: str,
        resume: bool = False,
        start_stage: str = "understand",
        **kwargs
    ):
        # Initialize or load trajectory info
        trajectory_info = instance_to_trajectory_info(
            instance,
            model,
            repo_dir=repo_dir,
            log_dir=log_dir if resume else None
        )
        if not resume:
            trajectory_info.save_to_log_dir(log_dir)

        # Understand
        if self.stages.index(start_stage) <= self.stages.index("understand"):
            Understand().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("understand", {}))
            set_stop(["</planning>"])

        # Reproduce
        if self.stages.index(start_stage) <= self.stages.index("reproduce"):
            Reproduce().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("reproduce", {}))
            set_stop(["</planning>"])

        # Fix
        if self.stages.index(start_stage) <= self.stages.index("fix"):
            Fix().act(model=model, instance=instance, repo_dir=repo_dir, log_dir=log_dir, info=trajectory_info, **kwargs.get("fix", {}))

        # reset_file_read_cache()  # maybe remove
        return trajectory_info.artifacts

    def get_resume_stage(self, log_dir: str) -> str:
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        if not understand_manager.log.messages:
            return "understand"
        elif not reproduce_manager.log.messages:
            return "reproduce"
        elif not fix_manager.log.messages:
            return "fix"
        return "understand"

    def replay(self, log_dir: str):
        logger.info(f"Replaying from log directory: {log_dir}")
        info = SWEBenchInfo.load_from_log_dir(log_dir)
        os.chdir(info.repo_dir)
        init_tools()
        understand_manager = LogManager.load(log_dir, lock=False, create=True, branch="understand")
        reproduce_manager = LogManager.load(log_dir, lock=False, create=True, branch="reproduce")
        fix_manager = LogManager.load(log_dir, lock=False, create=True, branch="fix")
        for msg in understand_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        files = {}
        save_file_read_cache(ignore_files=["understanding.md", "read_cache.json"])
        read_file_json = Path(info.repo_dir) / "read_cache.json"
        with open(read_file_json, "r") as f:
            files.update({"read_cache.json": json.load(f)})
        info.artifacts.update(files)
        info.save_to_log_dir(log_dir)
        for msg in reproduce_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)
        for msg in fix_manager.log.messages:
            if msg.role == "assistant":
                for reply_msg in execute_msg(msg, lambda _: True):
                    print_msg(reply_msg, oneline=False)

    def evaluate_instance(
        self,
        instance: SWEbenchInstance,
        model: str = "openrouter/qwen/qwen-2.5-coder-32b-instruct",
        resume_dir: Path | None = None,
        **kwargs
    ):
        instance_id = instance["instance_id"]
        problem_statement = instance["problem_statement"]
        info = SWEBenchInfo.load_from_log_dir(resume_dir) if resume_dir else None
        if resume_dir and not info:
            raise ValueError(f"No info found in {resume_dir}")
        test_spec = make_test_spec(instance, info.repo_dir if info else None)
        logger.info(f"Evaluating instance: {instance_id}")
        logger.debug(f"Problem statement: {problem_statement}")
        if resume_dir:
            log_dir = resume_dir
            logger.info(f"Resuming from log directory: {log_dir}")
            test_spec.reset_repo()
            self.replay(log_dir)
            repo_dir = info.repo_dir
        else:
            _id = uuid.uuid4().hex[:8]
            name = get_name(f"gptme-evals-{model.replace('/', '--')}-{_id}")
            log_dir = get_logs_dir() / name
            repo_dir = test_spec.setup_repo()
        start_time = time.time()
        try:
            logger.info(f"Executing agent for instance {instance_id}")
            logger.info(f"Setting up repo for instance {instance_id}")
            logger.info(f"Finished setting up repo for instance {instance_id} {repo_dir}")
            SWEBenchAgent().act(
                model=model,
                instance=instance,
                repo_dir=repo_dir,
                log_dir=log_dir,
                resume=bool(resume_dir),
                start_stage=self.get_resume_stage(log_dir) if resume_dir else "understand",
                **kwargs
            )
            gen_time = time.time() - start_time
            logger.info(
                f"Agent execution completed for instance {instance_id} in {gen_time:.2f} seconds"
            )
            passed = test_spec.eval_repo()
            logger.info(f"Evaluation completed for instance {instance_id}. Passed: {passed}")
        except Exception as e:
            import traceback
            logger.error(f"Error during agent execution for instance {instance_id}: {e}\n{''.join(traceback.format_tb(e.__traceback__))}")
from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_instance_by_id, load_top_50_easiest_task_instances
from gptme.dirs import get_logs_dir
from gptme.logmanager import LogManager, SWEBenchInfo |
And SWEBenchInfo?
from dataclasses import dataclass, field
from datetime import datetime
from os import PathLike
from pathlib import Path
from typing import Optional
import json


@dataclass(frozen=True)
class SWEBenchInfo:
    instance_id: str
    model_name: str
    target: bool  # Changed from int to bool to match dataset format
    exit_status: str | None = None
    generated_patch: str | None = None
    eval_logs: str | None = None
    artifacts: dict[str, str] = field(default_factory=dict)
    timestamp: datetime = field(default_factory=datetime.now)
    repo_dir: str | None = None

    def to_dict(self) -> dict:
        """Convert to dictionary format matching SWE-bench dataset."""
        return {
            "instance_id": self.instance_id,
            "model_name": self.model_name,
            "target": self.target,
            "exit_status": self.exit_status,
            "generated_patch": self.generated_patch,
            "eval_logs": self.eval_logs,
            "timestamp": self.timestamp.isoformat(),
            "artifacts": self.artifacts,
            "repo_dir": self.repo_dir,
        }

    @classmethod
    def from_dict(cls, d: dict) -> "SWEBenchInfo":
        """Create from dictionary, handling optional fields."""
        # Convert timestamp string to datetime if present
        if "timestamp" in d:
            d = d.copy()  # Make a copy to avoid modifying input
            d["timestamp"] = datetime.fromisoformat(d["timestamp"])
        return cls(**d)

    @classmethod
    def load_from_log_dir(cls, log_dir: PathLike) -> Optional["SWEBenchInfo"]:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        if not swe_bench_info_file.exists():
            return None
        with swe_bench_info_file.open() as f:
            return cls.from_dict(json.load(f))

    def save_to_log_dir(self, log_dir: PathLike) -> None:
        log_dir = Path(log_dir)
        swe_bench_info_file = log_dir / "swe_bench_info.json"
        swe_bench_info_file.parent.mkdir(parents=True, exist_ok=True)
        json.dump(self.to_dict(), swe_bench_info_file.open("w"), indent=2)
I was using the branch system to save each step in the understand, reproduce, fix cycle separately so I could restore from checkpoints.
@bjsi Nice! Was the branch system a mess or did it make sense to you/work well?
@ErikBjare I didn't do extensive testing, but it seemed to work pretty well. I didn't notice any bugs.
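For context, a minimal sketch of the per-stage branch checkpointing described here, reusing the LogManager.load call shown in the snippet above; the log directory below is hypothetical:

from gptme.logmanager import LogManager

log_dir = "/tmp/gptme-evals-example"  # hypothetical log directory

# Each stage of the understand -> reproduce -> fix cycle is kept on its own
# branch, so an interrupted run can resume at the first branch with no messages.
for stage in ["understand", "reproduce", "fix"]:
    manager = LogManager.load(log_dir, lock=False, create=True, branch=stage)
    if not manager.log.messages:
        print(f"resume from stage: {stage}")
        break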
I extracted all of the changes from my branch and dumped them in this folder. It will still require a bit of work to integrate into the main branch, but the challenging part of installing requirements and deps should be taken care of.
Important

Add SWE-bench evaluation setup with scripts for running, data handling, and test specifications in gptme/eval/swe_extra.

- run_swe_extra.py to execute SWE-bench evaluations with options to resume from previous runs and clear branches.
- swe_bench_extra_data.py for loading and analyzing SWE-bench task instances and trajectories.
- swe_bench_constants.py for requirements paths, test frameworks, and version installations.
- swe_bench_test_spec.py to create and manage test specifications for SWE-bench instances.

This description was created by Ellipsis for 49cc0d7. It will automatically update as commits are pushed.
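As a rough usage sketch of how these pieces fit together; the exact signature of load_top_50_easiest_task_instances is not shown in this PR, so the call below is an assumption based on the imports quoted above:

from gptme.eval.agents.swebench import SWEBenchAgent
from gptme.eval.swe_extra.swe_bench_extra_data import load_top_50_easiest_task_instances

# Assumed to return a list of SWEbenchInstance-style dicts, easiest first.
instances = load_top_50_easiest_task_instances()

# evaluate_instance() sets up the repo, runs the understand/reproduce/fix
# stages, and evaluates the resulting patch (see SWEBenchAgent above).
SWEBenchAgent().evaluate_instance(
    instances[0],
    model="openrouter/qwen/qwen-2.5-coder-32b-instruct",
)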