Commit c876c44

initial swe agent
1 parent 5401eda commit c876c44

9 files changed: +2137 -0 lines changed

9 files changed

+2137
-0
lines changed
Lines changed: 145 additions & 0 deletions
@@ -0,0 +1,145 @@
# SWE Benchmark Agent

## Overview

This agent tackles software engineering problems from two benchmarks: SWE-bench and TerminalBench.

## Agent Details

| Feature | Description |
| --- | --- |
| **Interaction Type** | Autonomous |
| **Complexity** | Advanced |
| **Agent Type** | Single Agent |
| **Components** | Tools: Shell |
| **Vertical** | Software Engineering |

### Agent Architecture

The SWE Benchmark Agent uses an orchestrator pattern with four components:
- **Orchestrator**: Manages the agent lifecycle and coordinates tool execution
- **Environment**: Docker-based isolated execution environment (`SWEBenchEnvironment` or `TerminalBenchEnvironment`)
- **Tools**: File operations (read, edit, create), shell commands, and submission
- **Agent**: LLM-powered agent (Gemini) with built-in planner and thinking capabilities

The agent operates autonomously within the Docker environment, using shell commands and file operations to solve software engineering tasks.

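The control flow described above can be illustrated with a minimal sketch. It is not the repository's implementation: `FakeEnvironment`, `run_orchestrator`, and `scripted_policy` are hypothetical names, the environment keeps files in memory instead of in a Docker container, and a fixed script stands in for the LLM.

```python
from dataclasses import dataclass, field


@dataclass
class FakeEnvironment:
    """Hypothetical stand-in for a Docker-backed environment (files live in memory)."""

    files: dict = field(default_factory=dict)

    def read_file(self, path: str) -> str:
        return self.files.get(path, "")

    def create_file(self, path: str, content: str) -> str:
        self.files[path] = content
        return f"created {path}"


def run_orchestrator(env, policy, max_steps=10):
    """Ask the policy for an action each turn and dispatch it to a tool until it submits."""
    tools = {"read_file": env.read_file, "create_file": env.create_file}
    history = []
    for _ in range(max_steps):
        name, args = policy(history)
        if name == "submit":  # the submission tool ends the episode
            break
        history.append((name, tools[name](**args)))
    return history


def scripted_policy(history):
    """Stand-in for the LLM agent: replays a fixed list of actions, one per turn."""
    script = [
        ("create_file", {"path": "fix.py", "content": "x = 1\n"}),
        ("read_file", {"path": "fix.py"}),
        ("submit", {}),
    ]
    return script[len(history)]


env = FakeEnvironment()
print(run_orchestrator(env, scripted_policy))
```

Keeping the tools behind a name-to-callable mapping means the loop does not change when a tool is added, and the `max_steps` cap bounds an episode that never submits.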
## Setup and Installation

1. **Prerequisites**

    * Python 3.10 or later (below 3.14)
    * uv
        * For dependency management and packaging. Please follow the
          instructions on the official
          [uv website](https://docs.astral.sh/uv/) for installation.

        ```bash
        curl -LsSf https://astral.sh/uv/install.sh | sh
        ```

    * A project on Google Cloud Platform
    * Google Cloud CLI
        * For installation, please follow the instructions on the official
          [Google Cloud website](https://cloud.google.com/sdk/docs/install).

2. **Installation**

    ```bash
    # Clone this repository.
    git clone https://github.com/google/adk-samples.git
    cd adk-samples/python/agents/swe-benchmark-agent
    # Install the package and dependencies.
    uv sync
    ```

3. **Configuration**

    * Set up Google Cloud credentials.

    * You may set the following environment variables in your shell, or in
      a `.env` file instead.

    ```bash
    export GOOGLE_GENAI_USE_VERTEXAI=true
    export GOOGLE_CLOUD_PROJECT=<your-project-id>
    export GOOGLE_CLOUD_LOCATION=<your-project-location>
    ```
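
A missing variable is cheaper to catch before a long evaluation run than during it. The following sketch checks the three variables listed above. `missing_settings` is a hypothetical helper, not part of this package, and it only inspects a mapping such as `os.environ`; it does not read a `.env` file.

```python
import os

# The three variables named in the configuration step above.
REQUIRED_VARS = (
    "GOOGLE_GENAI_USE_VERTEXAI",
    "GOOGLE_CLOUD_PROJECT",
    "GOOGLE_CLOUD_LOCATION",
)


def missing_settings(env=None):
    """Return the required variable names that are unset or blank in env."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


# Example with an explicit mapping: only the project is configured.
print(missing_settings({"GOOGLE_CLOUD_PROJECT": "demo-project"}))
# prints ['GOOGLE_GENAI_USE_VERTEXAI', 'GOOGLE_CLOUD_LOCATION']
```

Calling `missing_settings()` with no argument checks the current process environment.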


## Running Tests

To run the tests and evaluation, first install the development dependencies:

```bash
uv sync --dev
```

Then run them from the `swe-benchmark-agent` directory with `pytest`:

```bash
uv run pytest tests
```

## Running Evaluations

The SWE Benchmark Agent can be evaluated on both the SWE-bench and TerminalBench benchmarks to measure its performance on real-world software engineering tasks.

### SWE-bench Evaluation

To run evaluation on the full SWE-bench Verified dataset:

```bash
uv run python -m swe_benchmark_agent.main --full-dataset --evaluate --max-workers 4
```

To evaluate on a specific number of instances (e.g., the first 10):

```bash
uv run python -m swe_benchmark_agent.main --instance-id-or-count 10 --evaluate
```

To evaluate on a single instance:

```bash
uv run python -m swe_benchmark_agent.main --instance-id-or-count django__django-12345 --evaluate
```
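
The examples above pass either a number or an instance ID to `--instance-id-or-count`. One way such a value could be told apart is sketched below. This is an assumption about the behavior, not the code in `main.py`, and `parse_instance_selector` is a hypothetical name.

```python
def parse_instance_selector(value: str) -> dict:
    """Treat an all-digit value as a count and anything else as an instance ID."""
    text = value.strip()
    if not text:
        raise ValueError("selector must not be empty")
    if text.isdecimal():
        count = int(text)
        if count < 1:
            raise ValueError("count must be at least 1")
        return {"kind": "count", "count": count}
    return {"kind": "id", "instance_id": text}


for raw in ("10", "django__django-12345", "blind-maze-explorer-5x5"):
    print(raw, "->", parse_instance_selector(raw))
```

IDs such as `blind-maze-explorer-5x5` contain digits but are not all digits, so they fall through to the ID branch.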

### TerminalBench Evaluation

To run evaluation on the full TerminalBench core dataset:

```bash
uv run python -m swe_benchmark_agent.main --dataset terminalbench --full-dataset --evaluate --max-workers 4
```

To evaluate on a specific number of tasks (e.g., the first 5):

```bash
uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count 5 --evaluate
```

To evaluate on a single task:

```bash
uv run python -m swe_benchmark_agent.main --dataset terminalbench --instance-id-or-count blind-maze-explorer-5x5 --evaluate
```
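
The full-dataset commands pass `--max-workers 4`. How the package schedules work is not shown in this README. The sketch below only illustrates what a worker cap generally means, assuming a thread pool; `evaluate_all` and `fake_evaluate` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def evaluate_all(instance_ids, evaluate_one, max_workers=4):
    """Run evaluate_one on every ID with at most max_workers running at once."""
    ids = list(instance_ids)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in input order, so zip pairs them correctly.
        return dict(zip(ids, pool.map(evaluate_one, ids)))


def fake_evaluate(instance_id):
    # Hypothetical stand-in: a real run would start a container and score the result.
    return instance_id.endswith("-ok")


print(evaluate_all(["a-ok", "b-fail", "c-ok"], fake_evaluate, max_workers=2))
```

Because each task runs in its own container, a sensible worker count depends on the CPU, memory, and disk available to Docker on the host.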

### Evaluation Results

The following table shows the performance of different Gemini models on SWE-bench Verified and TerminalBench:

| Model | SWE-bench Verified | TerminalBench |
|-------|--------------------|---------------|
| Gemini 2.5 Flash | 54% | 23.75% |
| Gemini 2.5 Pro | 65.6% | 30% |
| Gemini 2.5 Flash Preview (09/25) | 59% | 32.5% |
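
The percentages can be converted into task counts once the benchmark sizes are fixed. The sketch below assumes 500 instances for SWE-bench Verified and 80 tasks for the TerminalBench core set. Neither size is stated in this README, although the TerminalBench percentages all come out as whole numbers of tasks with a size of 80.

```python
# Assumed benchmark sizes (not stated in this README).
BENCHMARK_SIZES = {"SWE-bench Verified": 500, "TerminalBench": 80}

# Percentages copied from the table above.
SCORES = {
    "Gemini 2.5 Flash": {"SWE-bench Verified": 54.0, "TerminalBench": 23.75},
    "Gemini 2.5 Pro": {"SWE-bench Verified": 65.6, "TerminalBench": 30.0},
    "Gemini 2.5 Flash Preview (09/25)": {"SWE-bench Verified": 59.0, "TerminalBench": 32.5},
}


def solved_counts(scores, sizes):
    """Convert each percentage into a rounded number of solved tasks."""
    return {
        model: {bench: round(pct / 100 * sizes[bench]) for bench, pct in row.items()}
        for model, row in scores.items()
    }


for model, counts in solved_counts(SCORES, BENCHMARK_SIZES).items():
    print(model, counts)
```

Under these assumptions, Gemini 2.5 Pro's 65.6% corresponds to 328 of 500 SWE-bench Verified instances and its 30% to 24 of 80 TerminalBench tasks.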

## Customization

The SWE Benchmark Agent can be customized to suit your requirements. For example:

1. **Use a different model:** You can change the model used by the agent by modifying the `main.py` file.
2. **Add more tools:** You can add more tools to the agent to give it more capabilities.
3. **Support more benchmarks:** You can add support for more benchmarks by creating a new environment and updating the `main.py` file.
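
As an illustration of the second option, a tool can be written as a plain Python function with type hints and a docstring, which ADK can wrap as a function tool. The sketch below is a hypothetical `count_lines` tool. How this repository registers its tools is not shown here, so the registration step is left out, and the function reads from the local filesystem rather than from the Docker environment.

```python
from pathlib import Path


def count_lines(path: str) -> dict:
    """Count the lines in a text file.

    Args:
        path: Path of the file to inspect.

    Returns:
        A dict with "status" and either "line_count" or "error".
    """
    try:
        text = Path(path).read_text(encoding="utf-8")
    except (OSError, UnicodeDecodeError) as exc:
        # Report failures as data so the agent can react instead of crashing.
        return {"status": "error", "error": str(exc)}
    return {"status": "ok", "line_count": len(text.splitlines())}
```

Returning a dictionary with an explicit status gives the model a structured result to reason about on both success and failure.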
Lines changed: 43 additions & 0 deletions
@@ -0,0 +1,43 @@
[project]
name = "swe-benchmark-agent"
version = "0.1.0"
description = "A software engineering agent for SWE-bench and Terminal-Bench benchmarks using Google ADK"
authors = [{ name = "Utsav Garg", email = "[email protected]" }]
license = "Apache-2.0"
readme = "README.md"
requires-python = ">=3.10,<3.14"
classifiers = [
    "Development Status :: 3 - Alpha",
    "Intended Audience :: Developers",
    "Programming Language :: Python :: 3.10",
    "Programming Language :: Python :: 3 :: Only",
]

dependencies = [
    "swebench @ git+https://github.com/swe-bench/[email protected]#egg=swebench",
    "typer>=0.19.2",
    "datasets>=4.2.0",
    "jinja2>=3.1.5",
    "GitPython>=3.1.45",
    "docker>=7.1.0",
    "google-adk~=1.10.0",
    "pyyaml>=6.0.2",
    "python-dotenv>=1.0.1",
]

[dependency-groups]
dev = [
    "pytest>=8.4.2",
    "pytest-asyncio>=0.23.0",
]

[build-system]
requires = ["uv_build>=0.8.14,<0.9.0"]
build-backend = "uv_build"

[tool.uv.build-backend]
module-root = ""

[tool.pytest.ini_options]
pythonpath = "."
asyncio_default_fixture_loop_scope = "function"
Lines changed: 21 additions & 0 deletions
@@ -0,0 +1,21 @@
# Copyright 2025 Google LLC
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""SWE Agent - Software Engineering Agent for benchmark evaluation.

This package provides a sophisticated agent for solving software engineering
tasks from SWE-bench and Terminal-Bench benchmarks using Google ADK.
"""

__version__ = "0.1.0"
