[Spelunker] Setting up an evaluation framework for Spelunker (#741)
Now all I have to do is come up with 10 test questions and score each of the 426 chunks against each of them. :-)
1 parent e7ab730, commit cb6ec74. Showing 98 changed files with 17,931 additions and 0 deletions.
# Instructions for running evals

These instructions are biased towards Linux.
For Windows I may have to adjust some of the code
(especially shell commands).

For more about the design of evaluations, see `evals/design.md`.

This uses the running example of using the dispatcher package
as sample code.

We use `evals/eval-1` as the directory to hold all eval data.
(This happens to be the default built into the scripts.)

## 1. Copy source files to EVALDIR (`evals/eval-1`)

Assume the TypeAgent root is `~/TypeAgent`. Adjust to taste.

```shell
$ mkdir evals/eval-1
$ mkdir evals/eval-1/source
$ cp -r ~/TypeAgent/ts/packages/dispatcher evals/eval-1/source/
$ rm -rf evals/eval-1/source/dispatcher/{dist,node_modules}
$ rm -rf evals/eval-1/source/dispatcher/package.json
$
```

We delete `dist` and `node_modules` to save space (Spelunker ignores them).
We remove `package.json` since otherwise the repo policy test fails.

## 2. Run Spelunker over the copied sources

NOTE: You must exit the CLI by hitting `^C`.

```shell
$ cd ~/TypeAgent/ts
$ pnpm run cli interactive
...
{{...}}> @config request spelunker
Natural langue request handling agent is set to 'spelunker'
{{<> SPELUNKER}}> .focus ~/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
Focus set to /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
{{<> SPELUNKER}}> Summarize the codebase
...
(an enormous grinding of gears)
...
The codebase consists of various TypeScript files that define interfaces, functions, classes, and tests for a system that handles commands, agents, and translations. Here is a summary of the key components:
...
References: ...
{{<> SPELUNKER}}> ^C
ELIFECYCLE Command failed with exit code 130.
ELIFECYCLE Command failed with exit code 130.
$
```

This leaves the data in the database `~/.typeagent/agents/spelunker/codeSearchDatabase.db`.

## 3. Initialize the eval database

You can do this multiple times, but once you've started scoring (#4 below),
it will erase the scores you've already entered. (TODO: preserve scores.)

```shell
$ python3 ./evals/src/evalsetup.py --overwrite
Prefix: /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/
Database: evals/eval-1/eval.db
...
(More logging)
...
$
```

## 4. Run the manual scoring tool

You have to score separately for every test question.
I recommend no more than 10 test questions
(you have to score 426 chunks for each question).

```shell
$ python3 evals/src/evalscore.py --question "Summarize the codebase"
```

The scoring tool repeatedly presents you with a chunk of text,
prompting you to enter `0` or `1` (or `y`/`n`) for each.

Chunks are colorized and piped through `less`;
if a chunk doesn't fit on the page you'll get to page through it.
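
For reference, the colorize-and-page behavior can be reproduced roughly as follows.
This is only a sketch, assuming Pygments is installed; it is not necessarily how
`evalscore.py` implements it.

```py
# Sketch: colorize a chunk and page it through less
# (not necessarily what evalscore.py actually does).
import subprocess
from pygments import highlight
from pygments.formatters import TerminalFormatter
from pygments.lexers import TypeScriptLexer

def show_chunk(text: str) -> None:
    colored = highlight(text, TypeScriptLexer(), TerminalFormatter())
    # less -R passes ANSI color escapes through; -F exits right away if it all fits.
    subprocess.run(["less", "-R", "-F"], input=colored.encode())
```
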
## 5. Run an evaluation

TBD.

## 6. Where everything is

The `evals` directory under `spelunker` contains everything you need.

- `evals/src` contains the tools to run.
- Each separate evaluation (if you want to evaluate different codebases)
  lives in a separate subdirectory, e.g. `evals/eval-1`.
- Under `evals/eval-N` you find:
  - `eval.db` is the eval database; it contains all the eval data.
    - Tables `Files`, `Chunks`, `Blobs` are literal copies of
      the corresponding tables from the database used by the agent.
    - `Hashes` contains a mapping from chunks to hashes.
    - `Questions` contains the test questions.
    - `Scores` contains the scores for each test question.
  - `source` contains the source code of the codebase; for example:
    - `source/dispatcher` contains the (frozen) `dispatcher` package.
    - `source/README.md` describes the origin of the codebase.
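
A quick way to peek at `eval.db` (e.g. to check scoring progress) is sketched below;
it assumes only the table names listed above, not any particular columns.

```py
# Report row counts for the eval database tables listed above.
import sqlite3

con = sqlite3.connect("evals/eval-1/eval.db")
for table in ("Files", "Chunks", "Blobs", "Hashes", "Questions", "Scores"):
    (count,) = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()
    print(f"{table}: {count} rows")
con.close()
```
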
# Spelunker design notes

# Evaluation design

## Purpose of the eval

We need to be able to compare different algorithms for selecting chunks
for the oracle context. (For example, using a local fuzzy index based on
embeddings, or sending everything to a cheap model to select; we can
also evaluate prompt engineering attempts.)

Evaluating the oracle is too complicated; we assume that the key to
success is providing the right context. So we evaluate that.

The proposed evaluation approach is thus:

- Take a suitable not-too-large codebase and make a copy of it.
  (We don't want the sample codebase to vary.)
- Write some questions that are appropriate for the codebase.
  (This requires intuition, though an AI might help.)
- Chunk the codebase (fortunately, chunking is deterministic).
- **Manually** review each chunk for each question, scoring yes/no.
  (Or maybe a refined scale like irrelevant, possibly relevant,
  likely relevant, or extremely relevant.)
- Now, for each question:
  - Send it through the chosen selection process (this is variable).
  - Compare the selected chunks with the manual scores.
  - Produce an overall score from this (see the sketch below).
- Repeat the latter for different selection algorithms.
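
The "overall score" could be, for example, the F1 score mentioned under the
eval-runs table below. A minimal sketch, assuming the manual labels and the
algorithm's selection are both reduced to sets of chunk hashes:

```py
# Sketch: compare an algorithm's selection with the manual yes/no labels.
# `relevant` = chunk hashes manually scored 1 for a question;
# `selected` = chunk hashes the selection algorithm picked.
def f1_score(relevant: set[str], selected: set[str]) -> float:
    true_positives = len(relevant & selected)
    if not true_positives:
        return 0.0
    precision = true_positives / len(selected)
    recall = true_positives / len(relevant)
    return 2 * precision * recall / (precision + recall)
```
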
# Building the eval architecture

- Sample codebase stored in `test-data/sample-src/**/*.{py,ts}`
  - Permanently chunked into database at `test-data/evals.db`
  - Same schema as main db
  - Don't bother with summaries (they're probably totally obsolete)
- There's an extra table with identifying info about the sample
  codebase (so we can move the whole sqlite db elsewhere).
- There's also an extra table giving the sample questions for this
  sample codebase.
- Then there's a table giving the manual scores for each chunk
  and each sample question. (We may have to version this too somehow.)
- Finally there's a table of eval runs, each identifying the algorithm,
  its version, head commit, start time, end time,
  whether it completed without failure, and F1 score.
  (We keep all runs, for historical comparisons.)
- We should add the hash of the chunk (including filename,
  but excluding chunk IDs) so we can re-chunk the code and not lose
  the already scored chunks.
  Should lineno be included in this hash? Yes if we only expect changes
  to the chunking algorithm, no if we expect extensive edits to the
  sample codebase.
- Must support multiple codebases, with separate lists of questions.
  (Or maybe just separate directories with their own database?)

## Exact schema design

For now we assume a single codebase (maybe mixed-language).

We keep the tables `Files`, `Chunks` and `Blobs` **unchanged** from
the full spelunker app. We run Spelunker with one question over our
sample codebase and copy the database files to the eval directory.
We then add more tables for the following purposes:

- Eval info (start timestamp, end scoring timestamp,
  free format notes).
- Table mapping hash to chunk ID.
- Table of test questions, with unique ID (1, 2, etc.).
- Table for manual scores: (question ID, chunk hash, score (0/1), timestamp).
- Table for eval runs (schema TBD).
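
A hypothetical sketch of what those added tables could look like. The actual
schema created by `evalsetup.py` may differ; every column name here is
illustrative only, and `EvalInfo` is an assumed table name.

```py
# Illustrative schema only -- column names are assumptions, not the real schema.
import sqlite3

schema = """
CREATE TABLE IF NOT EXISTS EvalInfo (startTimestamp TEXT, endScoringTimestamp TEXT, notes TEXT);
CREATE TABLE IF NOT EXISTS Hashes (chunkHash TEXT PRIMARY KEY, chunkId TEXT);
CREATE TABLE IF NOT EXISTS Questions (questionId INTEGER PRIMARY KEY, questionText TEXT);
CREATE TABLE IF NOT EXISTS Scores (
    questionId INTEGER,
    chunkHash TEXT,
    score INTEGER,      -- 0 or 1
    timestamp TEXT,
    PRIMARY KEY (questionId, chunkHash)
);
-- Table for eval runs: schema TBD.
"""
sqlite3.connect("evals/eval-1/eval.db").executescript(schema)
```
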
### Hashing chunks

We hash chunks so that we can keep the scores and eval run results
even when the sample codebase is re-chunked for some reason
(which assigns new chunk IDs everywhere).

However, sometimes unrelated chunks have the same text, e.g.

```py
def __len__(self):
    return len(self._data)
```

occurring in multiple classes written using the same design pattern.

How to disambiguate? I propose to include the names of the containing
chunk(s), up to the module root, and also the full filename of the
file where the chunk occurs (in case there are very similar files).

So, for example, the input to the hash function could be:

```
# <filename>
# <great-grandparent> <grandparent> <parent>
<text of chunk>
```

MD5 is a fine, fast hash function, since we're not worried about crypto.
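
Put together, the hash computation could look something like this; a sketch
following the format proposed above, not necessarily what the tools implement.

```py
# Sketch: hash a chunk together with its filename and containing scopes,
# following the "# <filename> / # <ancestors> / <text>" format above.
import hashlib

def chunk_hash(filename: str, ancestors: list[str], text: str) -> str:
    payload = f"# {filename}\n# {' '.join(ancestors)}\n{text}"
    # MD5 is fine here; we only need a stable identifier, not crypto strength.
    return hashlib.md5(payload.encode("utf-8")).hexdigest()
```
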
## Manual scoring tool

This is a tedious task, so we want to make its UI ergonomic.
Should it use a web UI or the command line?

It should be safe to stop, and to resume at that point later.

It should be possible to remove certain entries to be redone.
(Maybe just delete rows manually using the sqlite3 command.)

Basically, in a loop (sketched below):

- If the chunk has already been scored for all questions, skip it.
- Display the chunk (mostly from its blobs), probably using `less`.
- For each question that hasn't been scored yet:
  ask for yes/no, corresponding to "should it be included in the
  oracle context".
- As soon as you answer, it moves to the next chunk.
- Record scores in the database immediately, with chunk hash and timestamp.
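
A minimal sketch of that loop, reusing the illustrative schema sketched earlier;
the real `evalscore.py` will differ in details such as column names and how the
chunk is displayed, and `display_chunk()` is a hypothetical helper.

```py
# Sketch of the manual scoring loop; table/column names follow the
# illustrative schema above, not the real tool's schema.
import sqlite3

def display_chunk(chunk_hash: str) -> None:
    ...  # e.g. reassemble the chunk's blobs and page them through less

con = sqlite3.connect("evals/eval-1/eval.db")
questions = con.execute("SELECT questionId, questionText FROM Questions").fetchall()
for (chunk_hash,) in con.execute("SELECT chunkHash FROM Hashes").fetchall():
    for qid, qtext in questions:
        done = con.execute(
            "SELECT 1 FROM Scores WHERE questionId = ? AND chunkHash = ?",
            (qid, chunk_hash),
        ).fetchone()
        if done:
            continue  # Already scored for this question; skip.
        display_chunk(chunk_hash)
        answer = input(f"{qtext!r}: include this chunk? [y/n] ")
        score = 1 if answer.strip().lower().startswith("y") else 0
        # Record immediately so the tool can be stopped and resumed safely.
        con.execute(
            "INSERT INTO Scores (questionId, chunkHash, score, timestamp) "
            "VALUES (?, ?, ?, datetime('now'))",
            (qid, chunk_hash, score),
        )
        con.commit()
```
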
Once we've scored all chunks (hopefully not more than a few hundred)
we can move on to the next stage, running the evals.

## Automatic eval runs

TBD

# Random notes

Scoring tool needs more UI. Maybe use colors? (Code colorization, even?)
# Source notes

This is copied from `TypeAgent/ts/packages/dispatcher/`,
with the `dist` and `node_modules` directories omitted.

The exact revision used was e7ab730c.

--Guido, 2/20/2025
ts/packages/agents/spelunker/evals/eval-1/source/dispatcher/LICENSE (21 additions, 0 deletions)
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.