[Spelunker] Setting up an evaluation framework for Spelunker (#741)
Now all I have to do is come up with 10 test questions and score each of
them against each of the 426 chunks against it. :-)
gvanrossum-ms authored Feb 21, 2025
1 parent e7ab730 commit cb6ec74
Showing 98 changed files with 17,931 additions and 0 deletions.
109 changes: 109 additions & 0 deletions ts/packages/agents/spelunker/evals/README.md
@@ -0,0 +1,109 @@
# Instructions for running evals

These instructions are biased towards Linux.
For Windows I may have to adjust some of the code
(especially shell commands).

For more about the design of evaluations, see `evals/design.md`.

We use the `dispatcher` package as the running example of a sample codebase.

We use `evals/eval-1` as the directory to hold all eval data.
(This happens to be the default built into the scripts.)

## 1. Copy source files to EVALDIR (`evals/eval-1`)

Assume the TypeAgent root is `~/TypeAgent`. Adjust to taste.

```shell
$ mkdir evals/eval-1
$ mkdir evals/eval-1/source
$ cp -r ~/TypeAgent/ts/packages/dispatcher evals/eval-1/source/
$ rm -rf evals/eval-1/source/dispatcher/{dist,node_modules}
$ rm -rf evals/eval-1/source/dispatcher/package.json
$
```

We delete `dist` and `node_modules` to save space (Spelunker ignores them).
We remove `package.json` since otherwise the Repo policy test fails.

## 2. Run Spelunker over the copied sources

NOTE: You must exit the CLI by hitting `^C`.

```shell
$ cd ~/TypeAgent/ts
$ pnpm run cli interactive
...
{{...}}> @config request spelunker
Natural language request handling agent is set to 'spelunker'
{{<> SPELUNKER}}> .focus ~/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
Focus set to /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
{{<> SPELUNKER}}> Summarize the codebase
...
(an enormous grinding of gears)
...
The codebase consists of various TypeScript files that define interfaces, functions, classes, and tests for a system that handles commands, agents, and translations. Here is a summary of the key components:
...
References: ...
{{<> SPELUNKER}}> ^C
ELIFECYCLE  Command failed with exit code 130.
 ELIFECYCLE  Command failed with exit code 130.
$
```

This leaves the data in the database `~/.typeagent/agents/spelunker/codeSearchDatabase.db`.

## 3. Initialize the eval database

You can do this multiple times, but once you've started scoring (step 4 below),
re-running it will erase the scores you've already entered. (TODO: preserve scores.)

```shell
$ python3 ./evals/src/evalsetup.py --overwrite
Prefix: /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/
Database: evals/eval-1/eval.db
...
(More logging)
...
$
```
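
For orientation, the core of what such a setup script has to do is copy the agent's
`Files`, `Chunks`, and `Blobs` tables into the eval database (see section 6 below).
Here is a minimal sketch of just that copy step — an illustration, not the actual
`evalsetup.py`; paths follow the steps above:

```py
import os
import sqlite3

# Paths as used in the steps above; adjust to taste.
AGENT_DB = os.path.expanduser("~/.typeagent/agents/spelunker/codeSearchDatabase.db")
EVAL_DB = "evals/eval-1/eval.db"

con = sqlite3.connect(EVAL_DB)
con.execute("ATTACH DATABASE ? AS agent", (AGENT_DB,))

# Copy the agent's tables verbatim into the eval database.
for table in ("Files", "Chunks", "Blobs"):
    con.execute(f"DROP TABLE IF EXISTS {table}")
    con.execute(f"CREATE TABLE {table} AS SELECT * FROM agent.{table}")

con.commit()
con.execute("DETACH DATABASE agent")
con.close()
```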

## 4. Run the manual scoring tool

You have to do a separate scoring pass for every test question.
I recommend no more than 10 test questions
(you have to score 426 chunks for each question).

```shell
$ python3 evals/src/evalscore.py --question "Summarize the codebase"
```

The scoring tool repeatedly presents you with a chunk of text,
prompting you to enter `0` or `1` (or `y`/`n`) for each.

Chunks are colorized and piped through `less`;
if a chunk doesn't fit on one screen you can page through it.
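
Under the hood, a display helper along these lines is enough for the paging
behavior (hypothetical code, not the actual `evalscore.py`):

```py
import subprocess

def show_chunk(text: str) -> None:
    """Page a chunk through less, preserving ANSI colors (-R)."""
    lines = text.splitlines()
    if lines:
        # Crude "colorization": make the first (header) line bold cyan.
        lines[0] = f"\x1b[1;36m{lines[0]}\x1b[0m"
    subprocess.run(["less", "-R"], input="\n".join(lines) + "\n", text=True)
```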

## 5. Run an evaluation

TBD.

## 6. Where everything is

The `evals` directory under `spelunker` contains everything you need.

- `evals/src` contains the tools to run.
- Each separate evaluation (if you want to evaluate different codebases)
lives in a separate subdirectory, e.g. `evals/eval-1`.
- Under `evals/eval-N` you find:
- `eval.db` is the eval database; it contains all the eval data.
- Tables `Files`, `Chunks`, `Blobs` are literal copies of
the corresponding tables from the database used by the agent.
- `Hashes` contains a mapping from chunks to hashes.
- `Questions` contains the test questions.
- `Scores` contains the scores for each test question.
- `source` contains the source code of the codebase; for example:
- `source/dispatcher` contains the (frozen) `dispatcher` package.
- `source/README.md` describes the origin of the codebase.
130 changes: 130 additions & 0 deletions ts/packages/agents/spelunker/evals/design.md
@@ -0,0 +1,130 @@
# Spelunker design notes

# Evaluation design

## Purpose of the eval

We need to be able to compare different algorithms for selecting chunks
for the oracle context. (For example, using a local fuzzy index based on
embeddings, or sending everything to a cheap model to select; we can
also evaluate prompt engineering attempts.)

Evaluating the oracle is too complicated; we assume that the key to
success is providing the right context. So we evaluate that.

The proposed evaluation approach is thus:

- Take a suitable not-too-large codebase and make a copy of it.
(We don't want the sample code base to vary.)
- Write some questions that are appropriate for the codebase.
(This requires intuition, though an AI might help.)
- Chunk the codebase (fortunately, chunking is deterministic).
- **Manually** review each chunk for each question, scoring yes/no.
(Or maybe a refined scale like irrelevant, possibly relevant,
likely relevant, or extremely relevant.)
- Now, for each question:
  - Send it through the chosen selection process (this is variable).
  - Compare the selected chunks with the manual scores.
  - Produce an overall score from this (e.g. F1; see the sketch after this list).
- Repeat the latter steps for different selection algorithms.
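
For the overall score, an obvious choice (and the one the eval-runs table
below records) is F1 over the manually labeled chunks. A minimal sketch,
assuming 0/1 manual scores keyed by chunk hash:

```py
def f1_score(selected: set[str], relevant: set[str]) -> float:
    """F1 for one question.

    selected: chunk hashes the selection algorithm picked.
    relevant: chunk hashes manually scored 1 for this question.
    """
    if not selected or not relevant:
        return 0.0
    true_pos = len(selected & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Per-question F1 values can then be averaged into a single number for a run.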

# Building the eval architecture

- Sample codebase stored in `test-data/sample-src/**/*.{py,ts}`
- Permanently chunked into database at `test-data/evals.db`
- Same schema as main db
- Don't bother with summaries (they're probably totally obsolete)
- There's an extra table with identifying info about the sample
codebase (so we can move the whole sqlite db elsewhere).
- There's also an extra table giving the sample questions for this
sample codebase.
- Then there's a table giving the manual scores for each chunk
and each sample question. (We may have to version this too somehow.)
- Finally there's a table of eval runs, each identifying the algorithm,
its version, head commit, start time, end time,
whether completed without failure, and F1 score.
(We keep all runs, for historical comparisons.)
- We should add the hash of the chunk (including filename,
but excluding chunk IDs) so we can re-chunk the code and not lose
the already scored chunks.
Should lineno be included in this table? Yes if we only expect changes
to the chunking algorithm, no if we expect extensive edits to the
sample codebase.
- Must support multiple codebases, with separate lists of questions.
(Or maybe just separate directories with their own database?)

## Exact schema design

For now we assume a single codebase (maybe mixed-language).

We keep the tables `Files`, `Chunks` and `Blobs` **unchanged** from
the full spelunker app. We run Spelunker with one question over our
sample codebase and copy the database files to the eval directory.
We then add more tables for the following purposes (a schema sketch follows the list):

- Eval info (start timestamp, end scoring timestamp,
free format notes).
- Table mapping hash to chunk ID.
- Table of test questions, with unique ID (1, 2, etc.).
- Table for manual scores: (question ID, chunk hash, score (0/1), timestamp)
- Table for eval runs (schema TBD)
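
A sketch of what these extra tables could look like in SQLite. The table names
follow the README (`Hashes`, `Questions`, `Scores`), but the columns are guesses
based on the notes above, not the schema the tools actually create:

```py
import sqlite3

# Illustrative only -- column names are guesses, not the committed schema.
EXTRA_TABLES = """
CREATE TABLE IF NOT EXISTS EvalInfo (
    startTimestamp TEXT,
    endScoringTimestamp TEXT,
    notes TEXT
);
CREATE TABLE IF NOT EXISTS Hashes (
    chunkHash TEXT PRIMARY KEY,
    chunkId TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS Questions (
    questionId INTEGER PRIMARY KEY,
    question TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS Scores (
    questionId INTEGER NOT NULL,
    chunkHash TEXT NOT NULL,
    score INTEGER NOT NULL CHECK (score IN (0, 1)),
    timestamp TEXT NOT NULL,
    PRIMARY KEY (questionId, chunkHash)
);
CREATE TABLE IF NOT EXISTS Runs (
    runId INTEGER PRIMARY KEY,
    algorithm TEXT,
    version TEXT,
    commitHash TEXT,
    startTime TEXT,
    endTime TEXT,
    completed INTEGER,
    f1 REAL
);
"""

con = sqlite3.connect("evals/eval-1/eval.db")
con.executescript(EXTRA_TABLES)
con.close()
```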

### Hashing chunks

We hash chunks so that we can keep the scores and eval run results
even when the sample codebase is re-chunked for some reason
(which assigns new chunk IDs everywhere).

However, sometimes unrelated chunks have the same text, e.g.

```py
def __len__(self):
return len(self._data)
```

occurring in multiple classes written using the same design pattern.

How to disambiguate? I propose to include the names of the containing
chunk(s), up to the module root, and also the full filename of the
file where the chunk occurs (in case there are very similar files).

So, for example, the input to the hash function could be:

```
# <filename>
# <great-grandparent> <grandparent> <parent>
<text of chunk>
```

MD5 is a fine, fast hash function, since we're not worried about crypto.
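
A sketch of the hash computation (the header format follows the example above;
the function name and the sample path are made up):

```py
import hashlib

def chunk_hash(filename: str, ancestors: list[str], chunk_text: str) -> str:
    """MD5 of the chunk text prefixed with its filename and the names of
    its containing chunks, outermost first, per the format above."""
    header = f"# {filename}\n# {' '.join(ancestors)}\n"
    return hashlib.md5((header + chunk_text).encode("utf-8")).hexdigest()

# Hypothetical example: a __len__ method inside a class Corpus.
print(chunk_hash("sample-src/corpus.py", ["Corpus"],
                 "def __len__(self):\n    return len(self._data)\n"))
```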

## Manual scoring tool

This is a tedious task, so we want to make its UI ergonomic.
Should it use a web UI or the command line?

It should be safe to stop at any point and resume later.

It should be possible to remove certain entries so they can be redone.
(Maybe just delete rows manually using the sqlite3 command-line tool.)

Basically, in a loop (see the sketch after this list):

- If the chunk has already been scored for all questions, skip it.
- Display the chunk (mostly from its blobs), probably using `less`.
- For each question that hasn't been scored yet:
  - Ask for yes/no, corresponding to "should it be included in the
    oracle context?"
  - As soon as you answer, it moves on to the next chunk.
- Record scores in the database immediately, with chunk hash and timestamp.
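
In code, the loop could look roughly like this — a sketch against the
illustrative schema above, not the actual scoring tool, with `get_chunk_text`
and `show_chunk` standing in for helpers that assemble a chunk's blobs and
page the text through `less`:

```py
import sqlite3
from datetime import datetime, timezone

def score_question(db_path, question_id, get_chunk_text, show_chunk):
    """Prompt for a 0/1 score for every chunk not yet scored for this question.

    get_chunk_text(con, chunk_id) -> str and show_chunk(text) are supplied
    by the caller (e.g. a less-based pager).
    """
    con = sqlite3.connect(db_path)
    unscored = con.execute(
        """SELECT chunkHash, chunkId FROM Hashes
           WHERE chunkHash NOT IN
                 (SELECT chunkHash FROM Scores WHERE questionId = ?)""",
        (question_id,),
    ).fetchall()
    for chunk_hash, chunk_id in unscored:
        show_chunk(get_chunk_text(con, chunk_id))
        answer = ""
        while answer not in ("0", "1", "y", "n"):
            answer = input("Include in the oracle context? [y/n] ").strip().lower()
        # Record immediately, so the session can be stopped and resumed safely.
        con.execute(
            "INSERT INTO Scores (questionId, chunkHash, score, timestamp)"
            " VALUES (?, ?, ?, ?)",
            (question_id, chunk_hash, 1 if answer in ("1", "y") else 0,
             datetime.now(timezone.utc).isoformat()),
        )
        con.commit()
    con.close()
```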

Once we've scored all chunks (hopefully not more than a few hundred),
we can move on to the next stage: running the evals.

## Automatic eval runs

TBD

# Random notes

Scoring tool needs more UI. Maybe use colors? (Code colorization, even?)
8 changes: 8 additions & 0 deletions ts/packages/agents/spelunker/evals/eval-1/source/README.md
@@ -0,0 +1,8 @@
# Source notes

This is copied from `TypeAgent/ts/packages/dispatcher/`,
with the `dist` and `node_modules` directories omitted.

The exact revision used was e7ab730c.

--Guido, 2/20/2025
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
