[Spelunker] Setting up an evaluation framework for Spelunker (#741)
Now all I have to do is come up with 10 test questions and score each of
them against each of the 426 chunks against it. :-)
gvanrossum-ms authored Feb 21, 2025
1 parent e7ab730 commit cb6ec74
Showing 98 changed files with 17,931 additions and 0 deletions.
109 changes: 109 additions & 0 deletions ts/packages/agents/spelunker/evals/README.md
@@ -0,0 +1,109 @@
# Instructions for running evals

These instructions are biased towards Linux.
For Windows I may have to adjust some of the code
(especially shell commands).

For more about the design of evaluations, see `evals/design.md`.

We use the `dispatcher` package as the running example of a sample codebase.

We use `evals/eval-1` as the directory to hold all eval data.
(This happens to be the default built into the scripts.)

## 1. Copy source files to EVALDIR (`evals/eval-1`)

Assume the TypeAgent root is `~/TypeAgent`. Adjust to taste.

```shell
$ mkdir evals/eval-1
$ mkdir evals/eval-1/source
$ cp -r ~/TypeAgent/ts/packages/dispatcher evals/eval-1/source/
$ rm -rf evals/eval-1/source/dispatcher/{dist,node_modules}
$ rm -rf evals/eval-1/source/dispatcher/package.json
$
```

We delete `dist` and `node_modules` to save space (Spelunker ignores them).
We remove `package.json` since otherwise the Repo policy test fails.

## 2. Run Spelunker over the copied sources

NOTE: You must exit the CLI by hitting `^C`.

```shell
$ cd ~/TypeAgent/ts
$ pnpm run cli interactive
...
{{...}}> @config request spelunker
Natural language request handling agent is set to 'spelunker'
{{<> SPELUNKER}}> .focus ~/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
Focus set to /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/dispatcher
{{<> SPELUNKER}}> Summarize the codebase
...
(an enormous grinding of gears)
...
The codebase consists of various TypeScript files that define interfaces, functions, classes, and tests for a system that handles commands, agents, and translations. Here is a summary of the key components:
...
References: ...
{{<> SPELUNKER}}> ^C
ELIFECYCLE  Command failed with exit code 130.
 ELIFECYCLE  Command failed with exit code 130.
$
```

This leaves the data in the database `~/.typeagent/agents/spelunker/codeSearchDatabase.db`.

## 3. Initialize the eval database

You can do this multiple times, but once you've started scoring (step 4 below),
re-running it will erase the scores you've already entered. (TODO: preserve scores.)

```shell
$ python3 ./evals/src/evalsetup.py --overwrite
Prefix: /home/gvanrossum/TypeAgent/ts/packages/agents/spelunker/evals/eval-1/source/
Database: evals/eval-1/eval.db
...
(More logging)
...
$
```
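
For orientation, the core of what such a setup script has to do is copy the agent's
`Files`, `Chunks`, and `Blobs` tables into the eval database (see section 6 below).
Here is a minimal sketch of just that copy step — an illustration, not the actual
`evalsetup.py`; paths follow the steps above:

```py
import os
import sqlite3

# Paths as used in the steps above; adjust to taste.
AGENT_DB = os.path.expanduser("~/.typeagent/agents/spelunker/codeSearchDatabase.db")
EVAL_DB = "evals/eval-1/eval.db"

con = sqlite3.connect(EVAL_DB)
con.execute("ATTACH DATABASE ? AS agent", (AGENT_DB,))

# Copy the agent's tables verbatim into the eval database.
for table in ("Files", "Chunks", "Blobs"):
    con.execute(f"DROP TABLE IF EXISTS {table}")
    con.execute(f"CREATE TABLE {table} AS SELECT * FROM agent.{table}")

con.commit()
con.execute("DETACH DATABASE agent")
con.close()
```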

## 4. Run the manual scoring tool

You have to do a separate scoring pass for every test question.
I recommend no more than 10 test questions
(you have to score 426 chunks for each question).

```shell
$ python3 evals/src/evalscore.py --question "Summarize the codebase"
```

The scoring tool repeatedly presents you with a chunk of text,
prompting you to enter `0` or `1` (or `y`/`n`) for each.

Chunks are colorized and piped through `less`;
if a chunk doesn't fit on one screen you can page through it.
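
Under the hood, a display helper along these lines is enough for the paging
behavior (hypothetical code, not the actual `evalscore.py`):

```py
import subprocess

def show_chunk(text: str) -> None:
    """Page a chunk through less, preserving ANSI colors (-R)."""
    lines = text.splitlines()
    if lines:
        # Crude "colorization": make the first (header) line bold cyan.
        lines[0] = f"\x1b[1;36m{lines[0]}\x1b[0m"
    subprocess.run(["less", "-R"], input="\n".join(lines) + "\n", text=True)
```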

## 5. Run an evaluation

TBD.

## 6. Where everything is

The `evals` directory under `spelunker` contains everything you need.

- `evals/src` contains the tools to run.
- Each separate evaluation (if you want to evaluate different codebases)
lives in a separate subdirectory, e.g. `evals/eval-1`.
- Under `evals/eval-N` you find:
- `eval.db` is the eval database; it contains all the eval data.
- Tables `Files`, `Chunks`, `Blobs` are literal copies of
the corresponding tables from the database used by the agent.
- `Hashes` contains a mapping from chunks to hashes.
- `Questions` contains the test questions.
- `Scores` contains the scores for each test question.
- `source` contains the source code of the codebase; for example:
- `source/dispatcher` contains the (frozen) `dispatcher` package.
- `source/README.md` describes the origin of the codebase.
130 changes: 130 additions & 0 deletions ts/packages/agents/spelunker/evals/design.md
@@ -0,0 +1,130 @@
# Spelunker design notes

# Evaluation design

## Purpose of the eval

We need to be able to compare different algorithms for selecting chunks
for the oracle context. (For example, using a local fuzzy index based on
embeddings, or sending everything to a cheap model to select; we can
also evaluate prompt engineering attempts.)

Evaluating the oracle is too complicated; we assume that the key to
success is providing the right context. So we evaluate that.

The proposed evaluation approach is thus:

- Take a suitable not-too-large codebase and make a copy of it.
(We don't want the sample code base to vary.)
- Write some questions that are appropriate for the codebase.
(This requires intuition, though an AI might help.)
- Chunk the codebase (fortunately, chunking is deterministic).
- **Manually** review each chunk for each question, scoring yes/no.
(Or maybe a refined scale like irrelevant, possibly relevant,
likely relevant, or extremely relevant.)
- Now, for each question:
  - Send it through the chosen selection process (this is variable).
  - Compare the selected chunks with the manual scores.
  - Produce an overall score from this (e.g. F1; see the sketch after this list).
- Repeat the latter steps for different selection algorithms.
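
For the overall score, an obvious choice (and the one the eval-runs table
below records) is F1 over the manually labeled chunks. A minimal sketch,
assuming 0/1 manual scores keyed by chunk hash:

```py
def f1_score(selected: set[str], relevant: set[str]) -> float:
    """F1 for one question.

    selected: chunk hashes the selection algorithm picked.
    relevant: chunk hashes manually scored 1 for this question.
    """
    if not selected or not relevant:
        return 0.0
    true_pos = len(selected & relevant)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(selected)
    recall = true_pos / len(relevant)
    return 2 * precision * recall / (precision + recall)
```

Per-question F1 values can then be averaged into a single number for a run.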

# Building the eval architecture

- Sample codebase stored in `test-data/sample-src/**/*.{py,ts}`
- Permanently chunked into database at `test-data/evals.db`
- Same schema as main db
- Don't bother with summaries (they're probably totally obsolete)
- There's an extra table with identifying info about the sample
codebase (so we can move the whole sqlite db elsewhere).
- There's also an extra table giving the sample questions for this
sample codebase.
- Then there's a table giving the manual scores for each chunk
and each sample question. (We may have to version this too somehow.)
- Finally there's a table of eval runs, each identifying the algorithm,
its version, head commit, start time, end time,
whether completed without failure, and F1 score.
(We keep all runs, for historical comparisons.)
- We should add the hash of the chunk (including filename,
but excluding chunk IDs) so we can re-chunk the code and not lose
the already scored chunks.
Should lineno be included in this table? Yes if we only expect changes
to the chunking algorithm, no if we expect extensive edits to the
sample codebase.
- Must support multiple codebases, with separate lists of questions.
(Or maybe just separate directories with their own database?)

## Exact schema design

For now we assume a single codebase (maybe mixed-language).

We keep the tables `Files`, `Chunks` and `Blobs` **unchanged** from
the full spelunker app. We run Spelunker with one question over our
sample codebase and copy the database files to the eval directory.
We then add more tables for the following purposes (a schema sketch follows the list):

- Eval info (start timestamp, end scoring timestamp,
free format notes).
- Table mapping hash to chunk ID.
- Table of test questions, with unique ID (1, 2, etc.).
- Table for manual scores: (question ID, chunk hash, score (0/1), timestamp)
- Table for eval runs (schema TBD)
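
A sketch of what these extra tables could look like in SQLite. The table names
follow the README (`Hashes`, `Questions`, `Scores`), but the columns are guesses
based on the notes above, not the schema the tools actually create:

```py
import sqlite3

# Illustrative only -- column names are guesses, not the committed schema.
EXTRA_TABLES = """
CREATE TABLE IF NOT EXISTS EvalInfo (
    startTimestamp TEXT,
    endScoringTimestamp TEXT,
    notes TEXT
);
CREATE TABLE IF NOT EXISTS Hashes (
    chunkHash TEXT PRIMARY KEY,
    chunkId TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS Questions (
    questionId INTEGER PRIMARY KEY,
    question TEXT NOT NULL
);
CREATE TABLE IF NOT EXISTS Scores (
    questionId INTEGER NOT NULL,
    chunkHash TEXT NOT NULL,
    score INTEGER NOT NULL CHECK (score IN (0, 1)),
    timestamp TEXT NOT NULL,
    PRIMARY KEY (questionId, chunkHash)
);
CREATE TABLE IF NOT EXISTS Runs (
    runId INTEGER PRIMARY KEY,
    algorithm TEXT,
    version TEXT,
    commitHash TEXT,
    startTime TEXT,
    endTime TEXT,
    completed INTEGER,
    f1 REAL
);
"""

con = sqlite3.connect("evals/eval-1/eval.db")
con.executescript(EXTRA_TABLES)
con.close()
```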

### Hashing chunks

We hash chunks so that we can keep the scores and eval run results
even when the sample codebase is re-chunked for some reason
(which assigns new chunk IDs everywhere).

However, sometimes unrelated chunks have the same text, e.g.

```py
def __len__(self):
return len(self._data)
```

occurring in multiple classes written using the same design pattern.

How to disambiguate? I propose to include the names of the containing
chunk(s), up to the module root, and also the full filename of the
file where the chunk occurs (in case there are very similar files).

So, for example, the input to the hash function could be:

```
# <filename>
# <great-grandparent> <grandparent> <parent>
<text of chunk>
```

MD5 is a fine, fast hash function, since we're not worried about crypto.
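
A sketch of the hash computation (the header format follows the example above;
the function name and the sample path are made up):

```py
import hashlib

def chunk_hash(filename: str, ancestors: list[str], chunk_text: str) -> str:
    """MD5 of the chunk text prefixed with its filename and the names of
    its containing chunks, outermost first, per the format above."""
    header = f"# {filename}\n# {' '.join(ancestors)}\n"
    return hashlib.md5((header + chunk_text).encode("utf-8")).hexdigest()

# Hypothetical example: a __len__ method inside a class Corpus.
print(chunk_hash("sample-src/corpus.py", ["Corpus"],
                 "def __len__(self):\n    return len(self._data)\n"))
```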

## Manual scoring tool

This is a tedious task, so we want to make its UI ergonomic.
Should it use a web UI or the command line?

It should be safe to stop at any point and resume later.

It should be possible to remove certain entries so they can be redone.
(Maybe just delete rows manually using the sqlite3 command-line tool.)

Basically, in a loop (see the sketch after this list):

- If the chunk has already been scored for all questions, skip it.
- Display the chunk (mostly from its blobs), probably using `less`.
- For each question that hasn't been scored yet:
  - Ask for yes/no, corresponding to "should it be included in the
    oracle context?"
  - As soon as you answer, it moves on to the next chunk.
- Record scores in the database immediately, with chunk hash and timestamp.
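
In code, the loop could look roughly like this — a sketch against the
illustrative schema above, not the actual scoring tool, with `get_chunk_text`
and `show_chunk` standing in for helpers that assemble a chunk's blobs and
page the text through `less`:

```py
import sqlite3
from datetime import datetime, timezone

def score_question(db_path, question_id, get_chunk_text, show_chunk):
    """Prompt for a 0/1 score for every chunk not yet scored for this question.

    get_chunk_text(con, chunk_id) -> str and show_chunk(text) are supplied
    by the caller (e.g. a less-based pager).
    """
    con = sqlite3.connect(db_path)
    unscored = con.execute(
        """SELECT chunkHash, chunkId FROM Hashes
           WHERE chunkHash NOT IN
                 (SELECT chunkHash FROM Scores WHERE questionId = ?)""",
        (question_id,),
    ).fetchall()
    for chunk_hash, chunk_id in unscored:
        show_chunk(get_chunk_text(con, chunk_id))
        answer = ""
        while answer not in ("0", "1", "y", "n"):
            answer = input("Include in the oracle context? [y/n] ").strip().lower()
        # Record immediately, so the session can be stopped and resumed safely.
        con.execute(
            "INSERT INTO Scores (questionId, chunkHash, score, timestamp)"
            " VALUES (?, ?, ?, ?)",
            (question_id, chunk_hash, 1 if answer in ("1", "y") else 0,
             datetime.now(timezone.utc).isoformat()),
        )
        con.commit()
    con.close()
```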

Once we've scored all chunks (hopefully not more than a few hundred),
we can move on to the next stage: running the evals.

## Automatic eval runs

TBD

# Random notes

Scoring tool needs more UI. Maybe use colors? (Code colorization, even?)
8 changes: 8 additions & 0 deletions ts/packages/agents/spelunker/evals/eval-1/source/README.md
@@ -0,0 +1,8 @@
# Source notes

This is copied from `TypeAgent/ts/packages/dispatcher/`,
with the `dist` and `node_modules` directories omitted.

The exact revision used was e7ab730c.

--Guido, 2/20/2025
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
