
Adds semantic Search Evaluation Bench #140

Open
PascalSenn wants to merge 3 commits into main from pse/adds-semantic-search-evaulation-bench

@PascalSenn PascalSenn commented Jun 14, 2026

Contributor

Add semantic-search / semantic-introspection evaluation suite

This PR contributes an evaluation suite for semantic search over large GraphQL schemas under benchmark/evaluation/.

What it is

Real GraphQL schemas are too big to paste into a prompt. To use one with an LLM, you have to semantically find the part that matters, either by retrieving and rendering a small slice of the schema (a so-called subschema) or by letting an agent introspect it through tools. This suite measures different algorithms head-to-head, with numbers instead of opinions.

The suite contains five benchmarks. Each one holds everything else fixed and varies a single axis, so a result is attributable to that axis alone:

| `pnpm eval <x>` | what it varies | what it measures |
| --- | --- | --- |
| models | the embedding model | retrieval recall@K |
| templates | the text embedded per field | field recall@K |
| type-templates | the text embedded per type | type recall@K |
| strategies (default) | the selection / slicing algorithm | perfect-recall % + token cost |
| agent | chat model × strategy × prompt | answer-success % + cost |
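
For readers unfamiliar with the metrics in the table, here is a minimal sketch of recall@K and perfect-recall %, using their standard definitions. The function names, the `Type.field` identifier format, and the handling of an empty gold set are assumptions made for illustration; the suite's own scoring code may differ.

```typescript
// Hypothetical helpers: names and shapes are illustrative, not taken from the suite.

/** Fraction of the gold (known-good) fields found among the top-K ranked fields. */
function recallAtK(ranked: string[], gold: Set<string>, k: number): number {
  if (gold.size === 0) return 1; // assumption: nothing to find counts as full recall
  const topK = new Set(ranked.slice(0, k));
  let hits = 0;
  gold.forEach((field) => {
    if (topK.has(field)) hits += 1;
  });
  return hits / gold.size;
}

/** Percentage of queries for which every gold field was recovered. */
function perfectRecallPercent(recalls: number[]): number {
  if (recalls.length === 0) return 0;
  const perfect = recalls.filter((r) => r === 1).length;
  return (100 * perfect) / recalls.length;
}

// Example with made-up field identifiers.
const ranked = ["Repository.issues", "Issue.title", "User.login", "Issue.author"];
const gold = new Set(["Issue.title", "Issue.author"]);

console.log(recallAtK(ranked, gold, 2)); // 0.5 (only Issue.title is in the top 2)
console.log(recallAtK(ranked, gold, 4)); // 1
console.log(perfectRecallPercent([0.5, 1, 1, 0])); // 50
```

Perfect-recall % is the stricter of the two: a query counts only when nothing from the gold set is missing.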

The first four are the offline semantic-search stack: deterministic retrieval scored against a known-good field set. The fifth is the agentic counterpart: a chat model discovers the schema live through tools (search → SDL slice, execute_graphql_operation against a mock server, answer) and has to answer a real natural-language question end to end. Together they cover the whole pipeline from "embed the schema" to "an agent answered the user."

  • Corpus: 816 queries across 5 schemas (github, gitlab, linear, shopify, singapore), each a YAML carrying ground truth for both benchmark families.
  • Scoring is deterministic and cached: recall is graded on the rendered slice; agent success is a tolerant deep-equal against a gold answer (no LLM judge). Completed sweeps re-run for free from a content-addressed cache.
  • Reproducible: pnpm install, set OPENAI_API_KEY, pnpm eval. See benchmark/evaluation/README.md for the full picture.
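
The two scoring ideas in the list above, a tolerant deep-equal and a content-addressed cache, could look roughly like the sketch below. Everything here is hypothetical: the tolerance rules (numeric epsilon, trimmed case-insensitive strings, key order ignored), the function names, and the in-memory `Map` are assumptions, and the small FNV-1a hash stands in for whatever digest the suite really uses.

```typescript
// Hypothetical sketch: tolerance rules and cache layout are assumptions.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isObject(v: Json): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

/** Deep equality that ignores key order, string case/padding, and tiny numeric drift. */
function tolerantEqual(actual: Json, gold: Json, epsilon = 1e-9): boolean {
  if (typeof actual === "number" && typeof gold === "number") {
    return Math.abs(actual - gold) <= epsilon;
  }
  if (typeof actual === "string" && typeof gold === "string") {
    return actual.trim().toLowerCase() === gold.trim().toLowerCase();
  }
  if (Array.isArray(actual) && Array.isArray(gold)) {
    if (actual.length !== gold.length) return false;
    for (let i = 0; i < gold.length; i += 1) {
      if (!tolerantEqual(actual[i], gold[i], epsilon)) return false; // order-sensitive
    }
    return true;
  }
  if (isObject(actual) && isObject(gold)) {
    const goldKeys = Object.keys(gold);
    if (Object.keys(actual).length !== goldKeys.length) return false;
    for (const key of goldKeys) {
      if (!(key in actual)) return false;
      if (!tolerantEqual(actual[key], gold[key], epsilon)) return false;
    }
    return true;
  }
  return actual === gold; // null, booleans, and mismatched types
}

/** Serialise with sorted keys so equal configs always produce the same string. */
function canonicalize(value: Json): string {
  if (Array.isArray(value)) {
    const items: string[] = [];
    for (const item of value) items.push(canonicalize(item));
    return "[" + items.join(",") + "]";
  }
  if (isObject(value)) {
    const parts: string[] = [];
    for (const key of Object.keys(value).sort()) {
      parts.push(JSON.stringify(key) + ":" + canonicalize(value[key]));
    }
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

/** 32-bit FNV-1a over UTF-16 code units; a stand-in for a real cryptographic digest. */
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i += 1) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("0000000" + hash.toString(16)).slice(-8);
}

const scoreCache = new Map<string, number>();

/** Runs `compute` once per distinct config; later calls are served from the cache. */
function cachedScore(config: Json, compute: () => number): number {
  const key = fnv1a(canonicalize(config));
  const hit = scoreCache.get(key);
  if (hit !== undefined) return hit;
  const score = compute();
  scoreCache.set(key, score);
  return score;
}

let runs = 0;
const expensiveSweep = () => { runs += 1; return 0.75; };
cachedScore({ model: "m", k: 10 }, expensiveSweep);
cachedScore({ k: 10, model: "m" }, expensiveSweep); // same content, different key order
console.log(runs); // 1
console.log(tolerantEqual({ a: 1, b: " Foo " }, { b: "foo", a: 1 })); // true
```

Because the key is derived from the content of the configuration rather than from a run ID or timestamp, repeating a completed sweep with unchanged inputs is a cache hit. A 32-bit hash is collision-prone, so a real cache would want a longer digest.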

⚠️ Disclaimer

Most of the code in this contribution is AI-generated and has not been line-by-line reviewed. The research here was conducted largely by observing behaviour (running the benchmarks, comparing leaderboards, and trusting the deterministic, cached scoring) rather than by reviewing the implementation. The findings in insights/ are independently re-derived from results, not from code audits.
