Adds semantic Search Evaluation Bench by PascalSenn · Pull Request #140 · graphql/ai-wg

PascalSenn · 2026-06-14T18:47:26Z

Add semantic-search / semantic-introspection evaluation suite

This PR contributes an evaluation suite for semantic search over large GraphQL schemas under benchmark/evaluation/.

What it is

Real GraphQL schemas are too big to paste into a prompt. To use one with an LLM, you have to find the part that matters semantically, either by retrieving and rendering a small slice of the schema (a so-called subschema) or by letting an agent introspect it through tools. This suite measures different algorithms head-to-head, with numbers instead of opinions.
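To make the retrieval half of that idea concrete, here is a minimal sketch of ranking schema fields against a query by cosine similarity and keeping the top K. The `EmbeddedField` shape, the function names, and the toy vectors are assumptions for illustration only; the suite's actual selection and slicing strategies live under benchmark/evaluation/ and may work differently.

```typescript
// Hypothetical shape: one schema field together with its embedding vector.
interface EmbeddedField {
  coordinate: string; // schema coordinate, e.g. "Query.repository"
  vector: number[];   // embedding of whatever text was embedded for the field
}

// Cosine similarity of two equal-length vectors; 0 if either has zero norm.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vector length mismatch");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

// Rank all fields by similarity to the query and return the K best coordinates.
function topKFields(query: number[], fields: EmbeddedField[], k: number): string[] {
  return fields
    .map((f) => ({ coordinate: f.coordinate, score: cosineSimilarity(query, f.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k)
    .map((s) => s.coordinate);
}
```

The returned coordinates would then be rendered as an SDL slice; how that rendering is done is not covered by this sketch.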

The suite contains several benchmarks. Each one holds everything else fixed and varies a single axis, so a result is attributable to that axis alone:

| `pnpm eval <x>` | what it varies | what it measures |
| --- | --- | --- |
| `models` | the embedding model | retrieval recall@K |
| `templates` | the text embedded per field | field recall@K |
| `type-templates` | the text embedded per type | type recall@K |
| `strategies` (default) | the selection / slicing algorithm | perfect-recall % + token cost |
| `agent` | chat model × strategy × prompt | answer-success % + cost |
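For readers unfamiliar with the metric, recall@K is the fraction of the known-good items that appear among the top K retrieved items. The following is a minimal sketch of that definition; the function name and the choice to treat an empty gold set as full recall are assumptions, not taken from the suite's code.

```typescript
// recall@K: share of gold items found among the first K ranked items.
function recallAtK(ranked: string[], gold: string[], k: number): number {
  const goldSet = new Set(gold);
  // Assumption: a query with nothing to find counts as fully recalled.
  if (goldSet.size === 0) return 1;
  const topK = new Set(ranked.slice(0, k));
  let hits = 0;
  goldSet.forEach((g) => {
    if (topK.has(g)) hits++;
  });
  return hits / goldSet.size;
}

// Two gold fields, only one of them ranked inside the top 2.
const exampleRecall = recallAtK(
  ["Query.repository", "Query.user", "User.login", "Repository.name"],
  ["Query.repository", "Repository.name"],
  2,
); // 0.5
```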

The first four are the offline semantic-search stack: deterministic retrieval scored against a known-good field set. The fifth is the agentic counterpart: a chat model discovers the schema live through tools (search → SDL slice, execute_graphql_operation against a mock server, answer) and has to answer a real natural-language question end to end. Together they cover the whole pipeline from "embed the schema" to "an agent answered the user."
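As an illustration of how a "perfect-recall % + token cost" score could be aggregated over the offline runs, the sketch below treats a query as perfectly recalled when every gold field appears in its slice, then reports the share of such queries alongside the mean slice size in tokens. The `SliceResult` shape and the assumption that token counts arrive precomputed are hypothetical; the PR does not state how the suite extracts fields from a rendered slice or counts tokens.

```typescript
// Hypothetical per-query record produced by one strategy.
interface SliceResult {
  goldFields: string[];  // known-good field set for the query
  sliceFields: string[]; // fields present in the rendered slice
  sliceTokens: number;   // token count of the slice, assumed precomputed
}

interface StrategyScore {
  perfectRecallPct: number; // % of queries whose slice contains every gold field
  meanTokens: number;       // average slice size in tokens
}

function scoreStrategy(results: SliceResult[]): StrategyScore {
  if (results.length === 0) return { perfectRecallPct: 0, meanTokens: 0 };
  let perfect = 0;
  let tokens = 0;
  for (const r of results) {
    const slice = new Set(r.sliceFields);
    // Perfect recall: no gold field is missing from the slice.
    if (r.goldFields.every((f) => slice.has(f))) perfect++;
    tokens += r.sliceTokens;
  }
  return {
    perfectRecallPct: (100 * perfect) / results.length,
    meanTokens: tokens / results.length,
  };
}
```

Reporting the two numbers together matters because a strategy can trivially reach perfect recall by returning the whole schema, at a token cost that defeats the purpose.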

Corpus: 816 queries across 5 schemas (github, gitlab, linear, shopify, singapore), each a YAML carrying ground truth for both benchmark families.
Scoring is deterministic and cached: recall is graded on the rendered slice; agent success is a tolerant deep-equal against a gold answer (no LLM judge). Completed sweeps re-run for free from a content-addressed cache.
Reproducible: `pnpm install`, set `OPENAI_API_KEY`, `pnpm eval`. See `benchmark/evaluation/README.md` for the full picture.
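The PR does not spell out what "tolerant" means for the agent's deep-equal check. One plausible reading, sketched here purely as an assumption, is a structural comparison that ignores object key order, trims surrounding whitespace on strings, and compares numbers within a small relative tolerance. The `Json` type, `tolerantEqual`, and the default epsilon are all hypothetical names and choices.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

function isJsonObject(v: Json): v is { [key: string]: Json } {
  return typeof v === "object" && v !== null && !Array.isArray(v);
}

// Assumed tolerances: numeric epsilon, trimmed strings, key order ignored.
// Array order is still significant in this sketch.
function tolerantEqual(actual: Json, gold: Json, eps: number = 1e-9): boolean {
  if (typeof actual === "number" && typeof gold === "number") {
    const scale = Math.max(1, Math.abs(actual), Math.abs(gold));
    return Math.abs(actual - gold) <= eps * scale;
  }
  if (typeof actual === "string" && typeof gold === "string") {
    return actual.trim() === gold.trim();
  }
  if (Array.isArray(actual) && Array.isArray(gold)) {
    return (
      actual.length === gold.length &&
      actual.every((item, i) => tolerantEqual(item, gold[i], eps))
    );
  }
  if (isJsonObject(actual) && isJsonObject(gold)) {
    const keys = Object.keys(actual);
    if (keys.length !== Object.keys(gold).length) return false;
    return keys.every(
      (k) =>
        Object.prototype.hasOwnProperty.call(gold, k) &&
        tolerantEqual(actual[k], gold[k], eps),
    );
  }
  // null, booleans, and any mismatched types fall back to strict equality.
  return actual === gold;
}
```

Because a check like this involves no model call, the same answer always receives the same grade, which is what allows completed sweeps to be cached and replayed.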

⚠️ Disclaimer

Most of the code in this contribution is AI-generated and has not been reviewed line by line. The research here was conducted largely by observing behaviour (running the benchmarks, comparing leaderboards, and trusting the deterministic, cached scoring) rather than by reviewing the implementation. The findings in insights/ are independently re-derived from results, not from code audits.

PascalSenn added 3 commits June 14, 2026 19:51

adds semantic search evaluation bench

e2e8816

adds semantic search evaluation bench

7a82b93

adds semantic search evaluation bench

72e3908

ThoreKoritzius mentioned this pull request Jun 24, 2026

Feat/qwen3 emb graphql bench #142

Open

PascalSenn wants to merge 3 commits into main from pse/adds-semantic-search-evaulation-bench

1 participant
