Adds semantic Search Evaluation Bench #140
Open
PascalSenn wants to merge 3 commits into
Add semantic-search / semantic-introspection evaluation suite
This PR contributes an evaluation suite for semantic search over large GraphQL schemas under benchmark/evaluation/.
What it is
Real GraphQL schemas are too big to paste into a prompt. To use one with an LLM, you have to semantically find the part that matters, either by retrieving and rendering a small slice of the schema (a so-called subschema) or by letting an agent introspect it through tools. This suite measures different algorithms head-to-head, with numbers instead of opinions.
The suite consists of several benchmarks. Each benchmark holds everything else fixed and varies a single axis, so a result is attributable to that axis alone:
pnpm eval <x>
The first four are the offline semantic-search stack: deterministic retrieval scored against a known-good field set. The fifth is the agentic counterpart: a chat model discovers the schema live through tools (search → SDL slice, execute_graphql_operation against a mock server, answer) and has to answer a real natural-language question end to end. Together they cover the whole pipeline from "embed the schema" to "an agent answered the user."
To run it: pnpm install, set OPENAI_API_KEY, then pnpm eval. See benchmark/evaluation/README.md for the full picture.
Most of the code in this contribution is AI-generated and has not been reviewed line by line. The research here was conducted largely by observing behaviour (running the benchmarks, comparing leaderboards, and trusting the deterministic, cached scoring) rather than by reviewing the implementation. The findings in insights/ are independently re-derived from results, not from code audits.
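For readers new to this kind of evaluation, here is a minimal sketch of what "deterministic retrieval scored against a known-good field set" could look like. It is an illustration only: the `scoreRetrieval` function, the `RetrievalScore` type, the field names, and the choice of precision/recall/F1 as metrics are all assumptions, not taken from the code or README in this PR.

```typescript
// Hypothetical sketch: none of these names come from the PR's code.
// A field is identified here by a "Type.field" string, e.g. "Query.user".
type FieldCoordinate = string;

interface RetrievalScore {
  precision: number; // share of retrieved fields that are in the known-good set
  recall: number;    // share of known-good fields that were retrieved
  f1: number;        // harmonic mean of precision and recall
}

function scoreRetrieval(
  retrieved: FieldCoordinate[],
  expected: FieldCoordinate[],
): RetrievalScore {
  // Deduplicate both sides so repeated fields do not skew the counts.
  const retrievedSet = new Set(retrieved);
  const expectedSet = new Set(expected);

  let hits = 0;
  retrievedSet.forEach((field) => {
    if (expectedSet.has(field)) hits += 1;
  });

  // Guard against division by zero for empty inputs.
  const precision = retrievedSet.size === 0 ? 0 : hits / retrievedSet.size;
  const recall = expectedSet.size === 0 ? 0 : hits / expectedSet.size;
  const f1 =
    precision + recall === 0
      ? 0
      : (2 * precision * recall) / (precision + recall);

  return { precision, recall, f1 };
}

// Usage with made-up field names: four fields retrieved, two of them expected.
const example = scoreRetrieval(
  ["Query.user", "User.name", "User.email", "Query.orders"],
  ["Query.user", "User.name"],
);
console.log(example);
```

Because a function like this is pure and takes no model output at scoring time, the same retrieved set always yields the same score, which is the property that makes leaderboard comparisons between runs meaningful.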