vitest-evals

Evaluate LLM outputs using the familiar Vitest testing framework.

Installation

npm install -D vitest-evals

Quick Start

import { describeEval } from "vitest-evals";

describeEval("capital cities", {
  data: async () => [
    { input: "What is the capital of France?", expected: "Paris" },
    { input: "What is the capital of Japan?", expected: "Tokyo" },
  ],
  task: async (input) => {
    const response = await queryLLM(input);
    return response; // Simple string return
  },
  scorers: [
    async ({ output, expected }) => ({
      score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
    }),
  ],
  threshold: 0.8,
});
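
The queryLLM call above is a placeholder for whatever invokes your model. A minimal sketch using the Vercel AI SDK (an assumption here; the SDK only appears later in this README) could look like:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

// Placeholder model call: any async function that returns a string works as a task.
const queryLLM = async (prompt: string) => {
  const { text } = await generateText({ model: openai("gpt-4o"), prompt });
  return text;
};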

Tasks

Tasks process inputs and return outputs. Two formats are supported:

// Simple: just return a string
const task = async (input) => "response";

// With tool tracking: return a TaskResult
const task = async (input) => ({
  result: "response",
  toolCalls: [
    { name: "search", arguments: { query: "..." }, result: {...} }
  ]
});

Scorers

Scorers evaluate outputs and return a score (0-1). Use built-in scorers or create your own:

// Built-in scorer
import { ToolCallScorer } from "vitest-evals";
// Or import individually
import { ToolCallScorer } from "vitest-evals/scorers/toolCallScorer";

describeEval("tool usage", {
  data: async () => [
    { input: "Search weather", expectedTools: [{ name: "weather_api" }] },
  ],
  task: weatherTask,
  scorers: [ToolCallScorer()],
});

// Custom scorer
const LengthScorer = async ({ output }) => ({
  score: output.length > 50 ? 1.0 : 0.0,
});

// TypeScript scorer with custom options
import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";

interface CustomOptions extends BaseScorerOptions {
  minLength: number;
}

const TypedScorer: ScoreFn<CustomOptions> = async (opts) => ({
  score: opts.output.length >= opts.minLength ? 1.0 : 0.0,
});
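
How custom options such as minLength are wired into a scorer isn't shown above; one way to supply them (a sketch, mirroring the factory pattern used for the custom Factuality scorer later in this README) is to close over the value:

import { type ScoreFn, type BaseScorerOptions } from "vitest-evals";

// Factory sketch: the option is captured in a closure rather than read from opts
const MinLengthScorer = (minLength: number): ScoreFn<BaseScorerOptions> =>
  async ({ output }) => ({
    score: output.length >= minLength ? 1.0 : 0.0,
  });

// Usage
scorers: [MinLengthScorer(50)];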

Built-in Scorers

ToolCallScorer

Evaluates if the expected tools were called with correct arguments.

// Basic usage - strict matching, any order
describeEval("search test", {
  data: async () => [
    {
      input: "Find Italian restaurants",
      expectedTools: [
        { name: "search", arguments: { type: "restaurant" } },
        { name: "filter", arguments: { cuisine: "italian" } },
      ],
    },
  ],
  task: myTask,
  scorers: [ToolCallScorer()],
});

// Strict evaluation - exact order and parameters
scorers: [
  ToolCallScorer({
    ordered: true, // Tools must be in exact order
    params: "strict", // Parameters must match exactly
  }),
];

// Flexible evaluation
scorers: [
  ToolCallScorer({
    requireAll: false, // Partial matches give partial credit
    allowExtras: false, // No additional tools allowed
  }),
];

Default behavior:

  • Strict parameter matching (exact equality required)
  • Any order allowed
  • Extra tools allowed
  • All expected tools required
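
Spelled out with the options shown above, these defaults correspond to roughly the following configuration (a sketch inferred from the list; option names are taken from the earlier examples):

// Roughly equivalent to ToolCallScorer() with no options (inferred defaults)
ToolCallScorer({
  ordered: false,     // any order allowed
  params: "strict",   // exact argument equality required
  requireAll: true,   // all expected tools must be called
  allowExtras: true,  // additional tool calls are permitted
});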

StructuredOutputScorer

Evaluates if the output matches expected structured data (JSON).

// Basic usage - strict matching
describeEval("query generation", {
  data: async () => [
    {
      input: "Show me errors from today",
      expected: {
        dataset: "errors",
        query: "",
        sort: "-timestamp",
        timeRange: { statsPeriod: "24h" }
      }
    }
  ],
  task: myTask,
  scorers: [StructuredOutputScorer()]
});

// Fuzzy matching with regex patterns
scorers: [
  StructuredOutputScorer({
    match: "fuzzy", // More flexible matching
  })
];

// Custom validation
scorers: [
  StructuredOutputScorer({
    match: (expected, actual, key) => {
      if (key === "age") return actual >= 18 && actual <= 100;
      return expected === actual;
    }
  })
];

// Partial credit for incomplete matches
scorers: [
  StructuredOutputScorer({
    requireAll: false, // Partial matches give partial credit
    allowExtras: false, // No additional fields allowed
  })
];

Features:

  • Strict matching (default): Exact equality for all fields
  • Fuzzy matching: Case-insensitive strings, numeric tolerance (0.1%), regex patterns, unordered arrays
  • Custom matchers: Define your own validation logic per field
  • Error detection: Automatically fails if output contains an error field
  • Partial credit: Optional scoring based on percentage of matching fields
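
As an illustration of the fuzzy rules above (hypothetical data; the tolerances are as listed, anything beyond that is an assumption):

// Hypothetical example: each comment notes which fuzzy rule applies
describeEval("fuzzy query generation", {
  data: async () => [
    {
      input: "Show me errors from today",
      expected: {
        dataset: "Errors",      // case-insensitive string match ("errors" also passes)
        tags: ["prod", "web"],  // unordered array match (["web", "prod"] also passes)
        limit: 100,             // numeric values within 0.1% tolerance pass
      },
    },
  ],
  task: myTask,
  scorers: [StructuredOutputScorer({ match: "fuzzy" })],
});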

Default behavior:

  • Strict field matching (exact equality required)
  • Extra fields allowed
  • All expected fields required
  • Checks for "error" field in output

AI SDK Integration

See src/ai-sdk-integration.test.ts for a complete example with the Vercel AI SDK.

Transform provider responses to our format:

import { generateText } from "ai";
import { openai } from "@ai-sdk/openai";

const task = async (input) => {
  const { text, steps } = await generateText({
    model: openai("gpt-4o"),
    prompt: input,
    tools: { myTool: myToolDefinition },
  });

  // Map the AI SDK's tool calls into the TaskResult shape expected by scorers
  return {
    result: text,
    toolCalls: steps
      .flatMap((step) => step.toolCalls)
      .map((call) => ({
        name: call.toolName,
        arguments: call.args,
      })),
  };
};

Advanced Usage

Advanced Scorers

Using autoevals

For sophisticated evaluation, use autoevals scorers:

import { Factuality, ClosedQA } from "autoevals";

scorers: [
  Factuality, // LLM-based factuality checking
  ClosedQA.partial({
    criteria: "Does the answer mention Paris?",
  }),
];

Custom LLM-based Factuality Scorer

Here's an example of implementing your own LLM-based factuality scorer using the Vercel AI SDK:

import { generateObject } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";

const Factuality =
  (model = openai("gpt-4o")) =>
  async ({ input, output, expected }) => {
    if (!expected) {
      return { score: 1.0, metadata: { rationale: "No expected answer" } };
    }

    const { object } = await generateObject({
      model,
      prompt: `
      Compare the factual content of the submitted answer with the expert answer.
      
      Question: ${input}
      Expert: ${expected}
      Submission: ${output}
      
      Options:
      (A) Subset of expert answer
      (B) Superset of expert answer  
      (C) Same content as expert
      (D) Contradicts expert answer
      (E) Different but factually equivalent
    `,
      schema: z.object({
        answer: z.enum(["A", "B", "C", "D", "E"]),
        rationale: z.string(),
      }),
    });

    const scores = { A: 0.4, B: 0.6, C: 1, D: 0, E: 1 };
    return {
      score: scores[object.answer],
      metadata: { rationale: object.rationale, answer: object.answer },
    };
  };

// Usage
scorers: [Factuality()];

Skip Tests Conditionally

describeEval("gpt-4 tests", {
  skipIf: () => !process.env.OPENAI_API_KEY,
  // ...
});

Existing Test Suites

For integration with existing Vitest test suites, you can use the .toEval() matcher:

⚠️ Deprecated: The .toEval() helper is deprecated. Use describeEval() instead for better test organization and support for multiple scorers. We may bring back a similar check later, but it's currently too limited for many scorer implementations.

import "vitest-evals";

test("capital check", () => {
  const simpleFactuality = async ({ output, expected }) => ({
    score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
  });

  expect("What is the capital of France?").toEval(
    "Paris",
    answerQuestion,
    simpleFactuality,
    0.8
  );
});

Recommended migration to describeEval():

import { describeEval } from "vitest-evals";

describeEval("capital check", {
  data: async () => [
    { input: "What is the capital of France?", expected: "Paris" },
  ],
  task: answerQuestion,
  scorers: [
    async ({ output, expected }) => ({
      score: output.toLowerCase().includes(expected.toLowerCase()) ? 1.0 : 0.0,
    }),
  ],
  threshold: 0.8,
});

Configuration

Separate Eval Configuration

Create vitest.evals.config.ts:

import { defineConfig } from "vitest/config";
import defaultConfig from "./vitest.config";

export default defineConfig({
  ...defaultConfig,
  test: {
    ...defaultConfig.test,
    include: ["src/**/*.eval.{js,ts}"],
  },
});

Run evals separately:

vitest --config=vitest.evals.config.ts
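
To keep eval files out of the regular test run, the base vitest.config.ts can exclude them (a sketch; the glob assumes the layout used in the config above):

import { defineConfig, configDefaults } from "vitest/config";

export default defineConfig({
  test: {
    // Exclude *.eval.* files from the normal `vitest` run; they only run via the evals config
    exclude: [...configDefaults.exclude, "src/**/*.eval.{js,ts}"],
  },
});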

Development

npm install
npm test
