@mabaasit
Collaborator

@mabaasit mabaasit commented Dec 11, 2025

Added eval tests for the gen-ai features using the staging chatbot endpoint. When run locally, the experiments show up under Braintrust (BT) experiments.

I compared the two implementations (gen-ai using the MMS API, and the chatbot). The results are very similar in terms of accuracy; see the first three chatbot-experiments and mms-experiments. They were run on the same dataset under the same conditions (the sample of docs differs between runs, of course).

For additional context regarding Braintrust, check #7216.

Description

Checklist

  • New tests and/or benchmarks are included
  • Documentation is changed or added
  • If this change updates the UI, screenshots/videos are added and a design review is requested
  • If this change could impact the load on the MongoDB cluster, please describe the expected and worst case impact
  • I have signed the MongoDB Contributor License Agreement (https://www.mongodb.com/legal/contributor-agreement)

Motivation and Context

  • Bugfix
  • New feature
  • Dependency update
  • Misc

Open Questions

Dependents

Types of changes

  • Backport Needed
  • Patch (non-breaking change which fixes an issue)
  • Minor (non-breaking change which adds functionality)
  • Major (fix or feature that would cause existing functionality to change)

@github-actions github-actions bot added the feat label Dec 11, 2025
@mabaasit mabaasit added the no release notes Fix or feature not for release notes label Dec 11, 2025
@mabaasit mabaasit marked this pull request as ready for review December 15, 2025 10:06
@mabaasit mabaasit requested a review from a team as a code owner December 15, 2025 10:06
Contributor

Copilot AI left a comment

Pull request overview

This PR adds evaluation tests for gen-ai features using the staging chatbot endpoint. The evaluation framework uses Braintrust to track experiment results and compares the implementation's accuracy against expected outputs.

  • Adds comprehensive eval test infrastructure using Braintrust
  • Creates test datasets from multiple collections (airbnb, berlin bars, netflix, NYC parking)
  • Implements eval cases for both find and aggregate query generation
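The data, task, and scorer flow described above can be sketched roughly as follows. This is a self-contained illustration only: `EvalCase`, `exactMatch`, and `runEval` are hypothetical names, not the PR's types or the Braintrust API, and the exact-match scorer is a simple stand-in for the LLM-based Factuality scorer the PR uses.

```typescript
// Hypothetical shapes for illustration; the PR's types.ts may differ.
type QueryKind = "find" | "aggregate";

interface EvalCase {
  kind: QueryKind;
  input: string; // natural-language prompt sent to the model
  expected: string; // expected query text
}

type Scorer = (output: string, expected: string) => number;

// Strip all whitespace so formatting differences do not affect the score.
const normalize = (s: string): string => s.replace(/\s+/g, "");

// Stand-in scorer: 1 on an exact match after normalization, otherwise 0.
const exactMatch: Scorer = (output, expected) =>
  normalize(output) === normalize(expected) ? 1 : 0;

// Runs every case through `task` and returns the mean score.
async function runEval(
  cases: EvalCase[],
  task: (c: EvalCase) => Promise<string>,
  scorer: Scorer
): Promise<number> {
  if (cases.length === 0) return 0;
  let total = 0;
  for (const c of cases) {
    total += scorer(await task(c), c.expected);
  }
  return total / cases.length;
}
```

In a real run, `task` would call the chatbot endpoint and the mean score would be reported per experiment, which is what makes the chatbot and MMS runs comparable.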

Reviewed changes

Copilot reviewed 13 out of 15 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| packages/compass-generative-ai/tests/evals/utils.ts | Utility functions for text processing, sampling, and schema generation |
| packages/compass-generative-ai/tests/evals/types.ts | TypeScript type definitions for eval cases and scorers |
| packages/compass-generative-ai/tests/evals/scorers.ts | Factuality scorer implementation using autoevals |
| packages/compass-generative-ai/tests/evals/gen-ai.eval.ts | Main eval entry point configuring the Braintrust evaluation |
| packages/compass-generative-ai/tests/evals/chatbot-api.ts | Chatbot API client for making eval requests |
| packages/compass-generative-ai/tests/evals/use-cases/index.ts | Builds eval cases from test datasets and prompts |
| packages/compass-generative-ai/tests/evals/use-cases/find-query.ts | Find query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/use-cases/aggregate-query.ts | Aggregate query test cases with expected outputs |
| packages/compass-generative-ai/tests/evals/fixtures/*.ts | Test data fixtures for multiple collections |
| packages/compass-generative-ai/package.json | Added dependencies for AI SDK, braintrust, and autoevals |

apiKey: '',
headers: {
'X-Request-Origin': 'compass-gen-ai-braintrust',
'User-Agent': 'mongodb-compass/x.x.x',

Copilot AI Dec 15, 2025

Using a placeholder version 'x.x.x' in the User-Agent header is not ideal for tracking or debugging. Consider using an actual version number or a constant that can be updated, or make it clear this is for testing purposes.
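One way to act on this suggestion, sketched under assumptions: `buildEvalHeaders` is a hypothetical helper, and where the version string comes from (for example the package's package.json) is left to the caller. The header names and the `compass-gen-ai-braintrust` origin value are taken from the snippet above.

```typescript
// Hypothetical helper, not part of the PR. The caller supplies the version,
// for example one read from package.json (an assumption).
function buildEvalHeaders(version: string): Record<string, string> {
  // Reject placeholders such as "x.x.x" so they never reach the request.
  if (!/^\d+\.\d+\.\d+/.test(version)) {
    throw new Error(`Unexpected version string: ${version}`);
  }
  return {
    "X-Request-Origin": "compass-gen-ai-braintrust",
    "User-Agent": `mongodb-compass/${version}`,
  };
}
```

If the placeholder is intentional for eval traffic, a named constant with a comment saying so would address the same concern without the validation.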

Copilot uses AI. Check for mistakes.
Comment on lines 143 to 147
[{
$project: {_id: 0, precio: "$price"},
$sort: {price: 1},
$limit: 1
}]

Copilot AI Dec 15, 2025

The aggregation pipeline is malformed - it's an array containing a single object with multiple pipeline stages as properties. Each pipeline stage should be a separate object in the array. This should be: [{$project: {_id: 0, precio: "$price"}}, {$sort: {price: 1}}, {$limit: 1}].

Suggested change
- [{
- $project: {_id: 0, precio: "$price"},
- $sort: {price: 1},
- $limit: 1
- }]
+ [
+ {$project: {_id: 0, precio: "$price"}},
+ {$sort: {price: 1}},
+ {$limit: 1}
+ ]
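A structural check along these lines could catch this class of mistake in the expected outputs. It is a minimal sketch, assuming the pipelines are available as parsed objects rather than strings; `findMalformedStages` is a hypothetical helper, not something in the PR.

```typescript
type Stage = Record<string, unknown>;

// Returns the indexes of stages that do not consist of exactly one
// `$`-prefixed key, which is the shape each pipeline stage should have.
function findMalformedStages(pipeline: Stage[]): number[] {
  const bad: number[] = [];
  pipeline.forEach((stage, i) => {
    const keys = Object.keys(stage);
    if (keys.length !== 1 || !keys.every((k) => k.startsWith("$"))) {
      bad.push(i);
    }
  });
  return bad;
}

// The original fixture: three stages merged into one object.
const malformed: Stage[] = [
  { $project: { _id: 0, precio: "$price" }, $sort: { price: 1 }, $limit: 1 },
];

// The suggested fix: one object per stage.
const fixed: Stage[] = [
  { $project: { _id: 0, precio: "$price" } },
  { $sort: { price: 1 } },
  { $limit: 1 },
];
```

The check is structural only. It would not flag that, in the corrected pipeline, `$sort: {price: 1}` runs after a `$project` that keeps only `precio`, so the sort key no longer exists at that point; sorting before projecting, or sorting on `precio`, may be what the fixture intends.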

@mabaasit mabaasit changed the title from "feat(gen-ai): add eval tests" to "feat(gen-ai): add eval tests COMPASS-10084" Dec 15, 2025
Collaborator

@paula-stacho paula-stacho left a comment

I can't test this, but looks good. Thanks for reducing the fixtures!

Labels

feat no release notes Fix or feature not for release notes

3 participants