chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341

Merged
merged 91 commits into from
Jul 18, 2025

Conversation

@himanshusinghs himanshusinghs commented Jul 7, 2025

Motivation and Goal

  • Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
  • Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

  • Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
  • Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
  • Reporting: Generate a readable summary of test runs highlighting the prompts, the models and the accuracy achieved.
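
To make the evaluation step concrete, here is a minimal sketch of comparing recorded tool calls against expected ones. It is an illustration only and not the scorer from this PR: `ToolCall`, `deepEqual` and `matchRate` are hypothetical names, and it uses a strict all-or-nothing match per call rather than the graded scoring discussed later in this conversation.

```typescript
// Hypothetical shape; the real types live in tests/accuracy/sdk and may differ.
interface ToolCall {
  toolName: string;
  parameters: Record<string, unknown>;
}

// Structural equality for JSON-like values (primitives, arrays, plain objects).
function deepEqual(a: unknown, b: unknown): boolean {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const ao = a as Record<string, unknown>;
  const bo = b as Record<string, unknown>;
  const aKeys = Object.keys(ao);
  if (aKeys.length !== Object.keys(bo).length) return false;
  return aKeys.every((k) => k in bo && deepEqual(ao[k], bo[k]));
}

// Fraction of expected calls that the model reproduced exactly (same name, same parameters).
function matchRate(expected: ToolCall[], actual: ToolCall[]): number {
  if (expected.length === 0) return actual.length === 0 ? 1 : 0;
  const matched = expected.filter((e) =>
    actual.some((a) => a.toolName === e.toolName && deepEqual(a.parameters, e.parameters))
  ).length;
  return matched / expected.length;
}
```

A prompt such as "count the documents in db.users" would then be graded by passing the expected `count` call and whatever calls the model actually made to `matchRate`.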

Detailed Design

Refer to the doc titled - MCP Tools Accuracy Testing

Current State

  • Framework implemented and integrated to test the MCP server and MongoDB tools.
  • Accuracy tests for core MongoDB tool calls are written.
  • The scoring algorithm is implemented and unit-tested.
  • Supports multiple LLM providers and models.
  • Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
  • On each successful test run, a summary is generated highlighting the prompt, model and accuracy of tool calls.
  • GitHub workflow added to trigger the test runs on a label and on manual dispatch. It also attaches the summary when triggered for a PR.

For reviewers

  • Please start by reviewing the test cases / prompts themselves.
  • Once done with the prompts review, move on to the tool calling accuracy scorer. I have added some docs and tests to help understand how it works.
  • Later you can review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and then dive into the specific implementation of each part.

Apologies for the big chunk to be reviewed here but I did not see a way around it.

coveralls commented Jul 7, 2025

Pull Request Test Coverage Report for Build 16372374490

Details

  • 21 of 21 (100.0%) changed or added relevant lines in 8 files are covered.
  • 3 unchanged lines in 1 file lost coverage.
  • Overall coverage increased (+4.2%) to 81.823%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/common/atlas/apiClient.ts | 3 | 73.9% |

| Totals | Coverage Status |
| --- | --- |
| Change from base Build 16350255953: | 4.2% |
| Covered Lines: | 2885 |
| Relevant Lines: | 3491 |

💛 - Coveralls

@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 Compare July 10, 2025 08:53
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e Compare July 10, 2025 11:37
@himanshusinghs himanshusinghs changed the title chore(tests): accuracy tests for MongoDB tools exposed by MCP server chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 Jul 10, 2025
@himanshusinghs himanshusinghs marked this pull request as ready for review July 10, 2025 11:39
@himanshusinghs himanshusinghs requested a review from a team as a code owner July 10, 2025 11:39
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 Compare July 10, 2025 15:43
@nirinchev nirinchev left a comment

Still going through it - leaving some comments related to storage as I move on to the actual testing framework.

For some tools (find, aggregate, count, explain, deleteMany, etc.) we
need to grade extra arguments depending on the prompt itself. Sometimes
additional parameters are fine and sometimes they are not. For example,
adding keys to a filter might lead to a different result, so in that
case we should grade the accuracy as 0 and not 0.75. To support this use
case, this commit introduces a custom scorer that can be plugged into
the accuracy scorer to provide more controlled accuracy grading.

Additionally, this commit reverts the default handling of added
parameters. Earlier we marked newly added parameters as hallucinations
and graded them 0.75. After figuring out that most of our tools don't
expect extra parameters at all, we are flipping the switch and will now
grade 0 when additional parameters are specified, unless a scorer is
provided to handle the custom scoring logic.
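
The default and the custom-scorer hook described in this commit message can be sketched roughly as follows. This is a hedged illustration, not the code from the PR: `Params`, `AdditionsScorer`, `scoreAdditions`, `isEmpty` and `tolerateEmptyAdditions` are hypothetical names, only top-level parameters are compared (the filter example above would need a recursive comparison), and the expected parameters are assumed to match already.

```typescript
// Hypothetical names throughout; this is not the PR's scorer.
type Params = Record<string, unknown>;

// A custom scorer sees only the parameters the model added beyond the expected ones.
type AdditionsScorer = (added: Params) => number;

// Grades extra top-level parameters: 1 when there are none, otherwise 0
// unless a custom scorer is supplied to decide the grade.
function scoreAdditions(expected: Params, actual: Params, custom?: AdditionsScorer): number {
  const added: Params = {};
  for (const [key, value] of Object.entries(actual)) {
    if (!(key in expected)) added[key] = value;
  }
  if (Object.keys(added).length === 0) return 1;
  return custom ? custom(added) : 0;
}

// Treats undefined, null, {} and [] as "empty" for JSON-like values.
const isEmpty = (v: unknown): boolean =>
  v === undefined ||
  v === null ||
  (typeof v === "object" && Object.keys(v as object).length === 0);

// Example custom scorer: empty additions are tolerated at 0.75, anything else is 0.
const tolerateEmptyAdditions: AdditionsScorer = (added) =>
  Object.values(added).some((v) => !isEmpty(v)) ? 0 : 0.75;
```

Under these assumptions, an unexpected `limit: 10` grades 0 by default, while an unexpected `filter: {}` grades 0.75 only when `tolerateEmptyAdditions` is passed in.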
@himanshusinghs himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 8488144 to ec52ee5 Compare July 18, 2025 01:21
);
});

return hasNonEmptyAdditions ? 0 : 0.75;
Collaborator Author

This could be more lenient and return 1 instead of 0.75 for empty additions, but I kept it this way to stay aligned with how we treat hallucinations.

📊 Accuracy Test Results

📈 Summary

| Metric | Value |
| --- | --- |
| Commit SHA | 5852e04ed6a2fe1af53cdcf706e163c322502b4a |
| Run ID | 1a01ef92-71f4-49ea-98a5-8db75fcb5c8d |
| Status | done |
| Total Prompts Evaluated | 51 |
| Models Tested | 1 |
| Average Accuracy | 98.5% |
| Responses with 0% Accuracy | 0 |
| Responses with 75% Accuracy | 3 |
| Responses with 100% Accuracy | 48 |
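
The reported average is consistent with a plain mean over the per-response scores. The snippet below is a quick check under that assumption (the report does not state how the average is computed, and the `buckets` variable is just an illustrative name); the counts are taken from the summary above.

```typescript
// Response counts per score, as listed in the summary above.
const buckets = [
  { score: 0, responses: 0 },
  { score: 0.75, responses: 3 },
  { score: 1, responses: 48 },
];

// Assumption: average accuracy is the unweighted mean of all response scores.
const total = buckets.reduce((n, b) => n + b.responses, 0);
const sum = buckets.reduce((s, b) => s + b.score * b.responses, 0);
const averagePercent = ((sum / total) * 100).toFixed(1);

console.log(total, sum, averagePercent); // 51 50.25 98.5
```

That is, (48 × 1 + 3 × 0.75 + 0 × 0) / 51 = 50.25 / 51 ≈ 0.985.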

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 7/18/2025, 2:00:20 PM

@nirinchev nirinchev merged commit 7856bb9 into main Jul 18, 2025
20 checks passed
@nirinchev nirinchev deleted the chore/issue-307-proposal-2 branch July 18, 2025 14:50