chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 #341

himanshusinghs · 2025-07-07T14:02:55Z

Motivation and Goal

Motivation: Consolidate MCP server tools to accommodate client soft limits, while mitigating the risk of confusing LLMs with potentially ambiguous tool schemas and descriptions.
Goal: Benchmark current tool understanding by LLMs to establish a baseline and prevent regression during consolidation.

Design Brief

Method: Provide LLMs (Gemini, Claude, ChatGPT, etc.) with prompts and current MCP server tool schemas.
Evaluation: Record actual tool calls made by LLMs and compare them against expected calls and parameters.
Reporting: Generate a readable summary of test runs highlighting the prompt, the models, and the achieved accuracy.
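The evaluation step above can be sketched roughly as follows. This is a minimal illustration, not the scorer in this PR: the `ToolCall` shape and the `deepEqual` and `matchRate` helpers are hypothetical names, and grading by strict structural equality is an assumption.

```typescript
// Hypothetical shape of a recorded tool call; the PR's actual types may differ.
type ToolCall = { toolName: string; parameters: Record<string, unknown> };

// Structural equality for JSON-like values (objects, arrays, primitives).
function deepEqual(a: unknown, b: unknown): boolean {
    if (a === b) return true;
    if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) return false;
    if (Array.isArray(a) !== Array.isArray(b)) return false;
    const aRecord = a as Record<string, unknown>;
    const bRecord = b as Record<string, unknown>;
    const aKeys = Object.keys(aRecord);
    if (aKeys.length !== Object.keys(bRecord).length) return false;
    return aKeys.every(
        (key) => Object.prototype.hasOwnProperty.call(bRecord, key) && deepEqual(aRecord[key], bRecord[key])
    );
}

// Fraction of expected calls that the model made with the same tool name and identical parameters.
function matchRate(expected: ToolCall[], actual: ToolCall[]): number {
    if (expected.length === 0) return actual.length === 0 ? 1 : 0;
    const matched = expected.filter((exp) =>
        actual.some((act) => act.toolName === exp.toolName && deepEqual(act.parameters, exp.parameters))
    );
    return matched.length / expected.length;
}
```

A stricter or more lenient comparison (for example partial credit for near misses) would replace `deepEqual` here.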

Detailed Design

Refer to the doc titled "MCP Tools Accuracy Testing".

Current State

Framework implemented and integrated to test the MCP server and MongoDB tools.
Accuracy tests for core MongoDB tool calls are written.
The scoring algorithm is implemented and unit-tested.
Supports multiple LLM providers and models.
Snapshots are stored on disk by default, with the option to store them in a MongoDB deployment as well.
On each successful test run, a summary is generated highlighting the prompt, model, and accuracy of tool calls.
A GitHub workflow is added to trigger the test runs on a label or on manual dispatch. It also attaches the summary when triggered for a PR.

For reviewers

Please start by reviewing the test cases / prompts themselves.
Once done with the prompts, move on to the tool calling accuracy scorer. I have added some docs and tests to help explain how it works.
After that, review the rest of the accuracy SDK in the folder tests/accuracy/sdk. Start with describe-accuracy-test.ts, as this is where all the different parts come together, and then dive into the specific implementation of each part.

Apologies for the big chunk to be reviewed here but I did not see a way around it.

coveralls · 2025-07-07T14:12:56Z

Pull Request Test Coverage Report for Build 16372374490

Details

21 of 21 (100.0%) changed or added relevant lines in 8 files are covered.
3 unchanged lines in 1 file lost coverage.
Overall coverage increased (+4.2%) to 81.823%

| Files with Coverage Reduction | New Missed Lines | % |
| --- | --- | --- |
| src/common/atlas/apiClient.ts | 3 | 73.9% |

Totals
Change from base Build 16350255953:	4.2%
Covered Lines:	2885
Relevant Lines:	3491

💛 - Coveralls

.github/workflows/accuracy-tests.yml

package.json

scripts/generate-test-summary.ts

.github/workflows/accuracy-tests.yml

nirinchev

Still going through it - leaving some comments related to storage as I move on to the actual testing framework.

scripts/accuracy/generate-test-summary.ts

tests/accuracy/sdk/accuracy-result-storage/result-storage.ts

tests/accuracy/sdk/accuracy-result-storage/get-accuracy-result-storage.ts

tests/accuracy/sdk/accuracy-result-storage/mongodb-storage.ts

tests/accuracy/sdk/accuracy-result-storage/disk-storage.ts

scripts/accuracy/generate-test-summary.ts

In some cases (find, aggregate, count, explain, deleteMany, etc.) we need to grade extra arguments depending on the prompt itself. Sometimes additional parameters are fine and sometimes they are not. For example, adding keys to a filter might lead to a different result, so when that happens we should grade the accuracy as 0 and not 0.75. To support this use case, this commit introduces a custom scorer that can be plugged into the accuracy scorer to provide more controlled grading.

Additionally, this commit changes the default handling of added parameters. Earlier we marked newly added parameters as hallucinations and graded them 0.75. Now, having found that most of our tools don't expect extra parameters at all, we grade 0 when additional parameters are specified, unless a scorer is provided to handle the custom scoring logic.
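A rough sketch of the pluggable scoring described above, under stated assumptions: `scoreParameters`, `AdditionsScorer`, `isEmpty`, and `tolerateEmptyAdditions` are hypothetical names, and only the 0, 0.75, and 1 grades come from this thread. The real scorer in tests/accuracy/sdk may be structured differently.

```typescript
// Hypothetical names throughout; only the 0 / 0.75 / 1 grades come from this PR thread.
type Params = Record<string, unknown>;

// A custom scorer receives only the unexpected parameters and returns a grade in [0, 1].
type AdditionsScorer = (additions: Params) => number;

function scoreParameters(expected: Params, actual: Params, scoreAdditions?: AdditionsScorer): number {
    // Any expected parameter that is missing or different fails the call outright.
    // JSON.stringify is a simplification: it is sensitive to key order.
    for (const [key, value] of Object.entries(expected)) {
        if (JSON.stringify(actual[key]) !== JSON.stringify(value)) return 0;
    }
    // Collect the parameters the model supplied that the test did not expect.
    const additions: Params = {};
    for (const [key, value] of Object.entries(actual)) {
        if (!(key in expected)) additions[key] = value;
    }
    if (Object.keys(additions).length === 0) return 1;
    // Default: extra parameters grade 0 unless a custom scorer decides otherwise.
    return scoreAdditions ? scoreAdditions(additions) : 0;
}

// Treats undefined, null, and empty objects or arrays as "empty".
function isEmpty(value: unknown): boolean {
    if (value === undefined || value === null) return true;
    return typeof value === "object" && Object.keys(value as object).length === 0;
}

// Example custom scorer: tolerate extras only when every one of them is empty.
const tolerateEmptyAdditions: AdditionsScorer = (additions) =>
    Object.values(additions).every(isEmpty) ? 0.75 : 0;
```

With this shape, a prompt where an extra empty `filter: {}` is harmless can pass `tolerateEmptyAdditions`, while prompts where any extra key changes the result rely on the default grade of 0.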

himanshusinghs · 2025-07-18T01:24:04Z

tests/accuracy/sdk/parameterScorer.ts

+                );
+            });
+
+            return hasNonEmptyAdditions ? 0 : 0.75;


This could be more lenient and return 1 instead of 0.75 for empty additions, but I kept it this way to stay aligned with how we treat hallucinations.

github-actions · 2025-07-18T14:00:22Z

📊 Accuracy Test Results

📈 Summary

| Metric | Value |
| --- | --- |
| Commit SHA | `5852e04ed6a2fe1af53cdcf706e163c322502b4a` |
| Run ID | `1a01ef92-71f4-49ea-98a5-8db75fcb5c8d` |
| Status | done |
| Total Prompts Evaluated | 51 |
| Models Tested | 1 |
| Average Accuracy | 98.5% |
| Responses with 0% Accuracy | 0 |
| Responses with 75% Accuracy | 3 |
| Responses with 100% Accuracy | 48 |

📎 Download Full HTML Report - Look for the accuracy-test-summary artifact for detailed results.

Report generated on: 7/18/2025, 2:00:20 PM
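The reported average is consistent with a plain unweighted mean over the 51 responses: (3 × 0.75 + 48 × 1) / 51 ≈ 0.985. The sketch below reproduces that arithmetic from the counts in the summary; the `averageAccuracy` helper is hypothetical, and treating the figure as an unweighted mean is an assumption about how the report computes it.

```typescript
// Counts taken from the summary in this comment; the helper itself is hypothetical.
type Bucket = { score: number; count: number };

const buckets: Bucket[] = [
    { score: 0, count: 0 },
    { score: 0.75, count: 3 },
    { score: 1, count: 48 },
];

// Unweighted mean accuracy across all responses (assumed to be how the report averages).
function averageAccuracy(items: Bucket[]): number {
    const total = items.reduce((sum, b) => sum + b.count, 0);
    if (total === 0) return 0;
    const weighted = items.reduce((sum, b) => sum + b.score * b.count, 0);
    return weighted / total;
}

console.log(`${(averageAccuracy(buckets) * 100).toFixed(1)}%`); // prints "98.5%"
```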

himanshusinghs force-pushed the chore/issue-307-proposal-2 branch 4 times, most recently from 58bc8a5 to b557e02 Compare July 10, 2025 08:53

github-advanced-security bot found potential problems Jul 10, 2025

.github/workflows/accuracy-tests.yml (fixed)

himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 7791e20 to 79cd26e Compare July 10, 2025 11:37

himanshusinghs changed the title ~~chore(tests): accuracy tests for MongoDB tools exposed by MCP server~~ chore(tests): accuracy tests for MongoDB tools exposed by MCP server MCP-39 Jul 10, 2025

himanshusinghs marked this pull request as ready for review July 10, 2025 11:39

himanshusinghs requested a review from a team as a code owner July 10, 2025 11:39

himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 79cd26e to 6ccaa11 Compare July 10, 2025 15:43

nirinchev reviewed Jul 11, 2025

himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from f666014 to 1cc93f2 Compare July 13, 2025 23:37

himanshusinghs added accuracy-tests and removed accuracy-tests labels Jul 14, 2025

mongodb-js deleted a comment from github-actions bot Jul 14, 2025

nirinchev reviewed Jul 14, 2025

himanshusinghs added accuracy-tests and removed accuracy-tests labels Jul 14, 2025

gagik reviewed Jul 14, 2025

scripts/accuracy/generate-test-summary.ts (two threads, outdated and resolved)

himanshusinghs removed the accuracy-tests label Jul 14, 2025

himanshusinghs added 9 commits July 18, 2025 03:21

chore: update test file names per naming convention

ba37196

chore: update sdk file names per naming convention

c2a51fd

chore: update accuracy file name per convention

a66553b

chore: move test config out of functions

ab99613

chore: move left out test config out of functions

093ebcf

chore: remove unused func

8496b03

chore: remove orphan checks

4bbcba1

chore: update the test prompt

7c3061d

himanshusinghs force-pushed the chore/issue-307-proposal-2 branch from 8488144 to ec52ee5 Compare July 18, 2025 01:21

himanshusinghs commented Jul 18, 2025

chore: ts fixes

743cbfa

himanshusinghs added accuracy-tests and removed accuracy-tests labels Jul 18, 2025

nirinchev added 6 commits July 18, 2025 11:13

fix: tweak the arg shapes to improve tool accuracy (#381)

3491a3b

Replace the matcher framework

2909e8a

remove microdiff

49bfac4

fix tests

356512b

don't omit fields for MongoDB storage

8a5a9d2

fix test coverage

2d4e750

nirinchev added accuracy-tests and removed accuracy-tests labels Jul 18, 2025

nirinchev approved these changes Jul 18, 2025

nirinchev merged commit 7856bb9 into main Jul 18, 2025
20 checks passed

nirinchev deleted the chore/issue-307-proposal-2 branch July 18, 2025 14:50
