Conversation

@galagam (Collaborator) commented Dec 18, 2025

Summary by CodeRabbit

  • New Features

    • Added Nemotron-Super-V3 benchmark results to accuracy reference datasets with scores of 84.38 (GSM8K) and 79.41 (MMLU)
  • Tests

    • Added accuracy validation tests for Nemotron-Super-V3

@coderabbitai bot (Contributor) commented Dec 18, 2025

📝 Walkthrough

Adds accuracy reference data for nvidia/Nemotron-Super-V3 model (GSM8K: 84.38, MMLU: 79.41) in YAML reference files and introduces a new test class TestNemotronSuperV3 in the test suite for BF16 precision validation.

Changes

  • Reference accuracy data (tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml): Adds accuracy metrics for nvidia/Nemotron-Super-V3: GSM8K (84.38) and MMLU (79.41 under mistral/Mistral-Large-3-675B)
  • Test implementation (tests/integration/defs/accuracy/test_llm_api_autodeploy.py): Introduces the TestNemotronSuperV3 class with MODEL_NAME, MODEL_PATH_BF16, a default configuration, sampling parameters, and a test_bf16 method evaluating MMLU and GSM8K with AutoDeployLLM (world_size=8)
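Based on the summary above, the added reference entries presumably look roughly like the sketch below; the actual key names and nesting used by the accuracy reference files may differ:

```yaml
# Hypothetical sketch of the new reference entries; exact schema may differ.
# tests/integration/defs/accuracy/references/gsm8k.yaml
nvidia/Nemotron-Super-V3:
  - accuracy: 84.38

# tests/integration/defs/accuracy/references/mmlu.yaml
nvidia/Nemotron-Super-V3:
  - accuracy: 79.41
```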

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10–15 minutes

  • YAML reference files contain trivial additions (single entries each)
  • Test class follows established patterns from existing Nemotron tests

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check (⚠️ Warning): The PR description is incomplete: it lacks a problem/motivation explanation, Test Coverage details, and PR Checklist confirmation. Add a Description section explaining the purpose of the Nemotron-Super-V3 accuracy tests, list the specific test cases under Test Coverage, and complete the PR Checklist.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change, adding an accuracy test for Nemotron SuperV3 with AutoDeploy, matching the changeset, which adds test configurations and a new test class.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2)

241-258: Consider removing the uncertain comment or determining the optimal memory ratio.

Line 247 contains the comment # maybe we can increase, indicating uncertainty about the free_mem_ratio setting. Before enabling this test, verify the optimal memory configuration through testing, or remove the comment if 0.5 is the intended value.


268-281: Consider validating memory requirements before enabling the test.

Line 269 contains the comment # might need to require more memory, indicating uncertainty about whether 32000 MB is sufficient. Before enabling this test in CI, run it to confirm the memory guard value is adequate, or remove the comment if 32000 MB is the intended threshold.

The rest of the test implementation correctly follows the established pattern, with world_size=8 matching the skip_less_device(8) guard and appropriate evaluation of both MMLU and GSM8K benchmarks.
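The pattern described above (a world size matching the 8-device guard, and both benchmarks checked against the YAML reference scores) can be sketched in a self-contained way. FakeAutoDeployLLM and run_accuracy_test below are hypothetical stand-ins for AutoDeployLLM and the accuracy harness, not the actual TensorRT-LLM test API; only the reference scores come from this PR:

```python
# Self-contained sketch of the accuracy-test pattern described above.
# FakeAutoDeployLLM and run_accuracy_test are hypothetical stand-ins for
# AutoDeployLLM and LlmapiAccuracyTestHarness; the reference scores come
# from the gsm8k.yaml / mmlu.yaml additions in this PR.

REFERENCE_SCORES = {"GSM8K": 84.38, "MMLU": 79.41}

class FakeAutoDeployLLM:
    """Stand-in LLM that returns canned benchmark scores."""

    def __init__(self, model, world_size):
        # Mirrors the review note: world_size must match skip_less_device(8).
        assert world_size == 8, "world_size must match the 8-device guard"
        self.model = model
        self._scores = dict(REFERENCE_SCORES)

    def evaluate(self, benchmark):
        return self._scores[benchmark]

def run_accuracy_test(llm, rel_tol=0.01):
    """Evaluate every benchmark and check it against its reference score."""
    results = {}
    for benchmark, reference in REFERENCE_SCORES.items():
        score = llm.evaluate(benchmark)
        # Fail if the measured score drifts more than rel_tol from reference.
        assert abs(score - reference) <= rel_tol * reference, benchmark
        results[benchmark] = score
    return results

scores = run_accuracy_test(
    FakeAutoDeployLLM("nvidia/Nemotron-Super-V3", world_size=8))
```

The real test presumably delegates the scoring to the MMLU and GSM8K evaluator classes; the point here is only the evaluate-and-compare loop.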

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b279f4 and 42c9762.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
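Several of these conventions can be shown in one small, runnable sketch; every name here is an illustrative example, not code from the PR:

```python
# Illustrative sketch of the naming and try-except guidelines above;
# all names are hypothetical examples, not repository code.

G_DEFAULT_SEED = 0    # global: upper snake_case with the 'G' prefix
MAX_NEW_TOKENS = 64   # constant: upper snake_case

class AccuracyCheck:  # class: PascalCase
    """Holds a pass/fail threshold for a benchmark score."""

    def __init__(self, threshold):
        # All externally visible members initialized in the constructor.
        self.threshold = threshold

def parse_score(raw_text):  # function and locals: snake_case
    """Parse a score string, keeping the except clause as narrow as possible."""
    try:
        parsed_value = float(raw_text)
    except ValueError:  # only the error float() can actually raise here
        return None
    else:
        return parsed_value

k_99th_percentile = parse_score("84.38")  # digit-leading name gets 'k' prefix
```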

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (5)
tests/integration/defs/accuracy/accuracy_core.py (4)
  • LlmapiAccuracyTestHarness (870-881)
  • MMLU (317-331)
  • evaluate (184-247)
  • evaluate (789-799)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
  • beam_width (149-152)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • use_beam_search (525-526)
tests/unittest/_torch/modeling/test_modeling_out_of_tree.py (1)
  • sampling_params (58-59)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
  • model (125-127)
  • tokenizer (130-132)
🔇 Additional comments (3)
tests/integration/defs/accuracy/references/mmlu.yaml (1)

345-346: LGTM!

The MMLU reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (79.41) matches the expectation from the test implementation.

tests/integration/defs/accuracy/references/gsm8k.yaml (1)

288-289: LGTM!

The GSM8K reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (84.38) matches the expectation from the test implementation.

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

260-266: LGTM!

The sampling parameters configuration follows the standard pattern used across all test classes in this file, which is appropriate for accuracy evaluation.

Signed-off-by: Chenghao Zhang <[email protected]>
@nvchenghaoz nvchenghaoz changed the title [#10056][test][AutoDeploy] Add accuracy test for Nemotron SuperV3 [#10056][test] AutoDeploy: Add accuracy test for Nemotron SuperV3 Dec 18, 2025
@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29029 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29029 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22250 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29051 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29051 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22271 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29173 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29173 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22378 completed with status: 'SUCCESS'
