Conversation

@galagam (Collaborator) commented Dec 18, 2025

Summary by CodeRabbit

  • New Features

    • Added Nemotron-Super-V3 benchmark results to accuracy reference datasets with scores of 84.38 (GSM8K) and 79.41 (MMLU)
  • Tests

    • Added accuracy validation tests for Nemotron-Super-V3

@coderabbitai bot (Contributor) commented Dec 18, 2025

📝 Walkthrough

Adds accuracy reference data for nvidia/Nemotron-Super-V3 model (GSM8K: 84.38, MMLU: 79.41) in YAML reference files and introduces a new test class TestNemotronSuperV3 in the test suite for BF16 precision validation.

Changes

  • Reference accuracy data (tests/integration/defs/accuracy/references/gsm8k.yaml, tests/integration/defs/accuracy/references/mmlu.yaml): Adds accuracy metrics for nvidia/Nemotron-Super-V3: GSM8K (84.38) and MMLU (79.41 under mistral/Mistral-Large-3-675B)
  • Test implementation (tests/integration/defs/accuracy/test_llm_api_autodeploy.py): Introduces the TestNemotronSuperV3 class with MODEL_NAME, MODEL_PATH_BF16, a default configuration, sampling parameters, and a test_bf16 method evaluating MMLU and GSM8K with AutoDeployLLM (world_size=8)
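Based on the summary above, the added reference entries presumably look roughly like the sketch below; the actual key names and nesting used by the accuracy reference files may differ:

```yaml
# Hypothetical sketch of the new reference entries; exact schema may differ.
# tests/integration/defs/accuracy/references/gsm8k.yaml
nvidia/Nemotron-Super-V3:
  - accuracy: 84.38

# tests/integration/defs/accuracy/references/mmlu.yaml
nvidia/Nemotron-Super-V3:
  - accuracy: 79.41
```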

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10–15 minutes

  • YAML reference files contain trivial additions (single entries each)
  • Test class follows established patterns from existing Nemotron tests

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 0.00%, below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve coverage.
  • Description check (⚠️ Warning): The PR description is incomplete: it lacks a problem/motivation explanation, Test Coverage details, and PR Checklist confirmation. Add a Description section explaining the purpose of the Nemotron-Super-V3 accuracy tests, list the specific test cases under Test Coverage, and complete the PR Checklist.

✅ Passed checks (1 passed)

  • Title check (✅ Passed): The title clearly summarizes the main change, adding an accuracy test for Nemotron SuperV3 with AutoDeploy, matching the changeset, which adds test configurations and a new test class.

@coderabbitai bot (Contributor) left a comment

Actionable comments posted: 1

🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2)

241-258: Consider removing the uncertain comment or determining the optimal memory ratio.

Line 247 contains the comment # maybe we can increase, indicating uncertainty about the free_mem_ratio setting. Before enabling this test, verify the optimal memory configuration through testing, or remove the comment if 0.5 is the intended value.


268-281: Consider validating memory requirements before enabling the test.

Line 269 contains the comment # might need to require more memory, indicating uncertainty about whether 32000 MB is sufficient. Before enabling this test in CI, run it to confirm the memory guard value is adequate, or remove the comment if 32000 MB is the intended threshold.

The rest of the test implementation correctly follows the established pattern, with world_size=8 matching the skip_less_device(8) guard and appropriate evaluation of both MMLU and GSM8K benchmarks.
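The pattern described above (a world size matching the 8-device guard, and both benchmarks checked against the YAML reference scores) can be sketched in a self-contained way. FakeAutoDeployLLM and run_accuracy_test below are hypothetical stand-ins for AutoDeployLLM and the accuracy harness, not the actual TensorRT-LLM test API; only the reference scores come from this PR:

```python
# Self-contained sketch of the accuracy-test pattern described above.
# FakeAutoDeployLLM and run_accuracy_test are hypothetical stand-ins for
# AutoDeployLLM and LlmapiAccuracyTestHarness; the reference scores come
# from the gsm8k.yaml / mmlu.yaml additions in this PR.

REFERENCE_SCORES = {"GSM8K": 84.38, "MMLU": 79.41}

class FakeAutoDeployLLM:
    """Stand-in LLM that returns canned benchmark scores."""

    def __init__(self, model, world_size):
        # Mirrors the review note: world_size must match skip_less_device(8).
        assert world_size == 8, "world_size must match the 8-device guard"
        self.model = model
        self._scores = dict(REFERENCE_SCORES)

    def evaluate(self, benchmark):
        return self._scores[benchmark]

def run_accuracy_test(llm, rel_tol=0.01):
    """Evaluate every benchmark and check it against its reference score."""
    results = {}
    for benchmark, reference in REFERENCE_SCORES.items():
        score = llm.evaluate(benchmark)
        # Fail if the measured score drifts more than rel_tol from reference.
        assert abs(score - reference) <= rel_tol * reference, benchmark
        results[benchmark] = score
    return results

scores = run_accuracy_test(
    FakeAutoDeployLLM("nvidia/Nemotron-Super-V3", world_size=8))
```

The real test presumably delegates the scoring to the MMLU and GSM8K evaluator classes; the point here is only the evaluate-and-compare loop.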

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0b279f4 and 42c9762.

📒 Files selected for processing (3)
  • tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunks)
  • tests/integration/defs/accuracy/references/mmlu.yaml (1 hunks)
  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
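Several of these conventions can be shown in one small, runnable sketch; every name here is an illustrative example, not code from the PR:

```python
# Illustrative sketch of the naming and try-except guidelines above;
# all names are hypothetical examples, not repository code.

G_DEFAULT_SEED = 0    # global: upper snake_case with the 'G' prefix
MAX_NEW_TOKENS = 64   # constant: upper snake_case

class AccuracyCheck:  # class: PascalCase
    """Holds a pass/fail threshold for a benchmark score."""

    def __init__(self, threshold):
        # All externally visible members initialized in the constructor.
        self.threshold = threshold

def parse_score(raw_text):  # function and locals: snake_case
    """Parse a score string, keeping the except clause as narrow as possible."""
    try:
        parsed_value = float(raw_text)
    except ValueError:  # only the error float() can actually raise here
        return None
    else:
        return parsed_value

k_99th_percentile = parse_score("84.38")  # digit-leading name gets 'k' prefix
```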

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.

Applied to files:

  • tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (5)
tests/integration/defs/accuracy/accuracy_core.py (4)
  • LlmapiAccuracyTestHarness (870-881)
  • MMLU (317-331)
  • evaluate (184-247)
  • evaluate (789-799)
tensorrt_llm/_torch/pyexecutor/sampler.py (1)
  • beam_width (149-152)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
  • use_beam_search (525-526)
tests/unittest/_torch/modeling/test_modeling_out_of_tree.py (1)
  • sampling_params (58-59)
tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
  • model (125-127)
  • tokenizer (130-132)
🔇 Additional comments (3)
tests/integration/defs/accuracy/references/mmlu.yaml (1)

345-346: LGTM!

The MMLU reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (79.41) matches the expectation from the test implementation.

tests/integration/defs/accuracy/references/gsm8k.yaml (1)

288-289: LGTM!

The GSM8K reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (84.38) matches the expectation from the test implementation.

tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)

260-266: LGTM!

The sampling parameters configuration follows the standard pattern used across all test classes in this file, which is appropriate for accuracy evaluation.

Signed-off-by: Chenghao Zhang <[email protected]>
@nvchenghaoz nvchenghaoz changed the title [#10056][test][AutoDeploy] Add accuracy test for Nemotron SuperV3 [#10056][test] AutoDeploy: Add accuracy test for Nemotron SuperV3 Dec 18, 2025
@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29029 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29029 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22250 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29051 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29051 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22271 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@nvchenghaoz (Collaborator):

/bot run

@tensorrt-cicd (Collaborator):

PR_Github #29173 [ run ] triggered by Bot. Commit: bfbfa62

@tensorrt-cicd (Collaborator):

PR_Github #29173 [ run ] completed with state SUCCESS. Commit: bfbfa62
/LLM/main/L0_MergeRequest_PR pipeline #22378 completed with status: 'SUCCESS'
