[#10056][test] AutoDeploy: Add accuracy test for Nemotron SuperV3 #10131
base: main
Conversation
Signed-off-by: Gal Hubara Agam <[email protected]>
📝 Walkthrough

Adds accuracy reference data for the nvidia/Nemotron-Super-V3 model (GSM8K: 84.38, MMLU: 79.41) to the YAML reference files and introduces a new test class, TestNemotronSuperV3, in the test suite for BF16 precision validation.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10–15 minutes
Actionable comments posted: 1
🧹 Nitpick comments (2)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (2)
241-258: Consider removing the uncertain comment or determining the optimal memory ratio.

Line 247 contains the comment `# maybe we can increase`, indicating uncertainty about the `free_mem_ratio` setting. Before enabling this test, verify the optimal memory configuration through testing, or remove the comment if 0.5 is the intended value.
268-281: Consider validating memory requirements before enabling the test.

Line 269 contains the comment `# might need to require more memory`, indicating uncertainty about whether 32000 MB is sufficient. Before enabling this test in CI, run it to confirm the memory guard value is adequate, or remove the comment if 32000 MB is the intended threshold.

The rest of the test implementation correctly follows the established pattern, with `world_size=8` matching the `skip_less_device(8)` guard and appropriate evaluation of both the MMLU and GSM8K benchmarks.
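As an illustration of the guard pattern the comment refers to, here is a minimal, hypothetical sketch of a combined device-count and free-memory check. The real `skip_less_device` decorator lives in the TensorRT-LLM test infrastructure; the function name, signature, and defaults below are illustrative assumptions, not the project's actual API.

```python
# Hypothetical sketch: skip logic combining a device-count guard (in the
# spirit of skip_less_device(8)) with a free-memory threshold (in the spirit
# of the 32000 MB guard discussed above). Names and values are illustrative.
def should_skip(num_devices: int, free_mem_mb: int,
                required_devices: int = 8,
                required_mem_mb: int = 32000) -> bool:
    """Return True when the test should be skipped for lack of resources."""
    if num_devices < required_devices:
        return True  # fewer GPUs than world_size=8 needs
    if free_mem_mb < required_mem_mb:
        return True  # below the per-device free-memory guard
    return False

print(should_skip(8, 40000))  # enough devices and memory -> False
print(should_skip(4, 40000))  # too few devices -> True
```

Running the actual test once on the target hardware would show whether the 32000 MB threshold in this sketch's spirit is adequate, which is exactly the verification the review comment asks for.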
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- tests/integration/defs/accuracy/references/gsm8k.yaml (1 hunk)
- tests/integration/defs/accuracy/references/mmlu.yaml (1 hunk)
- tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1 hunk)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
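The try-except guidance above can be sketched as follows. This is a generic illustration of the guideline, not code from the repository:

```python
def parse_port(value):
    """Parse a port number, keeping the try body minimal per the guideline."""
    try:
        port = int(value)  # only the conversion that can raise lives here
    except ValueError:
        return None        # except limited to the narrowest error type
    else:
        # Follow-up logic goes in the else block, outside the try body,
        # so unrelated bugs cannot be swallowed by the except clause.
        return port if 0 < port < 65536 else None

print(parse_port("8080"))   # -> 8080
print(parse_port("oops"))   # -> None
```

Keeping the try body to a single statement means the `except ValueError` can only ever catch the conversion failure it was written for.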
Files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧠 Learnings (2)
📚 Learning: 2025-07-28T17:06:08.621Z
Learnt from: moraxu
Repo: NVIDIA/TensorRT-LLM PR: 6303
File: tests/integration/test_lists/qa/examples_test_list.txt:494-494
Timestamp: 2025-07-28T17:06:08.621Z
Learning: In TensorRT-LLM testing, it's common to have both CLI flow tests (test_cli_flow.py) and PyTorch API tests (test_llm_api_pytorch.py) for the same model. These serve different purposes: CLI flow tests validate the traditional command-line workflow, while PyTorch API tests validate the newer LLM API backend. Both are legitimate and should coexist.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
📚 Learning: 2025-09-09T09:40:45.658Z
Learnt from: fredricz-20070104
Repo: NVIDIA/TensorRT-LLM PR: 7645
File: tests/integration/test_lists/qa/llm_function_core.txt:648-648
Timestamp: 2025-09-09T09:40:45.658Z
Learning: In TensorRT-LLM test lists, it's common and intentional for the same test to appear in multiple test list files when they serve different purposes (e.g., llm_function_core.txt for comprehensive core functionality testing and llm_function_core_sanity.txt for quick sanity checks). This duplication allows tests to be run in different testing contexts.
Applied to files:
tests/integration/defs/accuracy/test_llm_api_autodeploy.py
🧬 Code graph analysis (1)
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (5)
tests/integration/defs/accuracy/accuracy_core.py (4)
- LlmapiAccuracyTestHarness (870-881)
- MMLU (317-331)
- evaluate (184-247)
- evaluate (789-799)

tensorrt_llm/_torch/pyexecutor/sampler.py (1)
- beam_width (149-152)

tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
- use_beam_search (525-526)

tests/unittest/_torch/modeling/test_modeling_out_of_tree.py (1)
- sampling_params (58-59)

tensorrt_llm/_torch/auto_deploy/models/factory.py (2)
- model (125-127)
- tokenizer (130-132)
🔇 Additional comments (3)
tests/integration/defs/accuracy/references/mmlu.yaml (1)
345-346: LGTM! The MMLU reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (79.41) matches the expectation from the test implementation.
tests/integration/defs/accuracy/references/gsm8k.yaml (1)
288-289: LGTM! The GSM8K reference data for nvidia/Nemotron-Super-V3 is correctly formatted and placed. The accuracy value (84.38) matches the expectation from the test implementation.
tests/integration/defs/accuracy/test_llm_api_autodeploy.py (1)
260-266: LGTM! The sampling parameters configuration follows the standard pattern used across all test classes in this file, which is appropriate for accuracy evaluation.
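For context, accuracy harnesses typically pin sampling to a deterministic configuration so scores are reproducible run to run. Below is a generic, self-contained sketch of such a configuration; the class and field names are illustrative assumptions, not the project's actual SamplingParams API.

```python
from dataclasses import dataclass

@dataclass
class GreedySamplingConfig:
    """Illustrative stand-in for a deterministic sampling configuration."""
    temperature: float = 0.0   # greedy decoding for reproducible scores
    top_k: int = 1             # always pick the single most likely token
    max_new_tokens: int = 256  # cap generation length for benchmark prompts

cfg = GreedySamplingConfig()
print(cfg.temperature, cfg.top_k)  # -> 0.0 1
```

With temperature 0 and top_k 1, repeated evaluations of the same checkpoint and prompts should produce identical benchmark numbers, which is what makes fixed reference values like 84.38 and 79.41 meaningful.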
Signed-off-by: Chenghao Zhang <[email protected]>
/bot run

PR_Github #29029 [ run ] triggered by Bot. Commit:
PR_Github #29029 [ run ] completed with state

/bot run

PR_Github #29051 [ run ] triggered by Bot. Commit:
PR_Github #29051 [ run ] completed with state

/bot run

PR_Github #29173 [ run ] triggered by Bot. Commit:
PR_Github #29173 [ run ] completed with state
Summary by CodeRabbit

New Features
- Added accuracy reference data for nvidia/Nemotron-Super-V3 (GSM8K: 84.38, MMLU: 79.41) to the reference YAML files.

Tests
- Added TestNemotronSuperV3 for BF16 precision validation in the AutoDeploy accuracy test suite.