
Conversation

@brb-nv
Collaborator

@brb-nv brb-nv commented Dec 18, 2025

Description

Currently, CUDA graph functionality for Helix parallelism is broken, resulting in sporadically poor accuracy in CI. This PR makes the following changes:

  • Use static buffers for helix_position_offsets and helix_is_inactive_rank whether or not CUDA graphs are enabled (a sketch of this pattern follows below).
  • Unify where these tensors are assigned and used; the current assignment and usage are non-uniform and confusing.
  • Update e2e tests with CUDA graph parameterization.
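
For context on the first bullet, here is a minimal sketch of the static-buffer pattern, assuming buffers pre-allocated from a maximum sequence count; the class name, sizes, and dtypes are illustrative and not the actual TensorRT-LLM implementation:

import torch

class HelixMetadataSketch:
    """Illustrative only: pre-allocate fixed-size buffers once so a captured
    CUDA graph always reads from the same device pointers."""

    def __init__(self, max_num_sequences: int, device: str = "cuda"):
        # Static device buffers referenced by the captured graph.
        self.helix_position_offsets = torch.zeros(
            max_num_sequences, dtype=torch.int, device=device)
        self.helix_is_inactive_rank = torch.zeros(
            max_num_sequences, dtype=torch.bool, device=device)
        # Pinned host staging buffers for cheap async host-to-device copies.
        self.helix_position_offsets_cpu = torch.zeros(
            max_num_sequences, dtype=torch.int, pin_memory=True)
        self.helix_is_inactive_rank_cpu = torch.zeros(
            max_num_sequences, dtype=torch.bool, pin_memory=True)

    def update_helix_param(self, helix_position_offsets, helix_is_inactive_rank):
        """Copy per-step values into the static buffers in place, so graph
        replays see fresh data without any re-allocation."""
        n = len(helix_position_offsets)
        self.helix_position_offsets_cpu[:n] = torch.as_tensor(
            helix_position_offsets, dtype=torch.int)
        self.helix_is_inactive_rank_cpu[:n] = torch.as_tensor(
            helix_is_inactive_rank, dtype=torch.bool)
        self.helix_position_offsets[:n].copy_(
            self.helix_position_offsets_cpu[:n], non_blocking=True)
        self.helix_is_inactive_rank[:n].copy_(
            self.helix_is_inactive_rank_cpu[:n], non_blocking=True)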

Test Coverage

# No CUDA graph.
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[cudagraph:none] -s -v

# Without padding.
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[cudagraph:without_padding] -s -v

# With padding.
$ pytest tests/integration/defs/accuracy/test_disaggregated_serving.py::TestDeepSeekV3Lite::test_auto_dtype_with_helix[cudagraph:with_padding] -s -v
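
The three test IDs above come from a pytest parametrization over cuda_graph_config. A rough sketch of what that parametrization can look like; the CudaGraphConfig import path and its enable_padding flag are assumptions here, not copied from the diff:

import pytest

from tensorrt_llm.llmapi import CudaGraphConfig  # assumed import path

@pytest.mark.parametrize(
    "cuda_graph_config",
    [
        None,                                   # cudagraph:none
        CudaGraphConfig(enable_padding=False),  # cudagraph:without_padding
        CudaGraphConfig(enable_padding=True),   # cudagraph:with_padding
    ],
    ids=["cudagraph:none", "cudagraph:without_padding", "cudagraph:with_padding"],
)
def test_auto_dtype_with_helix(cuda_graph_config):
    # The real test wires cuda_graph_config into the generation server config;
    # details omitted in this sketch.
    ...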

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provides a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline, or from the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.
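
As an illustration, a typical invocation combining a few of the flags documented above might look like the following (the stage name is the documented placeholder, not a recommendation):

/bot run --disable-fail-fast --stage-list "A10-PyTorch-1"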

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous, since lack of user care and validation can cause the top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

Summary by CodeRabbit

Release Notes

  • New Features

    • Added Helix parallelism support for distributed attention processing with improved parameter management and state tracking.
  • Refactor

    • Reorganized helix parameter handling through centralized metadata flow for enhanced compatibility with CUDA graphs and distributed configurations.


@brb-nv brb-nv requested review from a team as code owners December 18, 2025 23:57
@brb-nv brb-nv requested a review from yuxianq December 18, 2025 23:57
@brb-nv brb-nv force-pushed the user/brb/investigate-helix-cuda-graph branch 3 times, most recently from 7cc4b6f to 8fbbde2 on December 19, 2025 00:02
@coderabbitai
Contributor

coderabbitai bot commented Dec 19, 2025

📝 Walkthrough


The changes refactor helix parallelism handling to flow through attention metadata instead of direct parameters, introduce helix-aware control flow in the model engine, rename testing flags from enable_unit_test to enable_helix_test, and add helix state tracking to request objects and resource management.

Changes

  • Attention Backend Refactoring (tensorrt_llm/_torch/attention_backend/trtllm.py): Added helix state fields (helix_position_offsets, helix_is_inactive_rank) to TrtllmAttentionWrapper and TrtllmAttentionMetadata. Introduced the update_helix_param method to copy helix parameters to device buffers. Removed helix parameters from the forward and mla_rope_generation signatures, routing them through metadata instead.
  • MLA Module Updates (tensorrt_llm/_torch/modules/attention.py): Renamed the enable_unit_test flag to enable_helix_test throughout the MLA class. Updated context and generation code paths to set helix_position_offsets on metadata when enabled. Removed explicit helix parameter passing to backend functions.
  • Request & Engine State (tensorrt_llm/_torch/pyexecutor/llm_request.py): Added two new attributes (seqlen_this_rank_cp, total_input_len_cp), initialized to self.prompt_len, for helix parallelism tracking.
  • Model Engine Helix Flow (tensorrt_llm/_torch/pyexecutor/model_engine.py): Replaced the CP-size-based warmup exit with a CP-type-based check (skip for ULYSSES, STAR, HELIX). Added enable_helix to attention metadata from mapping state. Introduced helix state accumulators and an update_helix_param invocation. Added CP-type dispatch routing STAR through a specialized path.
  • Resource Management (tensorrt_llm/_torch/pyexecutor/resource_manager.py): Updated dummy request handling to mark only the last rank as active during CP helix parallelism via py_helix_is_inactive_rank.
  • Test Infrastructure (tests/integration/defs/accuracy/test_disaggregated_serving.py): Parameterized test_auto_dtype_with_helix with three cuda_graph_config variants (None, padding disabled, padding enabled) instead of a hardcoded value.
  • Helix Integration Tests (tests/unittest/_torch/modules/test_mla_helix.py): Changed kv_cache_tokens_per_block from 64 to 32. Replaced enable_unit_test with enable_helix_test in MLA instantiations. Removed direct helix_is_inactive_rank from metadata construction; it is now applied via the update_helix_param method.
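
To make the metadata-based routing concrete, a hedged sketch of the caller side; attn_metadata, update_helix_param, and py_helix_is_inactive_rank appear in this summary, while the loop structure and the use of total_input_len_cp as the position offset are simplifications:

def gather_helix_state(generation_requests, attn_metadata):
    # Illustrative only: accumulate per-request helix state in plain lists,
    # then hand it to the attention metadata in a single place instead of
    # threading it through forward()/mla_rope_generation() arguments.
    helix_is_inactive_rank = []
    helix_position_offsets = []
    for request in generation_requests:
        helix_is_inactive_rank.append(request.py_helix_is_inactive_rank)
        helix_position_offsets.append(request.total_input_len_cp)
    attn_metadata.update_helix_param(
        helix_position_offsets=helix_position_offsets,
        helix_is_inactive_rank=helix_is_inactive_rank)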

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

  • Attention backend refactoring: Significant restructuring of helix parameter flow from direct arguments to metadata-based routing requires careful verification of all call sites and buffer allocation logic.
  • Control flow changes in model engine: Multiple new conditional branches for CP-type dispatch (STAR vs. HELIX), warmup gating logic, and helix state accumulation patterns need careful review.
  • Test expectations alignment: Changes span from module-level flag renaming to executor-level helix state propagation; must verify backward compatibility and correctness of helix state initialization across distributed scenarios.
  • API signature changes: Multiple public method signatures modified (TrtllmAttention.forward, mla_rope_generation, MLA.__init__); requires checking all downstream callers.

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 30.43%, which is below the required threshold of 80.00%. Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: The title directly relates to the main objective: fixing CUDA graph updates for helix parallelism, which is the core focus of this PR.
  • Description check ✅ Passed: The PR description includes all required sections: a clear explanation of the issue and solution, comprehensive test coverage with specific commands, and a completed checklist.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)

2644-2654: cp_type dispatch now raises for unsupported values (incl. RING); consider falling back to generic path

The new _prepare_inputs cp_type dispatch:

if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
    cp_type = self.mapping.cp_config['cp_type']
    if CpType.STAR == cp_type:
        ...
    elif CpType.HELIX == cp_type:
        pass
    else:
        raise NotImplementedError(...)

means any cp_type other than STAR or HELIX (including CpType.RING, which is already defined) will now crash with NotImplementedError, whereas previously they would have taken the generic _prepare_tp_inputs route.

If RING or other CP modes are present (even experimentally), this is a behavior change and could break existing configs. A safer pattern would be:

  • Handle STAR specially.
  • Let HELIX and all other cp_types fall through to _prepare_tp_inputs, possibly with a warning for truly unsupported types.

For example:

Suggested fallback behavior
-        if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
-            cp_type = self.mapping.cp_config['cp_type']
-            if CpType.STAR == cp_type:
-                return self._prepare_star_attention_inputs(
-                    scheduled_requests, kv_cache_manager, attn_metadata)
-            elif CpType.HELIX == cp_type:
-                # Take the usual route of _prepare_tp_inputs.
-                pass
-            else:
-                raise NotImplementedError(
-                    f"Unsupported cp_type {getattr(cp_type, 'name', cp_type)}.")
+        if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
+            cp_type = self.mapping.cp_config['cp_type']
+            if CpType.STAR == cp_type:
+                return self._prepare_star_attention_inputs(
+                    scheduled_requests, kv_cache_manager, attn_metadata)
+            # HELIX and any other cp types fall through to the generic path for now.
+            # If a new cp_type truly cannot share _prepare_tp_inputs, add a dedicated
+            # branch rather than raising by default.
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)

551-558: Helix dummy request rank flagging looks correct; optionally restrict to generation-only

Logic cleanly marks only the last CP rank as active, which matches helix semantics for dummy requests. For clarity, you could consider guarding this under if is_gen and self.mapping.has_cp_helix(): so context-only dummies don’t carry a helix flag they never use, but behavior is already correct.

tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

491-494: Keep Helix length fields consistent when prompt_len is mutated (esp. for CUDA‑graph dummies)

seqlen_this_rank_cp and total_input_len_cp are initialized from self.prompt_len, but prompt_len is later adjusted for CUDA‑graph dummy generation requests in KVCacheManager.add_dummy_requests (set to token_num - 1). That leaves these Helix fields one step ahead of prompt_len for those dummy requests.

Given they’re used in the Helix path of _prepare_tp_inputs to derive position_id and past_seen_token_num, please confirm dummy requests never rely on them, or consider updating seqlen_this_rank_cp / total_input_len_cp alongside prompt_len in the dummy path for consistency.

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

573-579: cp_type‑based warmup skip for CP: confirm intended coverage

Switching warmup to skip explicitly for cp_type in {CpType.ULYSSES, CpType.STAR, CpType.HELIX} aligns with the new Helix assertion that warmup is never invoked, and avoids relying on cp_size alone.

Please double‑check that:

  • All currently supported CP modes that must not run cuda‑graph / autotuner / torch.compile warmups are covered by this list, and
  • Any future CP types won’t accidentally be skipped or warmed up incorrectly because of this hard‑coded set.

Also, consider formatting the log as a single message (e.g., logger.info("... %s", cp_type.name)) to avoid logging a tuple.


1029-1043: Helix metadata plumbing is coherent; consider reducing cp_allgather cost and tightening assumptions

The end‑to‑end Helix wiring here makes sense:

  • enable_helix is passed into AttentionMetadata based on mapping.has_cp_helix().
  • In _prepare_tp_inputs, Helix generation sequences gather helix_is_inactive_rank and helix_position_offsets and then feed them into attn_metadata.update_helix_param(...).

A few items worth tightening:

  1. cp_allgather inside the hot generation loop

    The cp_allgather + torch.sum(...) check is only used to assert there is exactly one active rank:

    helix_is_inactive_rank_all_ranks = self.dist.cp_allgather(
        request.py_helix_is_inactive_rank)
    ...
    assert torch.sum(helix_is_inactive_rank_all_ranks) == self.mapping.cp_size - 1

    This runs once per (request, beam) step and introduces a cross‑rank collective in the critical path just for an invariant check. You could:

    • Move it outside the beam loop (once per request), and/or
    • Guard it behind a debug flag or if __debug__: so production runs don’t pay the collective cost on every token.
  2. Assumption that dist is always initialized for Helix

    self.mapping.has_cp_helix() implies cp_size > 1, but self.dist is still optional. If a Helix mapping were ever constructed without a corresponding MPIDist, this cp_allgather would throw. If that setup is impossible by design, a brief assertion (e.g. assert self.dist is not None) near the Helix block would make the contract explicit.

  3. Shape expectations for update_helix_param

    helix_is_inactive_rank / helix_position_offsets are only populated for generation_requests (per beam), whereas sequence_lengths / attn_metadata.seq_lens span contexts + extend + first_draft + generation. That’s fine as long as update_helix_param is defined to work on just the generation portion, but it’s an implicit contract.

    It may be safer either to:

    • Document that behavior in the AttentionMetadata.update_helix_param docstring, or
    • Add a simple guard here, e.g. only call update_helix_param when helix_position_offsets is non‑empty, to avoid surprising empty‑list inputs on pure‑context batches.

Overall the logic looks sound; the main concern is avoiding unnecessary collectives in the hot path and making the Helix invariants explicit.

Also applies to: 1659-1719, 2026-2030
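
A small sketch of the kind of guard suggested in item 1; the cp_allgather call and field names are taken from the snippet above, while the environment-variable switch and helper placement are illustrative:

import os

import torch

# Illustrative opt-in switch; the real code would pick its own mechanism.
_CHECK_HELIX_INVARIANTS = os.environ.get("TLLM_DEBUG_HELIX", "0") == "1"

def check_single_active_helix_rank(dist, mapping, request):
    """Cross-rank invariant check: exactly one CP rank should be active.
    Runs only when explicitly enabled, keeping the collective out of the
    per-token hot path."""
    if not _CHECK_HELIX_INVARIANTS:
        return
    inactive_all_ranks = dist.cp_allgather(request.py_helix_is_inactive_rank)
    assert torch.sum(torch.as_tensor(inactive_all_ranks)) == mapping.cp_size - 1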

tensorrt_llm/_torch/attention_backend/trtllm.py (1)

647-660: Tighten helix buffer handling and validation in metadata/update_helix_param

The overall design (static GPU/CPU-pinned buffers plus update_helix_param for CUDA graphs) looks sound, but a few small robustness gaps are worth addressing:

  • update_helix_param assumes len(helix_position_offsets) <= self.helix_position_offsets_cpu.size(0) and similarly for helix_is_inactive_rank vs. self.helix_is_inactive_rank_cpu. If a caller misconfigures max_num_tokens/max_num_sequences, the slice assignment will throw at runtime. Adding explicit asserts makes failures easier to diagnose and guards against silent misuse.
  • Tests already pass a torch.Tensor (position_ids_gen) into update_helix_param. Using torch.tensor(helix_position_offsets, dtype=torch.int) on a tensor input creates an extra copy (and typically emits a warning). Prefer torch.as_tensor(..., dtype=torch.int, device=self.helix_position_offsets_cpu.device) and the analogous pattern for the bool mask.
  • prepare()’s helix path relies on self.helix_is_inactive_rank_cpu[:self.num_seqs] being fully populated before each call. That’s fine given the new API, but it’s worth documenting in the docstring of update_helix_param (or via a brief comment) that callers must update helix state prior to prepare() when enable_helix is set; otherwise kv_lens will be computed against stale masks.

If you’d like, I can sketch a small refactor of update_helix_param that handles both list and tensor inputs efficiently and adds the bound checks.

Also applies to: 833-859, 865-890, 931-939
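
Along the lines of the refactor offered at the end of this comment, one possible shape for update_helix_param that accepts lists or tensors and adds bound checks; the *_cpu buffer names follow the review above, and this is a sketch rather than the in-tree method:

import torch

def update_helix_param_sketch(metadata, helix_position_offsets, helix_is_inactive_rank):
    # Normalize list-or-tensor inputs once (no extra copy for tensor inputs).
    offsets = torch.as_tensor(helix_position_offsets, dtype=torch.int)
    inactive = torch.as_tensor(helix_is_inactive_rank, dtype=torch.bool)

    # Explicit bound checks make buffer misconfiguration easy to diagnose.
    capacity = metadata.helix_position_offsets_cpu.size(0)
    assert offsets.numel() <= capacity, (
        f"{offsets.numel()} helix position offsets exceed buffer capacity "
        f"{capacity}; check max_num_tokens / max_num_sequences")
    assert inactive.numel() <= metadata.helix_is_inactive_rank_cpu.size(0)

    # In-place copies into the pre-allocated pinned host buffers.
    metadata.helix_position_offsets_cpu[:offsets.numel()].copy_(offsets)
    metadata.helix_is_inactive_rank_cpu[:inactive.numel()].copy_(inactive)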

tests/unittest/_torch/modules/test_mla_helix.py (1)

462-470: MLA Helix test wiring matches new enable_helix_test/update_helix_param API

  • Using enable_helix_test=True when constructing MLA in both distributed and reference flows keeps Helix-specific adjustments (e.g., RMSNorm eps and output shaping) confined to test scenarios.
  • For generation step 0, creating metadata with enable_helix=True and calling attn_metadata.update_helix_param(helix_is_inactive_rank=..., helix_position_offsets=position_ids_gen) correctly drives the new static-buffer API.
  • Subsequent steps directly reassign kv_cache_params and helix_is_inactive_rank. That’s fine for this test because the Helix mask is constant across steps; in production code, it would be better to rely on update_helix_param to keep CUDA-graph compatibility when masks change over time.

Overall, the test changes look consistent with the backend refactor and should give good coverage of the new Helix+CUDAGraph flow.

Also applies to: 531-536, 684-700

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 799a2ae and 79cd968.

📒 Files selected for processing (7)
  • tensorrt_llm/_torch/attention_backend/trtllm.py (7 hunks)
  • tensorrt_llm/_torch/modules/attention.py (10 hunks)
  • tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
  • tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
  • tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
  • tests/integration/defs/accuracy/test_disaggregated_serving.py (2 hunks)
  • tests/unittest/_torch/modules/test_mla_helix.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: some_file.py
Python classes should use PascalCase naming: class SomeClass
Python functions and methods should use snake_case naming: def my_awesome_function():
Python local variables should use snake_case naming: my_variable = ...
Python variable names that start with a number should be prefixed with 'k': k_99th_percentile = ...
Python global variables should use upper snake_case with prefix 'G': G_MY_GLOBAL = ...
Python constants should use upper snake_case naming: MY_CONSTANT = ...
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_mla_helix.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,cuh,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification

Files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tests/integration/defs/accuracy/test_disaggregated_serving.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/modules/attention.py
  • tests/unittest/_torch/modules/test_mla_helix.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (10)
📚 Learning: 2025-12-12T03:27:18.859Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:18.859Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, when reviewing code that iterates through requests, ensure it does not convert excessive data into Python lists. Instead, the code should use torch.gather or indexing to gather only the data that will be used in the for loop before converting to Python lists. This minimizes data movement and improves performance.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/attention_backend/trtllm.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/resource_manager.py
  • tensorrt_llm/_torch/pyexecutor/llm_request.py
  • tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.

Applied to files:

  • tests/integration/defs/accuracy/test_disaggregated_serving.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.

Applied to files:

  • tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.

Applied to files:

  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.

Applied to files:

  • tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.

Applied to files:

  • tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
tensorrt_llm/runtime/model_runner.py (1)
  • mapping (825-826)
tensorrt_llm/mapping.py (2)
  • has_cp_helix (245-247)
  • cp_rank (561-562)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • has_cp_helix (104-105)
  • cp_size (56-57)
  • cp_rank (68-69)
tensorrt_llm/_torch/device_mesh.py (1)
  • cp_rank (84-86)
tensorrt_llm/_torch/attention_backend/trtllm.py (3)
cpp/tensorrt_llm/kernels/mlaKernels.h (1)
  • helix_is_inactive_rank (114-115)
tensorrt_llm/_torch/attention_backend/interface.py (1)
  • get_empty (363-390)
tensorrt_llm/_torch/attention_backend/flashinfer.py (1)
  • cached_token_lens (116-118)
tensorrt_llm/_torch/modules/attention.py (3)
tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1)
  • mapping (162-163)
tensorrt_llm/mapping.py (1)
  • has_cp_helix (245-247)
tensorrt_llm/_torch/distributed/communicator.py (1)
  • has_cp_helix (104-105)
tests/unittest/_torch/modules/test_mla_helix.py (2)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
  • update_helix_param (865-889)
cpp/tensorrt_llm/kernels/mlaKernels.h (2)
  • helix_is_inactive_rank (114-115)
  • helix_position_offsets (110-136)
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
tensorrt_llm/_torch/distributed/communicator.py (3)
  • cp_config (108-109)
  • has_cp_helix (104-105)
  • cp_size (56-57)
tensorrt_llm/mapping.py (2)
  • CpType (25-33)
  • has_cp_helix (245-247)
🔇 Additional comments (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)

86-87: Helix tensor plumb‐through into TRTLLM attention/MLA looks consistent

The addition of helix_position_offsets/helix_is_inactive_rank to the wrapper, their inclusion in mla_tensor_params, and their sourcing from TrtllmAttentionMetadata line up with the MLA kernel interface (both for the main attention op and mla_rope_generation). I don’t see issues with argument ordering or lifetimes here assuming metadata is updated via update_helix_param before plan()/mla_rope_generation are called.

Also applies to: 211-213, 301-302, 483-484, 1652-1653, 1932-1934

tests/integration/defs/accuracy/test_disaggregated_serving.py (1)

835-851: Helix cudagraph parametrization for DeepSeekV3Lite looks correct

The new cuda_graph_config parametrization (none / without_padding / with_padding) and wiring into gen_server_config cleanly exercises the three intended modes without altering the rest of the test setup. This should help catch Helix–CUDA graph regressions across padding configurations.

Also applies to: 881-881

tests/unittest/_torch/modules/test_mla_helix.py (1)

83-83: Scenario tokens_per_block reduction to 32 aligns KV cache setup with Helix tests

Changing kv_cache_tokens_per_block to 32 and using it consistently for max_tokens and tokens_per_block in KVCacheManager keeps the KV cache sizing logic coherent with the Helix configurations used elsewhere (e.g., the disaggregated DeepSeekV3Lite test). The arithmetic for max_tokens still gives a safe upper bound for all configured ctx_len/batch pairs.

Also applies to: 167-193

tensorrt_llm/_torch/modules/attention.py (2)

703-705: enable_helix_test flag is well-scoped and preserves default behavior

Introducing enable_helix_test as a named flag (replacing the old unit-test knob) and using it only to:

  • select a safe default rms_norm_eps when config.pretrained_config lacks the attribute, and
  • gate Helix-specific test wiring,

keeps the core MLA behavior unchanged for regular configs. The assertions around CP type and mapping remain intact, so non-Helix and non-test scenarios shouldn’t be impacted.

Also applies to: 749-816


1125-1134: Helix test output shaping and context helix_position_offsets hook are subtle but consistent

  • In create_output, expanding the last dimension by self.mapping.cp_size only when enable_helix_test and num_contexts > 0 gives the Helix test harness enough room to store per-CP-rank context activations without affecting non-Helix runs.
  • Setting attn_metadata.helix_position_offsets = position_ids under enable_helix_test in forward_context_default is a targeted test hook to ensure Helix math matches the reference; it’s gated away from normal execution.
  • The final slice in forward back down to num_heads_tp_cp * v_head_dim when enable_helix_test and mapping.has_cp_helix() restores the shape expected by o_proj and mirrors the previous enable_unit_test logic.

Given the tight gating, the extra shape gymnastics should remain invisible outside the Helix test path.

Also applies to: 1372-1377, 2153-2159

@brb-nv brb-nv requested a review from thorjohnsen December 19, 2025 00:08
@brb-nv brb-nv force-pushed the user/brb/investigate-helix-cuda-graph branch from 8fbbde2 to 94112ab on December 19, 2025 00:14
@brb-nv
Collaborator Author

brb-nv commented Dec 19, 2025

/bot run --disable-fail-fast

Collaborator

@thorjohnsen thorjohnsen left a comment


LGTM

@tensorrt-cicd
Collaborator

PR_Github #29050 [ run ] triggered by Bot. Commit: 94112ab

@tensorrt-cicd
Collaborator

PR_Github #29050 [ run ] completed with state SUCCESS. Commit: 94112ab
/LLM/main/L0_MergeRequest_PR pipeline #22269 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@brb-nv brb-nv force-pushed the user/brb/investigate-helix-cuda-graph branch 2 times, most recently from 35c1cc8 to 6d8b2c0 on December 19, 2025 22:15
@brb-nv
Collaborator Author

brb-nv commented Dec 19, 2025

/bot run --disable-fail-fast

@tensorrt-cicd
Collaborator

PR_Github #29182 [ run ] triggered by Bot. Commit: 6d8b2c0

@tensorrt-cicd
Collaborator

PR_Github #29182 [ run ] completed with state SUCCESS. Commit: 6d8b2c0
/LLM/main/L0_MergeRequest_PR pipeline #22387 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

@brb-nv brb-nv force-pushed the user/brb/investigate-helix-cuda-graph branch from 2933e93 to 9c76375 on December 20, 2025 03:39
@brb-nv brb-nv force-pushed the user/brb/investigate-helix-cuda-graph branch from 4dc0470 to 1e7acc6 on December 20, 2025 04:19
@brb-nv
Collaborator Author

brb-nv commented Dec 20, 2025

/bot run --disable-fail-fast

