[None][feat] Cudagraph updates for helix parallelism #10141
base: main
Conversation
Force-pushed from 7cc4b6f to 8fbbde2 (Compare)
📝 Walkthrough

The changes refactor helix parallelism handling to flow through attention metadata instead of direct parameters, introduce helix-aware control flow in the model engine, and rename the testing flag from `enable_unit_test` to `enable_helix_test`.
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
Actionable comments posted: 0
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
tensorrt_llm/_torch/pyexecutor/model_engine.py (1)
2644-2654: cp_type dispatch now raises for unsupported values (incl. RING); consider falling back to generic path

The new `_prepare_inputs` cp_type dispatch:

```python
if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
    cp_type = self.mapping.cp_config['cp_type']
    if CpType.STAR == cp_type:
        ...
    elif CpType.HELIX == cp_type:
        pass
    else:
        raise NotImplementedError(...)
```

means any `cp_type` other than `STAR` or `HELIX` (including `CpType.RING`, which is already defined) will now crash with `NotImplementedError`, whereas previously they would have taken the generic `_prepare_tp_inputs` route.

If RING or other CP modes are present (even experimentally), this is a behavior change and could break existing configs. A safer pattern would be:

- Handle `STAR` specially.
- Let `HELIX` and all other cp_types fall through to `_prepare_tp_inputs`, possibly with a warning for truly unsupported types.

For example:

Suggested fallback behavior

```diff
-        if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
-            cp_type = self.mapping.cp_config['cp_type']
-            if CpType.STAR == cp_type:
-                return self._prepare_star_attention_inputs(
-                    scheduled_requests, kv_cache_manager, attn_metadata)
-            elif CpType.HELIX == cp_type:
-                # Take the usual route of _prepare_tp_inputs.
-                pass
-            else:
-                raise NotImplementedError(
-                    f"Unsupported cp_type {getattr(cp_type, 'name', cp_type)}.")
+        if self.mapping is not None and 'cp_type' in self.mapping.cp_config:
+            cp_type = self.mapping.cp_config['cp_type']
+            if CpType.STAR == cp_type:
+                return self._prepare_star_attention_inputs(
+                    scheduled_requests, kv_cache_manager, attn_metadata)
+            # HELIX and any other cp types fall through to the generic path for now.
+            # If a new cp_type truly cannot share _prepare_tp_inputs, add a dedicated
+            # branch rather than raising by default.
```
🧹 Nitpick comments (6)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (1)
551-558: Helix dummy request rank flagging looks correct; optionally restrict to generation-only

Logic cleanly marks only the last CP rank as active, which matches helix semantics for dummy requests. For clarity, you could consider guarding this under `if is_gen and self.mapping.has_cp_helix():` so context-only dummies don't carry a helix flag they never use, but behavior is already correct. A minimal sketch of that guard follows.
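The following is a hypothetical sketch of the suggested guard, not the code in `resource_manager.py`; the helper name and the bare `req` / `is_gen` parameters are illustrative stand-ins for the surrounding `add_dummy_requests` loop.

```python
def flag_helix_dummy_request(req, mapping, is_gen: bool) -> None:
    """Hypothetical helper: flag helix (in)activity only for generation dummies.

    Only the last CP rank stays active; every other CP rank is marked inactive,
    and context-only dummies never carry the flag at all.
    """
    if is_gen and mapping.has_cp_helix():
        req.py_helix_is_inactive_rank = mapping.cp_rank != mapping.cp_size - 1
```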
tensorrt_llm/_torch/pyexecutor/llm_request.py (1)

491-494: Keep Helix length fields consistent when prompt_len is mutated (esp. for CUDA-graph dummies)

`seqlen_this_rank_cp` and `total_input_len_cp` are initialized from `self.prompt_len`, but `prompt_len` is later adjusted for CUDA-graph dummy generation requests in `KVCacheManager.add_dummy_requests` (set to `token_num - 1`). That leaves these Helix fields one step ahead of `prompt_len` for those dummy requests.

Given they're used in the Helix path of `_prepare_tp_inputs` to derive `position_id` and `past_seen_token_num`, please confirm dummy requests never rely on them, or consider updating `seqlen_this_rank_cp` / `total_input_len_cp` alongside `prompt_len` in the dummy path for consistency (a sketch of that follows).
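A hedged sketch of the suggested synchronization, assuming the dummy path keeps rewriting `prompt_len` to `token_num - 1`; the helper name is illustrative and the exact attribute semantics should be confirmed against `llm_request.py`.

```python
def sync_helix_lengths_for_dummy(request, token_num: int) -> None:
    """Hypothetical sketch: keep Helix CP length fields in lock-step with prompt_len.

    When the CUDA-graph dummy path sets prompt_len = token_num - 1, mirror the
    same value into the Helix bookkeeping so position_id / past_seen_token_num
    derived in _prepare_tp_inputs never run one step ahead.
    """
    request.prompt_len = token_num - 1
    request.seqlen_this_rank_cp = request.prompt_len
    request.total_input_len_cp = request.prompt_len
```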
tensorrt_llm/_torch/pyexecutor/model_engine.py (2)

573-579: cp_type-based warmup skip for CP: confirm intended coverage

Switching warmup to skip explicitly for `cp_type` in `{CpType.ULYSSES, CpType.STAR, CpType.HELIX}` aligns with the new Helix assertion that warmup is never invoked, and avoids relying on `cp_size` alone.

Please double-check that:

- All currently supported CP modes that must not run cuda-graph / autotuner / torch.compile warmups are covered by this list, and
- Any future CP types won't accidentally be skipped or warmed up incorrectly because of this hard-coded set.

Also, consider formatting the log as a single message (e.g., `logger.info("... %s", cp_type.name)`) to avoid logging a tuple. A minimal sketch of this gate is shown below.
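A small sketch of the gate described above; the set literal mirrors the review text, while the helper name is illustrative (the real check lives inline in `model_engine.py`).

```python
from tensorrt_llm.mapping import CpType

# CP types for which cuda-graph / autotuner / torch.compile warmup is skipped,
# per the review above; extend deliberately when new CP modes are added.
WARMUP_SKIP_CP_TYPES = {CpType.ULYSSES, CpType.STAR, CpType.HELIX}


def should_skip_warmup(cp_type: CpType) -> bool:
    """Illustrative helper: return True when warmup must not run for this cp_type."""
    return cp_type in WARMUP_SKIP_CP_TYPES
```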
1029-1043: Helix metadata plumbing is coherent; consider reducing cp_allgather cost and tightening assumptions

The end-to-end Helix wiring here makes sense:

- `enable_helix` is passed into `AttentionMetadata` based on `mapping.has_cp_helix()`.
- In `_prepare_tp_inputs`, Helix generation sequences gather `helix_is_inactive_rank` and `helix_position_offsets` and then feed them into `attn_metadata.update_helix_param(...)`.

A few items worth tightening:

**cp_allgather inside the hot generation loop**

The `cp_allgather` + `torch.sum(...)` check is only used to assert there is exactly one active rank:

```python
helix_is_inactive_rank_all_ranks = self.dist.cp_allgather(
    request.py_helix_is_inactive_rank)
...
assert torch.sum(helix_is_inactive_rank_all_ranks) == self.mapping.cp_size - 1
```

This runs once per (request, beam) step and introduces a cross-rank collective in the critical path just for an invariant check. You could:

- Move it outside the `beam` loop (once per request), and/or
- Guard it behind a debug flag or `if __debug__:` so production runs don't pay the collective cost on every token.

**Assumption that dist is always initialized for Helix**

`self.mapping.has_cp_helix()` implies `cp_size > 1`, but `self.dist` is still optional. If a Helix mapping were ever constructed without a corresponding `MPIDist`, this `cp_allgather` would throw. If that setup is impossible by design, a brief assertion (e.g. `assert self.dist is not None`) near the Helix block would make the contract explicit.

**Shape expectations for update_helix_param**

`helix_is_inactive_rank` / `helix_position_offsets` are only populated for `generation_requests` (per beam), whereas `sequence_lengths` / `attn_metadata.seq_lens` span contexts + extend + first_draft + generation. That's fine as long as `update_helix_param` is defined to work on just the generation portion, but it's an implicit contract.

It may be safer either to:

- Document that behavior in the `AttentionMetadata.update_helix_param` docstring, or
- Add a simple guard here, e.g. only call `update_helix_param` when `helix_position_offsets` is non-empty, to avoid surprising empty-list inputs on pure-context batches (see the sketch after this comment).

Overall the logic looks sound; the main concern is avoiding unnecessary collectives in the hot path and making the Helix invariants explicit.
Also applies to: 1659-1719, 2026-2030
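A hedged sketch combining both suggestions (the `__debug__` guard and the non-empty check); the functions are illustrative and assume `dist.cp_allgather` returns a per-rank sequence (list or tensor) of the inactive flags.

```python
import torch


def check_single_active_helix_rank(dist, mapping, is_inactive_rank) -> None:
    """Illustrative debug-only invariant check: exactly one CP rank is active.

    Running under `python -O` removes the collective entirely, so production
    decode steps never pay for the allgather.
    """
    if __debug__:
        assert dist is not None, "Helix requires an initialized distributed backend"
        inactive_all_ranks = dist.cp_allgather(is_inactive_rank)
        num_inactive = int(torch.as_tensor(inactive_all_ranks).sum())
        assert num_inactive == mapping.cp_size - 1


def maybe_update_helix_param(attn_metadata, helix_is_inactive_rank,
                             helix_position_offsets) -> None:
    """Illustrative guard: only push Helix params for non-empty generation batches."""
    if len(helix_position_offsets) > 0:
        attn_metadata.update_helix_param(
            helix_is_inactive_rank=helix_is_inactive_rank,
            helix_position_offsets=helix_position_offsets)
```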
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
647-660: Tighten helix buffer handling and validation in metadata/update_helix_param

The overall design (static GPU/CPU-pinned buffers plus `update_helix_param` for CUDA graphs) looks sound, but a few small robustness gaps are worth addressing:

- `update_helix_param` assumes `len(helix_position_offsets) <= self.helix_position_offsets_cpu.size(0)` and similarly for `helix_is_inactive_rank` vs. `self.helix_is_inactive_rank_cpu`. If a caller misconfigures `max_num_tokens` / `max_num_sequences`, the slice assignment will throw at runtime. Adding explicit asserts makes failures easier to diagnose and guards against silent misuse.
- Tests already pass a `torch.Tensor` (`position_ids_gen`) into `update_helix_param`. Using `torch.tensor(helix_position_offsets, dtype=torch.int)` on a tensor input creates an extra copy (and typically emits a warning). Prefer `torch.as_tensor(..., dtype=torch.int, device=self.helix_position_offsets_cpu.device)` and the analogous pattern for the bool mask.
- `prepare()`'s helix path relies on `self.helix_is_inactive_rank_cpu[:self.num_seqs]` being fully populated before each call. That's fine given the new API, but it's worth documenting in the docstring of `update_helix_param` (or via a brief comment) that callers must update helix state prior to `prepare()` when `enable_helix` is set; otherwise `kv_lens` will be computed against stale masks.

If you'd like, I can sketch a small refactor of `update_helix_param` that handles both list and tensor inputs efficiently and adds the bound checks — one such sketch follows.

Also applies to: 833-859, 865-890, 931-939
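A hedged sketch of such a refactor, written method-style. The host-buffer names come from the review above; the device-side buffer names (`helix_position_offsets`, `helix_is_inactive_rank`) and the host-to-device copy pattern are assumptions, not the shipped implementation.

```python
import torch


def update_helix_param(self, helix_is_inactive_rank, helix_position_offsets) -> None:
    """Sketch only: accept lists or tensors, bound-check against the preallocated
    buffers, and copy in place so CUDA graphs keep capturing the same static
    addresses."""
    offsets = torch.as_tensor(helix_position_offsets, dtype=torch.int,
                              device=self.helix_position_offsets_cpu.device)
    inactive = torch.as_tensor(helix_is_inactive_rank, dtype=torch.bool,
                               device=self.helix_is_inactive_rank_cpu.device)

    # Explicit bound checks turn max_num_tokens / max_num_sequences
    # misconfiguration into a clear error instead of a slice-assignment failure.
    assert offsets.numel() <= self.helix_position_offsets_cpu.size(0)
    assert inactive.numel() <= self.helix_is_inactive_rank_cpu.size(0)

    # Fill the pinned host buffers, then refresh the used prefix of the
    # device-side buffers without reallocating them.
    self.helix_position_offsets_cpu[:offsets.numel()].copy_(offsets)
    self.helix_is_inactive_rank_cpu[:inactive.numel()].copy_(inactive)
    self.helix_position_offsets[:offsets.numel()].copy_(
        self.helix_position_offsets_cpu[:offsets.numel()], non_blocking=True)
    self.helix_is_inactive_rank[:inactive.numel()].copy_(
        self.helix_is_inactive_rank_cpu[:inactive.numel()], non_blocking=True)
```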
tests/unittest/_torch/modules/test_mla_helix.py (1)
462-470: MLA Helix test wiring matches new enable_helix_test/update_helix_param API

- Using `enable_helix_test=True` when constructing MLA in both distributed and reference flows keeps Helix-specific adjustments (e.g., RMSNorm eps and output shaping) confined to test scenarios.
- For generation step 0, creating metadata with `enable_helix=True` and calling `attn_metadata.update_helix_param(helix_is_inactive_rank=..., helix_position_offsets=position_ids_gen)` correctly drives the new static-buffer API.
- Subsequent steps directly reassign `kv_cache_params` and `helix_is_inactive_rank`. That's fine for this test because the Helix mask is constant across steps; in production code, it would be better to rely on `update_helix_param` to keep CUDA-graph compatibility when masks change over time.

Overall, the test changes look consistent with the backend refactor and should give good coverage of the new Helix+CUDAGraph flow.
Also applies to: 531-536, 684-700
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (7)
- tensorrt_llm/_torch/attention_backend/trtllm.py (7 hunks)
- tensorrt_llm/_torch/modules/attention.py (10 hunks)
- tensorrt_llm/_torch/pyexecutor/llm_request.py (1 hunks)
- tensorrt_llm/_torch/pyexecutor/model_engine.py (5 hunks)
- tensorrt_llm/_torch/pyexecutor/resource_manager.py (1 hunks)
- tests/integration/defs/accuracy/test_disaggregated_serving.py (2 hunks)
- tests/unittest/_torch/modules/test_mla_helix.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: Code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces. Do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used
Python files should use snake_case naming: `some_file.py`
Python classes should use PascalCase naming: `class SomeClass`
Python functions and methods should use snake_case naming: `def my_awesome_function():`
Python local variables should use snake_case naming: `my_variable = ...`
Python variable names that start with a number should be prefixed with 'k': `k_99th_percentile = ...`
Python global variables should use upper snake_case with prefix 'G': `G_MY_GLOBAL = ...`
Python constants should use upper snake_case naming: `MY_CONSTANT = ...`
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings in Python for classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except to the smallest set of errors possible
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible, using the else block for logic
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/integration/defs/accuracy/test_disaggregated_serving.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/modules/attention.py
- tests/unittest/_torch/modules/test_mla_helix.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
**/*.{cpp,h,cu,cuh,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code should contain an NVIDIA copyright header that includes the year of its latest meaningful modification
Files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tests/integration/defs/accuracy/test_disaggregated_serving.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/modules/attention.py
- tests/unittest/_torch/modules/test_mla_helix.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
🧠 Learnings (10)
📚 Learning: 2025-12-12T03:27:18.859Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:18.859Z
Learning: In tensorrt_llm/_torch/pyexecutor/sampler.py, when reviewing code that iterates through requests, ensure it does not convert excessive data into Python lists. Instead, the code should use torch.gather or indexing to gather only the data that will be used in the for loop before converting to Python lists. This minimizes data movement and improves performance.
Applied to files:
tensorrt_llm/_torch/pyexecutor/resource_manager.py
📚 Learning: 2025-08-19T12:45:11.997Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 7033
File: tensorrt_llm/_torch/pyexecutor/model_engine.py:0-0
Timestamp: 2025-08-19T12:45:11.997Z
Learning: In tensorrt_llm/_torch/pyexecutor/model_engine.py, DoRA (Delta Orthogonal Rank Adaptation) functionality was removed from the PyTorch flow to eliminate issues with inverted DoRA detection logic. The original is_dora condition was checking if scaling_vec_pointer == 0, which was potentially incorrect.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/attention_backend/trtllm.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-12-12T03:27:08.565Z
Learnt from: tongyuantongyu
Repo: NVIDIA/TensorRT-LLM PR: 9655
File: tensorrt_llm/_torch/pyexecutor/sampler.py:3031-3031
Timestamp: 2025-12-12T03:27:08.565Z
Learning: In files under tensorrt_llm/_torch/pyexecutor, avoid accessing torch.Tensor objects inside for-loops when iterating over requests. Convert batched tensors to Python lists beforehand using tensor.tolist(), and then iterate over those lists. This improves performance by reducing tensor-bound operations inside hot loops. Apply this pattern to similar code paths that process batches to access simple Python data structures (lists) inside loops.
Applied to files:
- tensorrt_llm/_torch/pyexecutor/resource_manager.py
- tensorrt_llm/_torch/pyexecutor/llm_request.py
- tensorrt_llm/_torch/pyexecutor/model_engine.py
📚 Learning: 2025-08-26T09:37:10.463Z
Learnt from: jiaganc
Repo: NVIDIA/TensorRT-LLM PR: 7031
File: tensorrt_llm/bench/dataclasses/configuration.py:90-104
Timestamp: 2025-08-26T09:37:10.463Z
Learning: In TensorRT-LLM's bench configuration, the `get_pytorch_perf_config()` method returns `self.pytorch_config` which is a Dict[str, Any] that can contain default values including `cuda_graph_config`, making the fallback `llm_args["cuda_graph_config"]` safe to use.
Applied to files:
tests/integration/defs/accuracy/test_disaggregated_serving.py
📚 Learning: 2025-08-14T15:43:23.107Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: tensorrt_llm/_torch/attention_backend/trtllm.py:259-262
Timestamp: 2025-08-14T15:43:23.107Z
Learning: In TensorRT-LLM's attention backend, tensor parameters in the plan() method are assigned directly without validation (dtype, device, contiguity checks). This maintains consistency across all tensor inputs and follows the pattern of trusting callers to provide correctly formatted tensors.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-15T06:46:53.813Z
Learnt from: eopXD
Repo: NVIDIA/TensorRT-LLM PR: 6767
File: cpp/tensorrt_llm/batch_manager/kvCacheManager.cpp:0-0
Timestamp: 2025-08-15T06:46:53.813Z
Learning: In the TensorRT-LLM KV cache manager, SWA (Sliding Window Attention) combined with beam search is currently in a broken/non-functional state and is planned for future rework. During preparatory refactoring phases, code related to SWA+beam search may intentionally remain in a non-working state until the broader rework is completed.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-08-14T15:38:01.771Z
Learnt from: MatthiasKohl
Repo: NVIDIA/TensorRT-LLM PR: 6904
File: cpp/tensorrt_llm/pybind/thop/bindings.cpp:55-57
Timestamp: 2025-08-14T15:38:01.771Z
Learning: In TensorRT-LLM Python bindings, tensor parameter collections like mla_tensor_params and spec_decoding_tensor_params are kept as required parameters without defaults to maintain API consistency, even when it might affect backward compatibility.
Applied to files:
tensorrt_llm/_torch/attention_backend/trtllm.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation.
Applied to files:
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-09-29T15:14:28.503Z
Learnt from: amitz-nv
Repo: NVIDIA/TensorRT-LLM PR: 8063
File: tensorrt_llm/lora_manager.py:1080-1112
Timestamp: 2025-09-29T15:14:28.503Z
Learning: In tensorrt_llm/lora_manager.py, when calculating part_sizes for attn_qkv fused LoRA modules, the sizes are correctly multiplied by tp_size because model_config.num_heads and model_config.num_kv_heads are already divided by tp_size (per-TP-rank values), so multiplication is needed to get the original full concatenated dimension size. The interleave_fused_lora_weights_for_tp function provides proper validation with asserts for total size and TP divisibility.
Applied to files:
tensorrt_llm/_torch/modules/attention.py
📚 Learning: 2025-08-26T06:07:02.166Z
Learnt from: shaharmor98
Repo: NVIDIA/TensorRT-LLM PR: 7231
File: tensorrt_llm/_torch/pyexecutor/_util.py:504-509
Timestamp: 2025-08-26T06:07:02.166Z
Learning: In tensorrt_llm/_torch/pyexecutor/_util.py, when calling model_engine.set_lora_model_config(), pass model_binding_config.mlp_hidden_size directly without multiplying by mapping.tp_size, as the mlp_hidden_size from get_bindings_model_config() is already the per-TP rank value needed for LoRA weight packaging.
Applied to files:
tensorrt_llm/_torch/pyexecutor/model_engine.py
🧬 Code graph analysis (5)
tensorrt_llm/_torch/pyexecutor/resource_manager.py (4)
- tensorrt_llm/runtime/model_runner.py (1): mapping (825-826)
- tensorrt_llm/mapping.py (2): has_cp_helix (245-247), cp_rank (561-562)
- tensorrt_llm/_torch/distributed/communicator.py (3): has_cp_helix (104-105), cp_size (56-57), cp_rank (68-69)
- tensorrt_llm/_torch/device_mesh.py (1): cp_rank (84-86)

tensorrt_llm/_torch/attention_backend/trtllm.py (3)
- cpp/tensorrt_llm/kernels/mlaKernels.h (1): helix_is_inactive_rank (114-115)
- tensorrt_llm/_torch/attention_backend/interface.py (1): get_empty (363-390)
- tensorrt_llm/_torch/attention_backend/flashinfer.py (1): cached_token_lens (116-118)

tensorrt_llm/_torch/modules/attention.py (3)
- tensorrt_llm/_torch/models/checkpoints/base_weight_mapper.py (1): mapping (162-163)
- tensorrt_llm/mapping.py (1): has_cp_helix (245-247)
- tensorrt_llm/_torch/distributed/communicator.py (1): has_cp_helix (104-105)

tests/unittest/_torch/modules/test_mla_helix.py (2)
- tensorrt_llm/_torch/attention_backend/trtllm.py (1): update_helix_param (865-889)
- cpp/tensorrt_llm/kernels/mlaKernels.h (2): helix_is_inactive_rank (114-115), helix_position_offsets (110-136)

tensorrt_llm/_torch/pyexecutor/model_engine.py (2)
- tensorrt_llm/_torch/distributed/communicator.py (3): cp_config (108-109), has_cp_helix (104-105), cp_size (56-57)
- tensorrt_llm/mapping.py (2): CpType (25-33), has_cp_helix (245-247)
🔇 Additional comments (5)
tensorrt_llm/_torch/attention_backend/trtllm.py (1)
86-87: Helix tensor plumb-through into TRTLLM attention/MLA looks consistent

The addition of `helix_position_offsets` / `helix_is_inactive_rank` to the wrapper, their inclusion in `mla_tensor_params`, and their sourcing from `TrtllmAttentionMetadata` line up with the MLA kernel interface (both for the main attention op and `mla_rope_generation`). I don't see issues with argument ordering or lifetimes here, assuming metadata is updated via `update_helix_param` before `plan()` / `mla_rope_generation` are called.

Also applies to: 211-213, 301-302, 483-484, 1652-1653, 1932-1934
tests/integration/defs/accuracy/test_disaggregated_serving.py (1)
835-851: Helix cudagraph parametrization for DeepSeekV3Lite looks correct

The new `cuda_graph_config` parametrization (none / without_padding / with_padding) and wiring into `gen_server_config` cleanly exercises the three intended modes without altering the rest of the test setup. This should help catch Helix–CUDA graph regressions across padding configurations. A representative sketch of such a parametrization is shown below.

Also applies to: 881-881
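An illustrative sketch of the three modes; the test name, ids, and the `enable_padding` key are assumptions and may differ from the actual parametrization in `test_disaggregated_serving.py`.

```python
import pytest


@pytest.mark.parametrize(
    "cuda_graph_config",
    [None, {"enable_padding": False}, {"enable_padding": True}],
    ids=["none", "without_padding", "with_padding"],
)
def test_helix_cudagraph_modes(cuda_graph_config):
    # Hypothetical wiring: only the generation server picks up the cudagraph
    # setting, mirroring how gen_server_config is built in the real test.
    gen_server_config = {"cuda_graph_config": cuda_graph_config}
    assert gen_server_config["cuda_graph_config"] == cuda_graph_config
```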
tests/unittest/_torch/modules/test_mla_helix.py (1)
83-83: Scenario tokens_per_block reduction to 32 aligns KV cache setup with Helix tests

Changing `kv_cache_tokens_per_block` to 32 and using it consistently for `max_tokens` and `tokens_per_block` in `KVCacheManager` keeps the KV cache sizing logic coherent with the Helix configurations used elsewhere (e.g., the disaggregated DeepSeekV3Lite test). The arithmetic for `max_tokens` still gives a safe upper bound for all configured `ctx_len` / `batch` pairs.

Also applies to: 167-193
tensorrt_llm/_torch/modules/attention.py (2)
703-705: enable_helix_test flag is well-scoped and preserves default behavior

Introducing `enable_helix_test` as a named flag (replacing the old unit-test knob) and using it only to:

- select a safe default `rms_norm_eps` when `config.pretrained_config` lacks the attribute, and
- gate Helix-specific test wiring,

keeps the core MLA behavior unchanged for regular configs. The assertions around CP type and mapping remain intact, so non-Helix and non-test scenarios shouldn't be impacted.
Also applies to: 749-816
1125-1134: Helix test output shaping and context helix_position_offsets hook are subtle but consistent

- In `create_output`, expanding the last dimension by `self.mapping.cp_size` only when `enable_helix_test and num_contexts > 0` gives the Helix test harness enough room to store per-CP-rank context activations without affecting non-Helix runs.
- Setting `attn_metadata.helix_position_offsets = position_ids` under `enable_helix_test` in `forward_context_default` is a targeted test hook to ensure Helix math matches the reference; it's gated away from normal execution.
- The final slice in `forward` back down to `num_heads_tp_cp * v_head_dim` when `enable_helix_test` and `mapping.has_cp_helix()` restores the shape expected by `o_proj` and mirrors the previous enable_unit_test logic.

Given the tight gating, the extra shape gymnastics should remain invisible outside the Helix test path. A shape-only sketch of this round trip follows the "Also applies" note.
Also applies to: 1372-1377, 2153-2159
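A shape-only illustration of the widen-then-slice behavior described above; the function is hypothetical and not the module's actual code.

```python
import torch


def helix_test_output_roundtrip(num_tokens: int, num_heads_tp_cp: int,
                                v_head_dim: int, cp_size: int) -> torch.Tensor:
    """Illustrative only: widen the output for per-CP-rank context activations,
    then slice back to the width o_proj expects."""
    # create_output under enable_helix_test and num_contexts > 0:
    out = torch.zeros(num_tokens, num_heads_tp_cp * v_head_dim * cp_size)
    # ... the real module writes attention results into `out` here ...
    # Final slice in forward() under enable_helix_test + has_cp_helix():
    return out[..., :num_heads_tp_cp * v_head_dim]
```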
Force-pushed from 8fbbde2 to 94112ab (Compare)
/bot run --disable-fail-fast
thorjohnsen left a comment
LGTM
PR_Github #29050 [ run ] triggered by Bot. Commit:

PR_Github #29050 [ run ] completed with state
Force-pushed from 35c1cc8 to 6d8b2c0 (Compare)
/bot run --disable-fail-fast

PR_Github #29182 [ run ] triggered by Bot. Commit:

PR_Github #29182 [ run ] completed with state
Force-pushed from 2933e93 to 9c76375 (Compare)
Signed-off-by: Balaram Buddharaju <[email protected]>
Force-pushed from 4dc0470 to 1e7acc6 (Compare)
/bot run --disable-fail-fast
Description
Currently, cuda graph functionality for helix parallelism is broken, resulting in sporadically poor accuracy in CI. This MR makes the following changes:

- Routes `helix_position_offsets` and `helix_is_inactive_rank` through static attention-metadata buffers whether cudagraph is enabled or not.

Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.
Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

- `--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
- `--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
- `--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.
- `--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
- `--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
- `--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
- `--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
- `--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
- `--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
- `--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
- `--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
- `--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
- `--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
Release Notes
New Features
Refactor