Releases: NVIDIA/TensorRT-LLM

v1.1.0rc2.post1 (Pre-release)

06 Sep 00:06 · 9d6e87a

Announcement Highlights:

  • API
    • Update TargetInfo to accommodate CP in disagg (#7224)
  • Benchmark
    • Minor fixes to slurm and benchmark scripts (#7453)
  • Feature
    • Support DeepGEMM swap-AB on sm100 (#7355)
    • Merge add sparse exp and shared exp into local re… (#7422)
    • Add batch waiting when scheduling (#7287)
    • Reuse pytorch memory segments occupied by cudagraph pool (#7457); see the CUDA-graph pool sketch after this list
    • Complete the last missing allreduce op in Llama3/4 (#7420)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • store blog 10 media via lfs (#7375)
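
The CUDA-graph pool reuse item above (#7457) concerns TensorRT-LLM's allocator internals. As a rough, hedged illustration of the underlying PyTorch mechanism, the sketch below captures two CUDA graphs against one shared memory pool; the shapes and ops are illustrative, a CUDA device and a recent PyTorch are assumed, and this is not the change in #7457 itself.

```python
import torch

assert torch.cuda.is_available(), "CUDA graph capture requires a GPU"

static_in = torch.randn(8, 1024, device="cuda")

# Warm up on a side stream, as recommended before CUDA graph capture.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        _ = static_in.relu() * 2.0
torch.cuda.current_stream().wait_stream(s)

pool = torch.cuda.graph_pool_handle()     # one allocator pool shared by both captures
graphs = []
for _ in range(2):
    g = torch.cuda.CUDAGraph()
    with torch.cuda.graph(g, pool=pool):  # allocations land in the shared pool
        out = static_in.relu() * 2.0
    graphs.append((g, out))

static_in.normal_()                       # refresh the captured input in place
for g, out in graphs:                     # replay in capture order when sharing a pool
    g.replay()
    print(out.sum().item())
```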

What's Changed

  • [None][doc] Exposing the ADP balance strategy tech blog by @juney-nvidia in #7380
  • [None][feat] Update TargetInfo to accommodate CP in disagg by @brb-nv in #7224
  • [None][docs] Update Dynasor paper info by @AndyDai-nv in #7137
  • [None] [fix] store blog 10 media via lfs by @Funatiq in #7375
  • [TRTLLM-7250][fix] Add failed cases into waives.txt by @xinhe-nv in #7342
  • [None][chore] bump version to 1.1.0rc2.post1 by @litaotju in #7396
  • [TRTLLM-6747][feat] Merge add sparse exp and shared exp into local re… by @zongfeijing in #7422
  • [None] [fix] Fix nsys in slurm scripts by @kaiyux in #7409
  • [None][feat] Support DeepGEMM swap-AB on sm100 by @Barry-Delaney in #7355
  • [None] [fix] Minor fixes to slurm and benchmark scripts by @kaiyux in #7453
  • [None][fix] Fix possible mpi broadcast and gather issue on large object by @dongxuy04 in #7507
  • [TRTLLM-7008][fix] Add automatic shared memory delete if already exist by @dongxuy04 in #7377
  • [None][ci] Cherry-pick some improvements for Slurm CI setup from main branch by @chzblych in #7479
  • [https://nvbugs/5481434][feat] Reuse pytorch memory segments occupied by cudagraph pool by @HuiGao-NV in #7457
  • [None][fix] Update DG side branch name by @Barry-Delaney in #7491
  • [None][fix] Update DG commit by @Barry-Delaney in #7534
  • [None][fix] Fix a typo in the Slurm CI codes (#7485) by @chzblych in #7538
  • [https://nvbugs/5488582][fix] Avoid unexpected Triton recompilation in DG fused_moe. by @hyukn in #7495
  • [None][fix] Cherry-pick 6850: Complete the last missing allreduce op in Llama3/4. by @hyukn in #7420
  • [None][opt] Add batch waiting when scheduling by @yunruis in #7287
  • [https://nvbugs/5485325][fix] Add a postprocess to the model engine to fix the CUDA graph warmup issue when using speculative decoding by @lfr-0531 in #7373
  • [None][fix] Cherry-Pick MNNVLAllreduce Fixes into release/1.1.0rc2 branch by @timlee0212 in #7487

Full Changelog: v1.1.0rc2...v1.1.0rc2.post1

v1.1.0rc3 (Pre-release)

04 Sep 08:24 · e81c50d

Announcement Highlights:

  • Model Support
    • Add fp8 support for Mistral Small 3.1 (#6731)
  • Benchmark
    • add benchmark TRT flow test for MIG (#6884)
    • Mistral Small 3.1 accuracy tests (#6909)
  • Feature
    • Update TargetInfo to accommodate CP in disagg (#7224)
    • Merge add sparse exp and shared exp into local reduction (#7369)
    • Support NVFP4 KV Cache (#6244)
    • Allocate MoE workspace only when necessary (release/1.0 retargeted) (#6955)
    • Implement capturable drafting loops for speculation (#7100)
    • Revert phi4-mm aggregate mode (#6907)
    • Complete the last missing allreduce op in Llama3/4. (#6850)
  • Documentation
    • Exposing the ADP balance strategy tech blog (#7380)
    • Update Dynasor paper info (#7137)
    • Add docs for Gemma3 VLMs (#6880)
    • add legacy section for tensorrt engine (#6724)
    • Update DeepSeek example doc (#7358)

Full Changelog: v1.1.0rc2...v1.1.0rc3

v1.1.0rc2 (Pre-release)

31 Aug 02:22 · 15ec2b8

Announcement Highlights:

  • Model Support

    • Refactor llama4 for multimodal encoder IFB (#6844)
  • API

    • Add standalone multimodal encoder (#6743)
    • Enable Cross-Attention to use XQA kernels for Whisper (#7035)
    • Enable nanobind as the default binding library (#6608)
    • trtllm-serve + autodeploy integration (#7141)
    • Chat completions API for gpt-oss (#7261)
    • KV Cache Connector API (#7228)
    • Create PyExecutor from TorchLlmArgs Part 1 (#7105)
    • TP Sharding read from the model config (#6972)
  • Benchmark

    • add llama4 tp4 tests (#6989)
    • add test_multi_nodes_eval tests (#7108)
    • nsys profile output kernel classifier (#7020)
    • add kv cache size in bench metric and fix failed cases (#7160)
    • add perf metrics endpoint to openai server and openai disagg server (#6985)
    • add gpt-oss tests to sanity list (#7158)
    • add l20 specific qa test list (#7067)
    • Add beam search CudaGraph + Overlap Scheduler tests (#7326)
    • Update qwen3 timeout to 60 minutes (#7200)
    • Update maxnt of llama_v3.2_1b bench (#7279)
    • Improve performance of PyTorchModelEngine._get_lora_params_from_requests (#7033)
    • Accelerate global scale calculations for deepEP fp4 combine (#7126)
    • Remove and fuse some element-wise ops in the ds-r1-fp8 model (#7238)
    • Balance the request based on number of tokens in AttentionDP (#7183)
    • Wrap the swiglu into custom op to avoid redundant device copy (#7021); see the custom-op sketch after this list
  • Feature

    • Add QWQ-32b torch test (#7284)
    • Fix llama4 multimodal by skipping request validation (#6957)
    • Add group attention pattern for solar-pro-preview (#7054)
    • Add Mistral Small 3.1 multimodal in Triton Backend (#6714)
    • Update lora for phi4-mm (#6817)
    • refactor the CUDA graph runner to manage all CUDA graphs (#6846)
    • Enable chunked prefill for Nemotron-H (#6334)
    • Add customized default routing method (#6818)
    • Testing cache transmission functionality in Python (#7025)
    • Simplify decoder state initialization for speculative decoding (#6869)
    • Support MMMU for multimodal models (#6828)
    • Deepseek: Start Eagle work (#6210)
    • Optimize and refactor alltoall in WideEP (#6973)
    • Apply AutoTuner to fp8_block_scale_deep_gemm to trigger JIT ahead of time (#7113)
    • Hopper Fp8 context mla (#7116)
    • Padding for piecewise cudagraph (#6750)
    • Add low precision all2all for mnnvl (#7155)
    • Use numa to bind CPU (#7304)
    • Skip prefetching consolidated safetensors when appropriate (#7013)
    • Unify sampler handle logits implementation (#6867)
    • Move fusion, kvcache, and compile to modular inference optimizer (#7057)
    • Make finalize fusion part of the tactic selection logic (#6915)
    • Fuse slicing into MoE (#6728)
    • Add logging for OAI disagg server (#7232)
  • Documentation

    • Update gpt-oss deployment guide to latest release image (#7101)
    • update stale link for AutoDeploy (#7135)
    • Add GPT-OSS Deployment Guide into official doc site (#7143)
    • Refine GPT-OSS doc (#7180)
    • update feature_combination_matrix doc (#6691)
    • update disagg doc about UCX_MAX_RNDV_RAILS (#7205)
    • Display tech blog for nvidia.github.io domain (#7241)
    • Updated blog9_Deploying_GPT_OSS_on_TRTLLM (#7260)
    • Update autodeploy README.md, deprecate lm_eval in examples folder (#7233)
    • add adp balance blog (#7213)
    • fix doc formula (#7367)
    • update disagg readme and scripts for pipeline parallelism (#6875)
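
The swiglu custom-op item in the list above (#7021) refers to TensorRT-LLM internals. As a hedged sketch of the general technique, the code below wraps a plain SwiGLU computation as a PyTorch custom op so callers and compilers see one opaque operator; the op name `trtllm_sketch::swiglu` and the eager implementation are assumptions for illustration (PyTorch >= 2.4), not the library's actual kernel.

```python
import torch
import torch.nn.functional as F


@torch.library.custom_op("trtllm_sketch::swiglu", mutates_args=())
def swiglu(x: torch.Tensor) -> torch.Tensor:
    # SwiGLU: split the last dim into (gate, up) halves and compute silu(gate) * up.
    gate, up = x.chunk(2, dim=-1)
    return F.silu(gate) * up


@swiglu.register_fake
def _(x: torch.Tensor) -> torch.Tensor:
    # Shape/dtype-only implementation used by tracing and torch.compile.
    return x.new_empty(*x.shape[:-1], x.shape[-1] // 2)


print(swiglu(torch.randn(4, 16)).shape)  # torch.Size([4, 8])
```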

What's Changed

  • [None][fix] Fix assertion errors of quantization when using online EPLB by @jinyangyuan-nvidia in #6922
  • [None][autodeploy] Add group attention pattern that supports attention masks by @Fridah-nv in #7054
  • [None][chore] unwaive test_disaggregated_genbs1 by @bo-nv in #6944
  • [None][fix] fix llmapi import error by @crazydemo in #7030
  • [TRTLLM-7326][feat] Add standalone multimodal encoder by @chang-l in #6743
  • [None][infra] update feature_combination_matrix of disaggregated and chunked prefill by @leslie-fang25 in #6661
  • [TRTLLM-7205][feat] add llama4 tp4 tests by @xinhe-nv in #6989
  • [None][infra] "[TRTLLM-6960][fix] enable scaled_mm tests (#6936)" by @Tabrizian in #7059
  • [TRTLLM-6341][chore] Preliminary refactors on the kv cache manager before supporting swa kv cache reuse by @eopXD in #6767
  • [None][fix] fix scaffolding dynasor test by @dc3671 in #7070
  • [None][chore] Update namelist in blossom-ci by @karljang in #7015
  • [None][ci] move unittests to sub-directories by @Funatiq in #6635
  • [None][infra] Waive failed tests on main branch 8/20 by @EmmaQiaoCh in #7092
  • [None][fix] Fix W4A8 MoE kernel issue by @yuhyao in #7072
  • [TRTLLM-7348] [feat] Enable Cross-Attention to use XQA kernels for Whisper by @DomBrown in #7035
  • [None][chore] Only check the bindings lib for current build by @liji-nv in #7026
  • [None][ci] move some tests of b200 to post merge by @QiJune in #7093
  • [https://nvbugs/5457489][fix] unwaive some tests by @byshiue in #6991
  • [TRTLLM-6771][feat] Support MMMU for multimodal models by @yechank-nvidia in #6828
  • [None][fix] Fix llama4 multimodal by skipping request validation by @chang-l in #6957
  • [None][infra] Upgrade UCX to v1.19.x and NIXL to 0.5.0 by @BatshevaBlack in #7024
  • [None][fix] update accelerate dependency to 1.7+ for AutoDeploy by @Fridah-nv in #7077
  • [None][fix] Fix const modifier inconsistency in log function declaration/implementation by @Fan-Yunfan in #6679
  • [None][chore] waive failed cases on H100 by @xinhe-nv in #7084
  • [None][fix] Use safeInitRowMax instead of fp32_lowest to avoid NaN by @lowsfer in #7087
  • [https://nvbugs/5443039][fix] Fix AutoDeploy pattern matcher for torch 2.8 by @Fridah-nv in #7076
  • [https://nvbugs/5437405][fix] qwen3 235b eagle3 ci by @byshiue in #7000
  • [None][doc] Update gpt-oss deployment guide to latest release image by @farshadghodsian in #7101
  • [https://nvbugs/5392414] [fix] Add customized default routing method by @ChristinaZ in #6818
  • [https://nvbugs/5453827][fix] Fix RPATH of th_common shared library to find pip-installed NCCL by @tongyuantongyu in #6984
  • [None][chore] No-op changes to support context parallelism in disaggregated serving later by @brb-nv in #7063
  • [https://nvbugs/5394409][feat] Support Mistral Small 3.1 multimodal in Triton Backend by @dbari in #6714
  • [None][infra] Waive failed case for main branch 08/21 by @EmmaQiaoCh in #7129
  • [#4403][refactor] Move fusion, kvcache, and compile to modular inference optimizer by @Fridah-nv in #7057
  • [None][perf] Make finalize fusion part of the tactic selection logic by @djns99 in #6915
  • [None][chore] Mass integration of release/1.0 by @dominicshanshan in #6864
  • [None][docs] update stale link for AutoDeploy by @suyoggupta in #7135
  • [TRTLLM-6825][fix] Update lora for phi4-mm by @Wanli-Jiang in #6817
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #7109
  • [None][fix] Fix mm_placholder_counts extraction issue. by @hyukn in #7118
  • [TRTLLM-7155][feat] Unify sampler handle logits implementation. by @dcampora in #6867
  • [TRTLLM-5801][infra] Add more RTX Pro 6000 test stages by @EmmaQiaoCh in #5126
  • [None][feat] Enable nanobind as the default binding library by @Linda-Stadter in #6608
  • [TRTLLM-7321][doc] Add GPT-OSS Deployment Guide into official doc site by @dongfengy in #7143
  • [TRTLLM-7245][feat] add test_multi_nodes_eval tests by @xinhe-nv in #7108
  • [None][ci] move all B200 TensorRT test cases to post merge by @QiJune in #7165
  • [None][chore] Bump version to 1.1.0rc2 by @yiqingy0 in #7167
  • [#7136][feat] trtllm-serve + autodeploy integration by @suyoggupta in #7141
  • [TRTLLM-4921][feat] Enable chunked prefill for Nemotron-H by @tomeras91 in #6334
  • [None][refactor] Simplify decoder state initialization for speculative decoding by @Funatiq in #6869
  • [None][feat] Deepseek: Start Eag...

v1.1.0rc1 (Pre-release)

22 Aug 10:02 · 7334f93

Announcement Highlights:

  • Model Support

    • Add Tencent HunYuanMoEV1 model support (#5521)
    • Support Yarn on Qwen3 (#6785)
  • API

    • BREAKING CHANGE: Introduce sampler_type, detect sampler according to options (#6831)
    • Introduce sampler options in trtllm bench (#6855)
    • Support accurate device iter time (#6906)
    • Add batch wait timeout in fetching requests (#6923)
  • Benchmark

    • Add accuracy evaluation for AutoDeploy (#6764)
    • Add accuracy test for context and generation workers with different models (#6741)
    • Add DeepSeek-R1 FP8 accuracy tests on Blackwell (#6710)
    • Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] (#6939)
    • Add NIM Related Cases Part 1 (#6684)
  • Feature

    • Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow (#6629)
    • Add single block version renormalized routing kernel (#6756)
    • Use Separate QKV Input Layout for Context MLA (#6538)
    • Enable accuracy test for MTP and chunked prefill (#6314)
  • Documentation

    • Update gpt-oss doc on MoE support matrix (#6908)
    • Modify the description for MLA chunked context (#6929)
    • Update wide-ep doc (#6933)
    • Update gpt oss doc (#6954)
    • Add more documents for large scale EP (#7029)
    • Add documentation for relaxed test threshold (#6997)

What's Changed

  • [https://nvbugs/5455651][fix] Make ngram use XQA attention on Blackwell by @mikeiovine in #6873
  • [https://nvbugs/5441714][chore] remove skip on disagg n-gram test by @raayandhar in #6872
  • [None] [feat] Add Tencent HunYuanMoEV1 model support by @qianbiaoxiang in #5521
  • [None][chore] Add tests for non-existent and completed request cancellation by @achartier in #6840
  • [None][doc] Update gpt-oss doc on MoE support matrix by @hlu1 in #6908
  • [https://nvbugs/5394685][fix] using static scheduler 2CTA MLA as WAR for an accuracy issue by @PerkzZheng in #6896
  • [https://nvbugs/5437106][fix] Add L4 Scout benchmarking WAR option in deploy guide by @JunyiXu-nv in #6829
  • [None][fix] Fix the issue of responsibility boundary between the assert and tllmException files by @Fan-Yunfan in #6723
  • [None][fix] Correct reporting of torch_dtype for ModelConfig class. by @FrankD412 in #6800
  • [None][fix] Fix perfect router. by @bobboli in #6797
  • [https://nvbugs/5415862][fix] Update cublas as 12.9.1 and cuda memory alignment as 256 by @Wanli-Jiang in #6501
  • [None][fix] Update tests to use standardized uppercase backend identifiers by @bo-nv in #6921
  • [TRTLLM-7141][infra] Use repo mirrors to avoid intermittent network failures by @chzblych in #6836
  • [None][doc] Modify the description for mla chunked context by @jmydurant in #6929
  • [None][chore] Add failed cases into waives.txt by @xinhe-nv in #6914
  • [None][chore] add a EditorConfig config by @zhenhuaw-me in #6897
  • [https://nvbugs/5451373][fix] : Fix the accuracy issue when using FP8 context MLA by @peaceh-nv in #6881
  • [https://nvbugs/5405041][fix] Update wide-ep doc by @qiaoxj07 in #6933
  • [None] [chore] Mamba cache in separate file by @tomeras91 in #6796
  • [https://nvbugs/5427801][fix] Torch compile support for Llama4 and Ea… by @liji-nv in #6858
  • [https://nvbugs/5394685][fix] proper fix for the accuracy issue in 2CTA MLA kernels by @PerkzZheng in #6941
  • [https://nvbugs/5394392][fix] Enlarge scheduler capacity under disagg bs == 1 by @yifeizhang-c in #6537
  • [None][test] Add accuracy evaluation for AutoDeploy by @ajrasane in #6764
  • [None][fix] Make TP working for Triton MOE (in additional to EP we are using) by @dongfengy in #6722
  • [TRTLLM-5863][feat] Support MoE INT8 Weight-Only-Quantization in PyTorch Workflow by @Yuening-wa in #6629
  • [https://nvbugs/5401114][fix] Unwaive Gemma3 tests by @brb-nv in #6952
  • [None][chore] Bump version to 1.1.0rc1 by @yiqingy0 in #6953
  • [TRTLLM-7157][feat] BREAKING CHANGE Introduce sampler_type, detect sampler according to options by @dcampora in #6831
  • [None][fix] Skip Topk if 0 by @IzzyPutterman in #6934
  • [None][fix] Fix: Using RAII to automatically manage the allocation and release of va_list for potential resource leak by @Fan-Yunfan in #6758
  • [None][feat] Support Yarn on Qwen3 by @byshiue in #6785
  • [None][feat] Add single block version renormalized routing kernel by @ChristinaZ in #6756
  • [None][infra] Waive failed cases in main branch by @EmmaQiaoCh in #6951
  • [https://nvbugs/5390853][fix] Fix _test_openai_lora.py - disable cuda graph by @amitz-nv in #6965
  • [https://nvbugs/5451028][fix] Constrain NemotronSuper test parameters to prevent OOMs by @Naveassaf in #6970
  • [None][infra] update feature_combination_matrix of disaggregated and Eagle3 by @leslie-fang25 in #6945
  • [None][doc] Update gpt oss doc by @bobboli in #6954
  • [None] [feat] Support accurate device iter time by @kaiyux in #6906
  • [TRTLLM-7030][fix] uppercase def value in pd-config by @Shixiaowei02 in #6981
  • [None] [fix] Fix the macro name by @ChristinaZ in #6983
  • [None][infra] Waive failed tests on main 0818 by @EmmaQiaoCh in #6992
  • [None][chore] Remove duplicate test waives by @yiqingy0 in #6998
  • [None][fix] Clean up linking to CUDA stub libraries in build_wheel.py by @MartinMarciniszyn in #6823
  • [None][infra] Cherry-pick #6836 from main branch and improve SSH connection (#6971) by @chzblych in #7005
  • [TRTLLM-7158][feat] Introduce sampler options in trtllm bench by @dcampora in #6855
  • [None][infra] Enable accuracy test for mtp and chunked prefill by @leslie-fang25 in #6314
  • [None][autodeploy] Doc: fix link path in trtllm bench doc by @Fridah-nv in #7007
  • [https://nvbugs/5371480][fix] Enable test_phi3_small_8k by @Wanli-Jiang in #6938
  • [TRTLLM-7014][chore] Add accuracy test for ctx and gen workers with different models by @reasonsolo in #6741
  • [None][refactor] Refactor Torch Compile Backend, MoeLoadBalancer and warmup Logic by @yizhang-nv in #6615
  • [None] [infra] stricter coderabbit pr title generation instructions by @venkywonka in #6918
  • [TRTLLM-6960][fix] enable scaled_mm tests by @dc3671 in #6936
  • [TRTLLM-6991][chore] add DeepSeek-R1 FP8 accuracy tests on Blackwell by @lfr-0531 in #6710
  • [TRTLLM-6541][test] Add NIM Related Cases [StarCoder2_7B] and [Codestral_22B_V01] by @fredricz-20070104 in #6939
  • [https://nvbugs/5454875][ci] Unwaive Mistral Small 3.1 test by @2ez4bz in #7011
  • [TRTLLM-6541][test] Add NIM Related Cases Part 1 by @crazydemo in #6684
  • [https://nvbugs/5458798][fix] Relaxed test threshold, added documentation by @MrGeva in #6997
  • [None][opt] Add batch wait timeout in fetching requests by @Shunkangz in #6923
  • [None][chore] Remove closed bugs by @xinhe-nv in #6969
  • [None][fix] acceptance rate calculation fix in benchmark_serving by @zerollzeng in #6746
  • [None] [doc] Add more documents for large scale EP by @kaiyux in #7029
  • [None] [chore] Update wide-ep genonly scripts by @qiaoxj07 in #6995
  • [TRTLLM-7263][fix] Prevent recreation of cublas handles in lora_grouped_gemm every call by @amitz-nv in #6968
  • [https://nvbugs/5458874][fix] Fix Nemotron-H flaky CUDA graph / overlap scheduler test by @tomeras91 in #6996
  • [https://nvbugs/5455140][fix] unwaive DSR1-fp4 throughput_tp8 by @lfr-0531 in #7022
  • [None][chore] Remo...

v1.1.0rc0 (Pre-release)

16 Aug 00:09 · 26f413a

Announcement Highlights:

  • Model Support

    • Add model gpt-oss (#6645)
    • Support Aggregate mode for phi4-mm (#6184)
    • Add support for Eclairv2 model - cherry-pick changes and minor fix (#6493)
    • Support running heterogeneous model execution for Nemotron-H (#6866)
    • Add whisper support (Bert Attention on SM100 and GPTAttention for cross attention on SM100) (#5527)
  • API

    • BREAKING CHANGE Enable TRTLLM sampler by default (#6216)
  • Benchmark

    • Improve Llama4 performance for small max_seqlen cases (#6306)
    • Multimodal benchmark_serving support (#6622)
    • Add perf-sweep scripts (#6738)
  • Feature

    • Support LoRA reload CPU cache evicted adapter (#6510)
    • Add FP8 context MLA support for SM120 (#6059)
    • Enable guided decoding with speculative decoding (part 1: two-model engine) (#6300)
    • Include attention dp rank info with KV cache events (#6563)
    • Clean up ngram auto mode, add max_concurrency to configs (#6676)
    • Add NCCL Symmetric Integration for All Reduce (#4500)
    • Remove input_sf swizzle for module WideEPMoE (#6231)
    • Enable guided decoding with disagg serving (#6704)
    • Make fused_moe_cute_dsl work on blackwell (#6616)
    • Move kv cache measure into transfer session (#6633)
    • Optimize CUDA graph memory usage for spec decode cases (#6718)
    • Core Metrics Implementation (#5785)
    • Resolve KV cache divergence issue (#6628)
    • AutoDeploy: Optimize prepare_inputs (#6634)
    • Enable FP32 mamba ssm cache (#6574)
    • Support SharedTensor on MultimodalParams (#6254)
    • Improve dataloading for benchmark_dataset by using batch processing (#6548)
    • Store the block of context request into kv cache (#6683)
    • Add standardized GitHub issue templates and disable blank issues (#6494)
    • Improve the performance of online EPLB on Hopper by better overlapping (#6624)
    • Enable guided decoding with CUDA graph padding and draft model chunked prefill (#6774)
    • CUTLASS MoE FC2+Finalize fusion (#3294)
    • Add GPT OSS support for AutoDeploy (#6641)
    • Add LayerNorm module (#6625)
    • Support custom repo_dir for SLURM script (#6546)
    • DeepEP LL combine FP4 (#6822)
    • AutoTuner tuning config refactor and valid tactic generalization (#6545)
    • Hopper W4A8 MoE supports ModelOpt ckpt for PyT backend (#6200)
    • Add support for Hopper MLA chunked prefill (#6655)
    • Helix: extend mapping to support different CP types (#6816)
  • Documentation

    • Remove the outdated features which are marked as Experimental (#5995)
    • Add LoRA feature usage doc (#6603)
    • Add deployment guide section for VDR task (#6669)
    • Add doc for multimodal feature support matrix (#6619)
    • Move AutoDeploy README.md to torch docs (#6528)
    • Add checkpoint refactor docs (#6592)
    • Add K2 tool calling examples (#6667)
    • Add the workaround doc for H200 OOM (#6853)
    • Update moe support matrix for DS R1 (#6883)
    • BREAKING CHANGE: Mismatch between docs and actual commands (#6323)

What's Changed

  • Qwen3: Fix eagle hidden states by @IzzyPutterman in #6199
  • [None][fix] Upgrade dependencies version to avoid security vulnerability by @yibinl-nvidia in #6506
  • [None][chore] update readme for perf release test by @ruodil in #6664
  • [None][test] remove trt backend cases in release perf test and move NIM cases to llm_perf_nim.yml by @ruodil in #6662
  • [None][fix] Explicitly add tiktoken as required by kimi k2 by @pengbowang-nv in #6663
  • [None][doc]: remove the outdated features which marked as Experimental by @nv-guomingz in #5995
  • [https://nvbugs/5375966][chore] Unwaive test_disaggregated_deepseek_v3_lite_fp8_attention_dp_one by @yweng0828 in #6658
  • [TRTLLM-6892][infra] Run guardwords scan first in Release Check stage by @yiqingy0 in #6659
  • [None][chore] optimize kv cache transfer for context TEP and gen DEP by @chuangz0 in #6657
  • [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
  • [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
  • [None][test] correct test-db context for perf yaml file by @ruodil in #6686
  • [None] [feat] Add model gpt-oss by @hlu1 in #6645
  • [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
  • [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
  • [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
  • [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
  • [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
  • [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
  • [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
  • [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
  • [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
  • [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
  • [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
  • [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
  • [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
  • [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
  • [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
  • [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
  • [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
  • [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
  • [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
  • [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
  • [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
  • [None][test] fix yml condition error under qa folder by @ruodil in #6734
  • [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
  • [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
  • [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
  • [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
  • [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
  • [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
  • [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
  • [None][fix]revert kvcache transfer by @chuangz0 in #6709
  • [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
  • [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
  • [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
  • [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify examples mapping by @venkywonka in #6762
  • [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
  • [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
  • [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
  • [None][feat] Core Metrics Implementation by @hcyezhang in #5785
  • [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
  • [TRTLLM-6637][feat]...

v1.0.0rc6 (Pre-release)

07 Aug 10:54 · a16ba64

Announcement Highlights:

  • Model Support

  • Feature

    • Add LoRA support for Gemma3 (#6371)
    • Add support of scheduling attention dp request (#6246)
    • Multi-block mode for Hopper spec dec XQA kernel (#4416)
    • LLM sleep & wakeup Part 1: virtual device memory (#5034)
    • best_of/n for pytorch workflow (#5997)
    • Add speculative metrics for trt llm bench (#6476)
    • (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec (#6379)
    • Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 (#6522); see the pkl5 sketch after this list
    • check input tokens + improve error handling (#5170)
    • Add support for fused gate_up_proj scales for FP8 blockwise (#6496)
    • Add vLLM KV Pool support for XQA kernel (#6013)
    • Switch to internal version of MMProjector in Gemma3 (#6572)
    • Enable fp8 SwiGLU to minimize host overhead (#6540)
    • Add Qwen3 MoE support to TensorRT backend (#6470)
    • ucx establish connection with zmq (#6090)
    • Disable add special tokens for Llama3.3 70B (#6482)
  • API

  • Benchmark

    • ADP schedule balance optimization (#6061)
    • allreduce benchmark for torch (#6271)
  • Documentation

    • Make example SLURM scripts more parameterized (#6511)
    • blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) (#6547)
    • Exposing the latest tech blogs in README.md (#6553)
    • update known issues (#6247)
    • trtllm-serve doc improvement. (#5220)
    • Adding GPT-OSS Deployment Guide documentation (#6637)
    • Exposing the GPT OSS model support blog (#6647)
    • Add llama4 hybrid guide (#6640)
    • Add DeepSeek R1 deployment guide. (#6579)
    • Create deployment guide for Llama4 Scout FP8 and NVFP4 (#6550)
  • Known Issues

    • On bare-metal Ubuntu 22.04 or 24.04, install the cuda-python==12.9.1 package after installing the TensorRT-LLM wheel. This resolves an incompatibility with the default cuda-python 13, which fails with ImportError: cannot import name 'cuda' from 'cuda'; a quick sanity check is sketched directly below.
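
A minimal sanity check for the cuda-python pin described in the known issue above; the package name, import path, and remediation command are taken from the issue text, everything else is illustrative.

```python
import importlib.metadata

print("cuda-python:", importlib.metadata.version("cuda-python"))

try:
    from cuda import cuda  # the import path this release expects
except ImportError:
    raise SystemExit(
        "Incompatible cuda-python detected; reinstall with: pip install cuda-python==12.9.1"
    )
print("'from cuda import cuda' import OK")
```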

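For the mpi4py.util.pkl5 item in the Feature list above (#6522), the sketch below shows the general pkl5 pattern: wrapping the world communicator so pickled objects larger than 2 GiB can be broadcast. This is plain mpi4py usage under mpirun with an illustrative payload size, not TensorRT-LLM's own wiring.

```python
# Run with, e.g.: mpirun -n 2 python pkl5_bcast_sketch.py
from mpi4py import MPI
from mpi4py.util import pkl5

comm = pkl5.Intracomm(MPI.COMM_WORLD)  # drop-in wrapper that chunks oversized pickles

if comm.Get_rank() == 0:
    payload = {"blob": bytearray(3 * 1024**3)}  # ~3 GiB, beyond the plain 2 GiB limit
else:
    payload = None

payload = comm.bcast(payload, root=0)
print(f"rank {comm.Get_rank()} received {len(payload['blob'])} bytes")
```
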
What's Changed

  • [fix] Fix missing fields in xqa kernel cache key by @lowsfer in #6282
  • [TRTLLM-6364][infra] Validate for PR titles to ensure they follow the required format by @niukuo in #6278
  • [fix] Update get_trtllm_bench_build_command to handle batch size and tokens by @venkywonka in #6313
  • refactor: Remove unused buffers and bindings from sampler by @Funatiq in #6484
  • chore: Make example SLURM scripts more parameterized by @kaiyux in #6511
  • fix: Fix missing key by @zerollzeng in #6471
  • [https://nvbugs/5419066][fix] Use trt flow LLM by @crazydemo in #6467
  • [TRTLLM-4279] fix: Add a protection test for checking trtllm custom ops by @yali-arch in #6515
  • [https://nvbugs/5419069][fix] Fix the mismatched layer name components. by @hyukn in #6417
  • [None][doc] blog: Scaling Expert Parallelism in TensorRT-LLM (Part 2: Performance Status and Optimization) by @kaiyux in #6547
  • [None][chore] Disable add special tokens for Llama3.3 70B by @chenfeiz0326 in #6482
  • [None][doc] Exposing the latest tech blogs in README.md by @juney-nvidia in #6553
  • [None][fix] update nemotron nas tests free_gpu_memory_fraction=0.8 by @xinhe-nv in #6552
  • [None][infra] Pin the version for triton to 3.3.1 (#6508) (#6519) by @chzblych in #6549
  • [https://nvbugs/5340941][https://nvbugs/5375785] - fix: Wrap attentio… by @liji-nv in #6355
  • [TRTLLM-6657][feat] Add LoRA support for Gemma3 by @brb-nv in #6371
  • [https://nvbugs/5381276][fix] fix warning for fused_a_gemm by @yunruis in #6402
  • [None][Infra] - Skip failed tests in post-merge by @EmmaQiaoCh in #6558
  • [AutoDeploy] merge feat/ad-2025-07-22 by @lucaslie in #6520
  • [TRTLLM-6624][feat] skip post blackwell by @xinhe-nv in #6357
  • [TRTLLM-6357][test] Add accuracy tests for Qwen3 by @reasonsolo in #6177
  • [None][fix] Serialize the window_size in the kv event by @richardhuo-nv in #6526
  • [None][feat] Add support of scheduling attention dp request by @Shunkangz in #6246
  • [None][refactor] Simplify finish reasons handling in DecoderState by @Funatiq in #6524
  • [None][infra] add eagle3 one model accuracy tests by @jhaotingc in #6264
  • [TRTLLM-6224][infra] Upgrade dependencies to DLFW 25.06 and CUDA 12.9.1 by @yiqingy0 in #5678
  • use cudaSetDevice to create context, fix nvbug 5394497 by @chuangz0 in #6403
  • [None][feat] Multi-block mode for Hopper spec dec XQA kernel by @jhaotingc in #4416
  • [TRTLLM-6473][test] add speculative decoding and ep load balance cases into QA test list by @crazydemo in #6436
  • [fix] Fix DeepSeek w4a8 weight loading by @jinyangyuan-nvidia in #6498
  • chore: add EXAONE4 accuracy test by @yechank-nvidia in #6397
  • test: modify max_lora_rank of phi4_multimodal to 320 by @ruodil in #6474
  • [None][chore] Mass integration of release/0.21 (part5) by @dc3671 in #6544
  • [None][infra] update namelist by @niukuo in #6465
  • [https://nvbugs/5430932][infra] update namelist by @niukuo in #6585
  • [None][chore] add online help to build_wheel.py and fix a doc link by @zhenhuaw-me in #6391
  • test: move ministral_8b_fp8 to fp8_specific gpu list(exclude Ampere) by @ruodil in #6533
  • [TRTLLM-5563][infra] Move test_rerun.py to script folder by @yiqingy0 in #6571
  • [None][infra] Enable accuracy test for eagle3 and chunked prefill by @leslie-fang25 in #6386
  • [None][infra] Enable test of chunked prefill with logit post processor by @leslie-fang25 in #6483
  • [TRTLLM-4406][feat] LLM sleep & wakeup Part 1: virtual device memory by @tongyuantongyu in #5034
  • [None][fix] remove closed bugs by @xinhe-nv in #6576
  • [None][fix] xqa precision for fp16/bf16 kv cache by @Bruce-Lee-LY in #6573
  • [None][fix] Revert commit 48ddc3d & add test for disagg server with different max_num_tokens by @LinPoly in #6259
  • [None][chore] Bump version to 1.0.0rc6 by @yiqingy0 in #6597
  • [None][chore] Add unit test for Gemma3 lora by @brb-nv in #6560
  • [TRTLLM-6364] [fix] Update PR title regex to allow optional spaces between ticket and type by @niukuo in #6598
  • [None][infra] Waive failed case in post-merge on main by @EmmaQiaoCh in #6602
  • [None][test] update invalid test name by @crazydemo in #6596
  • [TRTLLM-5271][feat] best_of/n for pytorch workflow by @evezhier in #5997
  • [None][chore] Update Gemma3 closeness check to mitigate flakiness by @brb-nv in #6591
  • [TRTLLM-6685][feat] Add speculative metrics for trt llm bench by @kris1025 in #6476
  • [None][doc] Fix blog4 typo by @syuoni in #6612
  • [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6581
  • [TRTLLM-6856][feat] add disaggregated serving tests to QA list by @xinhe-nv in #6536
  • [https://nvbugs/5433581][infra] Update install docs and CI script for SBSA deep_gemm workaround by @chzblych in #6607
  • [TRTLLM-5990][doc] trtllm-serve doc improvement. by @nv-guomingz in #5220
  • [None][chore] Add readme for perf test by @ruodil in #6443
  • [https://nvbugs/5436461][infra] Skip test_eagle3 test with device memory check by @leslie-fang25 in #6617
  • [None][chore] ucx establish connection with zmq by @chuangz0 in #6090
  • [TRTLLM-6674][feat] (Breaking Change) Hopper SWA non-cyclic kernels + KV reuse + Spec Dec by @symphonylyh in #6379
  • [None][fix] Remove expand configuration from mamba2 mixer by @danielafrimi in #6521
  • [TRTLLM-6826][feat] Allow sending more than 2GiB through MPI by using mpi4py.util.pkl5 by @amitz-nv in h...

v1.0.0rc5 (Pre-release)

04 Aug 09:45 · fbee279

Announcement Highlights:

  • Model Support
  • Feature
    • Deepseek R1 FP8 Support on Blackwell (#6486)
    • Auto-enable ngram with concurrency <= 32. (#6232)
    • Support turning on/off spec decoding dynamically (#6363)
    • Improve LoRA cache memory control (#6220)
    • Support structural tag in C++ runtime and upgrade xgrammar to 0.1.21 (#6408)
    • Update long rope for phi3.5/phi4-mini/phi4-mm (#6353)
    • Add support for external multimodal embeddings (#6263)
    • Add support for disaggregation with pp with pytorch backend (#6369)
    • Add _prepare_and_schedule_batch function in PyExecutor (#6365)
    • Add status tags to LLM API reference (#5707)
    • Remove cudaStreamSynchronize when using relaxed acceptance (#5262)
    • Support JSON Schema in OpenAI-Compatible API (#6321)
    • Support chunked prefill on spec decode 2 model (#6104)
    • Enhance beam search support with CUDA graph integration (#6217)
    • Enable Overlap scheduler + Beam Search in TRTLLM Sampler (#6223)
    • Add KV cache reuse support for multimodal models (#5444)
    • Multistream initial support for torch compile flow (#5847)
    • Support nanobind bindings (#6185)
    • Support Weight-Only-Quantization in PyTorch Workflow (#5850)
    • Support pytorch LoRA adapter eviction (#5616)
  • API
    • [BREAKING CHANGE] Change default backend to PyTorch in trtllm-serve (#5717)
  • Bug Fixes
    • fix: remove duplicate layer multiplication in KV cache size calculation (#6481)
    • fix illegal memory access in MLA (#6437)
    • Fix nemotronNAS loading for TP>1 (#6447)
    • Switch placement of image placeholder for mistral 3.1 (#6435)
    • Fix wide EP when using DeepEP with online EPLB (#6429)
    • Move kv_cache_free_gpu_mem_fraction arg to benchmark command in tests (#6463)
    • Fix bugs caused by None attention_bias during Qwen3 model convert engine (#6344)
    • Add a runtime flag to enable fail fast when attn window is too large to fit at least one sequence in KV cache (#5974)
    • Fix PD + MTP + overlap scheduler accuracy issue (#6136)
    • Fix bug of Qwen3 when using fp4 on sm120 (#6065)
  • Benchmark
    • Fixes to parameter usage and low latency configuration. (#6343)
    • Add Acceptance Rate calculation to benchmark_serving (#6240)
  • Performance
    • Enable AllReduce-associated fusion patterns in Llama3/4. (#6205)
    • Optimize Mtp performance (#5689)
    • Customize cublasLt algo for Llama 3.3 70B TP4 (#6315)
    • Add non UB AR + Residual + Norm + Quant fusion (#6320)
  • Infrastructure
    • Remove auto_assign_reviewers option from .coderabbit.yaml (#6490)
    • Use build stage wheels to speed up docker release image build (#4939)
  • Documentation
    • Add README for wide EP (#6356)
    • Update Llama4 deployment guide: update config & note concurrency (#6222)
    • Add Deprecation Policy section (#5784)
  • Known Issues
    • If you encounter the error OSError: CUDA_HOME environment variable is not set, set the CUDA_HOME environment variable to point at your CUDA installation; see the sketch after this list.
    • The aarch64 Docker image and wheel package for 1.0.0rc5 are broken. This will be fixed in the upcoming weekly release.
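
A small sketch of the CUDA_HOME check described in the first known issue above; the example path in the comment is a common default and only an assumption.

```python
import os

cuda_home = os.environ.get("CUDA_HOME")
if not cuda_home:
    # Mirrors the OSError described above; set the variable before importing tensorrt_llm,
    # e.g. export CUDA_HOME=/usr/local/cuda (adjust to your actual CUDA install path).
    raise OSError("CUDA_HOME environment variable is not set")
print("CUDA_HOME =", cuda_home)
```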


v0.21.0

04 Aug 14:23 · 751d5f1

TensorRT-LLM Release 0.21.0

Key Features and Enhancements

  • Model Support
    • Added Gemma3 VLM support
  • Features
    • Added large-scale EP support
    • Integrated NIXL into the communication layer of the disaggregated service
    • Added fabric Memory support for KV Cache Transfer
    • Added MCP in ScaffoldingLLM
    • Added support for w4a8_mxfp4_fp8 quantization
    • Added support for fp8 rowwise quantization
    • Added generation logits support in TRTLLM Sampler
    • Added log probs support in TRTLLM Sampler
    • Optimized TRTLLM Sampler perf single beam single step
    • Enabled Disaggregated serving for Qwen-3
    • Added EAGLE3 support for Qwen-3
    • Fused finalize and allreduce for Qwen-MoE model
    • Refactored Fused MoE module
    • Added support for chunked attention on Blackwell and Hopper
    • Introduced sliding-window attention kernels for the generation phase on Blackwell
    • Updated DeepSeek FP8 TRT-LLM Gen cubins to improve performance in large batch size scenarios
    • Added FP8 block-scale GEMM support on SM89
    • Enabled overlap scheduler between draft forwards
    • Added Piecewise cuda graph support for MLA
    • Added model-agnostic one-engine eagle3
    • Enabled Finalize + Allreduce + add + rmsnorm fusion
    • Integrated TRT-LLM Gen FP8 block scale MoE with Pytorch workflow kernel autotuner
    • Added support for Eagle3 + disaggregated serving in two model speculative decoding flow
    • Validated Llama 3.1 models on H200 NVL
  • Benchmark:
    • Added all_reduce.py benchmark script for testing
    • Added beam width to trtllm-bench latency command
    • Fixed trtllm-bench iter_stats and cuda_graph_batch_sizes errors
    • Enabled trtllm-bench to run LoRA and add basic e2e perf testing capability for LoRA
    • Supported post_proc for bench
    • Added no_kv_cache_reuse option and streaming support for trtllm serve bench

Infrastructure Changes

  • The base Docker image for TensorRT-LLM is updated to nvcr.io/nvidia/pytorch:25.05-py3.
  • The base Docker image for TensorRT-LLM Backend is updated to nvcr.io/nvidia/tritonserver:25.05-py3.
  • The dependent public PyTorch version is updated to 2.7.1.
  • The dependent TensorRT version is updated to 10.11.
  • The dependent NVIDIA ModelOpt version is updated to 0.31.
  • The dependent NCCL version is updated to 2.27.5.

API Changes

  • Set _AutoDeployLlmArgs as primary config object
  • Removed decoder request from decoder interface
  • Enhanced the torch_compile_config in llm args
  • Removed the redundant use_kv_cache field from PytorchConfig
  • Moved allreduce_strategy from committed api to reference

Fixed Issues

  • Fixed disaggregated service hang when MNNVL two-shot AllReduce is enabled (#4678)
  • Fixed EP load balancer with MTP layer and route offset by EP rank (#4767)
  • Fixed cuda graph padding for spec decoding (#4853)
  • Fixed llama 4 long context issue (#4809)
  • Fixed max_num_sequences calculation with overlap scheduling (#4532)
  • Fixed chunked prefill + overlap scheduling (#5761)
  • Fixed trtllm-bench hang issue due to LLM API IPC (#4798)
  • Fixed index out of bounds error in spec decoding (#5954)
  • Fixed MTP illegal memory access in cuda graph warmup (#5947)
  • Fixed no free slots error with spec decode + disagg (#5975)
  • Fixed one-off attention window size for Gemma3 1B (#5564)

Known Issues

  • accuracy/test_cli_flow::TestGpt2::test_beam_search_large is broken.
  • Enabling disaggregated serving, MTP, and the overlap scheduler at the same time can lead to accuracy problems.
  • In 0.21, full chunked attention support has been added so that the LLaMA4 model can run functionally with sequence lengths greater than 8K. This functional enhancement introduces a known performance regression on Hopper that affects only the LLaMA4 model; the root cause has been identified and a fix will be included in a future release.


v1.0.0rc4 (Pre-release)

22 Jul 08:24 · 69e9f6d

Announcement Highlights:

  • Model Support
    • Add phi-4-multimodal model support (#5644)
    • Add EXAONE 4.0 model support (#5696)
  • Feature
    • Add support for two-model engine KV cache reuse (#6133)
    • Unify name of NGram speculative decoding (#5937)
    • Add retry knobs and handling in disaggregated serving (#5808)
    • Add Eagle-3 support for qwen3 dense model (#5879)
    • Remove padding of FusedMoE in attention DP (#6064)
    • Enhanced handling of decoder requests and logits within the batch manager (#6055)
    • Add support for Modelopt fp8_pb_wo quantization scheme (#6106)
    • Update deepep dispatch API (#6037)
    • Add support for benchmarking individual gemms in MOE benchmark (#6080)
    • Simplify token availability calculation for VSWA (#6134)
    • Migrate EAGLE3 and draft/target speculation to Drafter (#6007)
    • Enable guided decoding with overlap scheduler (#6000)
    • Use cacheTransceiverConfig as knobs for disagg service (#5234)
    • Add vectorized loading for finalize kernel in MoE Trtllm backend (#5919)
    • Enhance ModelConfig for kv cache size calculations (#5868)
    • Clean up drafter/resource manager creation logic (#5805)
    • Add core infrastructure to enable loading of custom checkpoint formats (#5372)
    • Cleanup disable_fp4_allgather (#6006)
    • Use session abstraction in data transceiver and cache formatter (#5611)
    • Add support for Triton request cancellation (#5898)
    • Support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-constrained GPUs (#5684)
    • Remove enforced sorted order of batch slots (#3502)
    • Use huge page mapping for host accessible memory on GB200 (#5963)
  • API
    • [BREAKING CHANGE] Unify KvCacheConfig in LLM class for pytorch backend (#5752); see the LLM API sketch after this list
    • [BREAKING CHANGE] Rename cuda_graph_config padding_enabled field (#6003)
  • Bug Fixes
    • Skip prompt length checking for generation only requests (#6146)
    • Avoid memory calls during broadcast for single GPU (#6010)
    • Record kv-cache size in MLACacheFormatter (#6181)
    • Always use py_seq_slot in runtime (#6147)
    • Update beam search workspace estimation to new upper bound (#5926)
    • Update disaggregation handling in sampler (#5762)
    • Fix TMA error with GEMM+AR on TP=2 (#6075)
    • Fix scaffolding aime test in test_e2e (#6140)
    • Fix KV Cache overrides in trtllm-bench (#6103)
    • Remove duplicated KVCache transmission check (#6022)
    • Release slots with spec decode + disagg (#5975) (#6032)
    • Add propagation for trust_remote_code to AutoConfig (#6001)
    • Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop (#6053)
    • Pad DeepEP fp4 recv tensors if empty (#6048)
    • Adjust window sizes of VSWA at torch backend (#5880)
    • Fix MOE benchmark to rotate buffers to prevent L2 cache reuse (#4135)
    • Fix eagle3 two model disaggregated serving test (#6014)
    • Update torch.compile option to fix triton store_cubin error (#5865)
    • Fix chunked prefill + overlap scheduling (#5761)
    • Fix mgmn postprocess error (#5835)
    • Fallback to cubins for fp8 fmha kernels on Ada (#5779)
    • Enhance _check_arguments to filter illegal requests for pytorch backend (#5541)
    • Rewrite completion API to avoid repetitive tokens (#5201)
    • Fix disagg + speculative decoding (#5558)
  • Benchmark
    • Add latency support for trtllm bench (#3730)
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
    • Performance Optimization for MNNVL TwoShot Kernel (#5925)
    • Enable 128x256 tile shapes for FP4 MOE CUTLASS backend (#5986)
    • Enable cuda graph by default (#5480)
  • Infrastructure
    • Add script to map tests <-> jenkins stages & vice-versa (#5431)
    • Speedup beam search unit tests with fixtures for LLM (#5843)
    • Fix single-GPU stage failed will not raise error (#6165)
    • Update bot help messages (#5277)
    • Update jenkins container images (#6094)
    • Set up the initial config for CodeRabbit (#6128)
    • Upgrade NIXL to 0.3.1 (#5991)
    • Upgrade modelopt to 0.33 (#6058)
    • Support show all stage name list when stage name check failed (#5946)
    • Run docs build only if PR contains only doc changes (#5184)
  • Documentation
    • Update broken link of PyTorchModelEngine in arch_overview (#6171)
    • Add initial documentation for trtllm-bench CLI. (#5734)
    • Add documentation for eagle3+disagg+dynamo (#6072)
    • Update llama-3.3-70B guide (#6028)
  • Known Issues
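
For the unified KvCacheConfig noted under API above (#5752), the following is a hedged sketch of passing the option through the LLM class on the PyTorch backend. The model path, sampling settings, and free_gpu_memory_fraction value are placeholders; consult the 1.0.0rc4 LLM API reference for the exact field set.

```python
from tensorrt_llm import LLM, SamplingParams
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",                   # placeholder checkpoint
    kv_cache_config=KvCacheConfig(free_gpu_memory_fraction=0.8),  # unified KV cache knob
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```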

What's Changed

  • [TRTLLM-6164][TRTLLM-6165] chore: add runtime example for pytorch by @Superjomn in #5956
  • fix: Fix MoE benchmark by @syuoni in #5966
  • [TRTLLM-6160] chore: add sampling examples for pytorch by @Superjomn in #5951
  • Use huge page mapping for host accessible memory on GB200 by @dongxuy04 in #5963
  • Breaking change: perf: [TRTLLM-4662] Enable cuda graph by default by @dominicshanshan in #5480
  • fix: set allreduce strategy to model config by @WeiHaocheng in #5955
  • chore: Mass integration of release/0.21 (part 3) by @dc3671 in #5909
  • infra: [TRTLLM-6242] install cuda-toolkit to fix sanity check by @ZhanruiSunCh in #5709
  • Waive L0 test by @yiqingy0 in #6002
  • [Nvbug/5383670] fix: switch test case to non-fp4 ckpt for more GPU coverage by @kaiyux in #5882
  • fix #4974: A thread leak issue in scaffolding unittest by @ccs96307 in #5020
  • feat: EXAONE4.0 support by @yechank-nvidia in #5696
  • [TRTLLM-5653][infra] Run docs build only if PR contains only doc changes by @zhanga5 in #5184
  • feat: Update Gemma3 Vision Encoder by @brb-nv in #5973
  • enh: Bidirectional mask with multiple images for Gemma3 by @brb-nv in #5976
  • refactor: Remove enforced sorted order of batch slots by @Funatiq in #3502
  • [fix] fix eagle3 two model disaggregated serving test by @Tabrizian in #6014
  • perf: Enable 128x256 tile shapes for FP4 MOE CUTLASS backend by @djns99 in #5986
  • [nvbugs-5318143] fix: restrict PyTorch memory usage to avoid OOMs by @ixlmar in #5964
  • doc: update EXAONE 4.0 news by @yechank-nvidia in #6034
  • [Model load] Fix llama min-latency model load by @arekay in #5883
  • fix: Fix MOE benchmark to rotate buffers to prevent L2 cache reuse by @djns99 in #4135
  • Doc: Update llama-3.3-70B guide by @jiahanc in #6028
  • infra: [TRTLLM-6331] Support show all stage name list when stage name check failed by @ZhanruiSunCh in #5946
  • [Infra][TRTLLM-6013] - Fix stage name in single stage test rerun report by @yiqingy0 in #5672
  • [Fix] check for ImportError or ModuleNotFoundError for deep_ep_utils by @lucaslie in #6026
  • infra: [TRTLLM-6313] Fix the package sanity stage 'Host Node Name' in… by @ZhanruiSunCh in #5945
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #6003
  • test: add recursive updating pytorch config and change MOE backend format in perf test by @ruodil in #6046
  • test: add llama_v3.3_70b_cases in perf test by @ruodil in #6035
  • [infra] add more log on reuse-uploading by @niukuo in #6036
  • fix: adjust window sizes of VSWA at torch backend by @jaedeok-nvidia in #5880
  • [nvbugs/5385972][nvbugs/5387423][Fix] Minor fix for llava_next/llava_onevision by @MinaHuai in #5998
  • Fix: pad DeepEP fp4 recv tensors if empty by @yuantailing in #6048
  • [fix] Move NCCL group in all-gather and reduce-scatter OPs outside the outer loop by @jinyangyuan-nvidia in #6053
  • support TRTLLM_DEEP_EP_TOKEN_LIMIT to allow run deep-ep on memory-con… by @ttyio in #5684
  • Cherry-pick #5947 by @lfr-0531 in #5989
  • test: Add regression tests for Gemma3 VLM by @brb-nv in #6033
  • feat/add latency support for trtllm bench by @danielafrimi in #3730
  • feat: Add support for Triton request cancellation by @achartier in #5898
  • [fix] Fix Triton build by @Tabrizian in #6076
  • fix: Unable to load phi4-model with tp_size>1 by @Wanli-Jiang in #5962
  • chore: Bump version to 1.0.0rc4 by @yiqingy0 in #6086
  • chore: upgrade modelopt to 0.33 by @nv-guomingz in #6058
  • [nvbug/5347489][nvbug/5388036] increase timeout in disagg worker test by @zhengd-nv in #6041
  • feat: use sessi...

v1.0.0rc3 (Pre-release)

16 Jul 08:25 · cfcb97a

Announcement Highlights:

  • Model Support
    • Support Mistral3.1 VLM model (#5529)
    • Add TensorRT-Engine Qwen3 (dense) model support (#5650)
  • Feature
    • Add support for MXFP8xMXFP4 in pytorch (#5411)
    • Log stack trace on error in openai server (#5749)
    • Refactor the topk parallelization part for the routing kernels (#5705)
    • Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests (#5774)
    • Support FP8 row-wise dense GEMM in torch flow (#5615)
    • Move DeepEP from Docker images to wheel building (#5534)
    • Add user-provided speculative decoding support (#5204)
    • Add optional module cache for TRT-LLM Gen Gemm interfaces (#5743)
    • Add streaming scaffolding_llm.generate_async support (#5345)
    • Detokenize option in /v1/completions request (#5382); see the endpoint sketch after this list
    • Support n-gram speculative decoding with disagg (#5732)
    • Return context response immediately when stream_interval > 1 (#5836)
    • Add support for sm121 (#5524)
    • Add LLM speculative decoding example (#5706)
    • Update xgrammar version to 0.1.19 (#5830)
    • Some refactor on WideEP (#5727)
    • Integrate TRT-LLM Gen FP4 block scale MoE with Pytorch workflow kernel autotuner (#5764)
    • Update transformers to 4.53.0 (#5747)
    • Share PyTorch tensor between processes (#5396)
    • Custom masking utils for Gemma3 VLM (#5853)
    • Remove support for llmapi + TRT backend in Triton (#5856)
    • Add TRTLLM MoE nvfp4 cubins for mid-high concurrency; attention_dp for TRTLLM MoE (#5723)
    • Enable kvcache to be reused during request generation (#4028)
    • Simplify speculative decoding configs (#5639)
    • Add binding type build argument (pybind, nanobind) (#5802)
    • Add the ability to write a request timeline (#5258)
    • Support deepEP fp4 post quant all2all dispatch (#5881)
    • Refactor the rest routing part for the routing kernels in the MoE TRT-LLM backend (#5771)
    • Move vision parts from processor to model for Gemma3 (#5888)
  • API
    • [BREAKING CHANGE] Rename mixed_sampler to enable_mixed_sampler (#5751)
    • [BREAKING CHANGE] Rename LLM.autotuner_enabled to enable_autotuner (#5876)
  • Bug Fixes
    • Fix test_generate_with_seed CI failure. (#5772)
    • Improve fp4_block_scale_moe_runner type check (#5681)
    • Fix prompt adapter TP2 case (#5782)
    • Fix disaggregate serving with attention DP (#4993)
    • Ignore nvshmem_src_*.txz from confidentiality-scan (#5831)
    • Fix a quote error introduced in #5534 (#5816)
    • Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. (#5801)
    • Fix lost requests for disaggregated serving (#5815)
    • Update unit tests: skip all_close assert for dropout in attention, increase tolerance for rope op test (#5855)
    • Fix GEMM+AR fusion on blackwell (#5563)
    • Catch inference failures in trtllm-bench (#5841)
    • Add is_fp8_output key to XQA kernel cubin hashing (solves Eagle3-one-engine Hopper fp8 bug) (#5813)
    • Skip rope scaling for local layers in Gemma3 VLM (#5857)
    • Fix llama4 multimodal support (#5809)
    • Fix Llama4 Scout FP4 crash issue (#5925)
    • Fix max batch size and max tokens in kv cache estimations for Nemotron-H (#5371)
    • Fix moe regression for sm120 (#5823)
    • Fix Qwen2.5VL FP8 support (#5029)
    • Fix the illegal memory access issue in moe gemm on SM120 (#5636)
    • Avoid nesting NCCL group in allgather and reduce scatter OPs (#5866)
    • Fix tileN cannot % 16==0 & support sm89 deepgemm bmm (#5531)
    • Fix incremental detokenization (#5825)
    • Fix MoE workspace info by storing Torch tensor itself instead of data_ptr (#5900)
    • Make the bench serving script compatible with different usages (#5905)
    • Fix mistral unit tests due to transformers upgrade (#5904)
    • Fix the Llama3.1 405B hanging issue. (#5698) (#5925)
    • Fix Gemma3 unit tests due to transformers upgrade (#5921)
    • Extend triton exit time for test_llava (#5971)
    • Fix alltoall for llama4 (apply_router_weight_on_input=True) (#5902)
    • Remove SpecConfig and fix thread leak issues (#5931)
    • Fast redux detection in trtllm gen routing kernel (#5941)
    • Fix cancel request logic (#5800)
    • Fix errors in wide-ep scripts (#5992)
    • Fix error in post-merge-tests (#5949)
  • Benchmark
  • Performance
    • Optimize TRTLLM Sampler perf single beam single step (#5550)
  • Infrastructure
    • Fix a syntax issue in the image check (#5775)
    • Speedup fused moe tests (#5726)
    • Set the label community action to only run on upstream TRTLLM (#5806)
    • Update namelist in blossom-ci (#5838)
    • Update nspect version (#5832)
    • Reduce redundant test cases for TRTLLM Gen FP8 MoE (#5845)
    • Parallelize torch unittests (#5714)
    • Use current_image_tags.properties in rename_docker_images.py (#5846)
    • Fix two known NSPECT high vulnerability issues and reduce image size (#5434)
  • Documentation
    • Update the document of qwen3 and cuda_graph usage (#5705)
    • Update cuda_graph_config usage part in DS R1 docs (#5796)
    • Add llama4 Maverick eagle3 and max-throughput and low_latency benchmark guide (#5810)
    • Fix link in llama4 Maverick example (#5864)
    • Add instructions for running gemma in disaggregated serving (#5922)
    • Add qwen3 disagg perf metrics (#5822)
    • Update the disagg doc (#5938)
    • Update the link of the diagram (#5953)
  • Known Issues
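
As a companion to the /v1/completions item in the Feature list above (#5382), the following is a hedged sketch of calling the OpenAI-compatible completions endpoint exposed by trtllm-serve. The host, port, and model id are illustrative assumptions; check the trtllm-serve documentation for the exact request schema, including the new detokenize option.

```python
import json
import urllib.request

payload = {
    "model": "local-model",                  # assumed model id registered with the server
    "prompt": "The capital of France is",
    "max_tokens": 16,
}
req = urllib.request.Request(
    "http://localhost:8000/v1/completions",  # assumed default trtllm-serve address
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["text"])
```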

What's Changed

  • feat: Add support for MXFP8xMXFP4 in pytorch by @djns99 in #5535
  • [Doc] update the document of qwen3 and cuda_graph usage by @byshiue in #5703
  • [Infra] - Fix a syntax issue in the image check by @chzblych in #5775
  • chore: log stack trace on error in openai server by @zhengd-nv in #5749
  • fix: [nvbug/5368507] Fix test_generate_with_seed CI failure. by @bobboli in #5772
  • Refactor the topk parallelization part for the routing kernels by @ChristinaZ in #5567
  • test: [CI] remove closed bugs by @xinhe-nv in #5770
  • [TRTLLM-5530][BREAKING CHANGE] refactor: LLM arglist rename mixed_sampler to enable_mixed_sampler by @Superjomn in #5751
  • fix: Adjust free GPU memory fraction in KvCacheConfig for DeepSeek R1 tests by @yizhang-nv in #5774
  • [TRTLLM-5812][feat] support FP8 row-wise dense GEMM in torch flow by @DylanChen-NV in #5615
  • feat: Optimize TRTLLM Sampler perf single beam single step by @dcampora in #5550
  • Refactor: move DeepEP from Docker images to wheel building by @yuantailing in #5534
  • [TRTLLM-6291] feat: Add user-provided speculative decoding support by @Funatiq in #5204
  • [ci] speedup fused moe tests by @omera-nv in #5726
  • [feat] Adds optional module cache for TRT-LLM Gen Gemm interfaces by @davidclark-nv in #5743
  • chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie… by @nv-guomingz in #5795
  • feat: add MultimodalParams & putting all multimodal params into it and refactor HyperCLOVAX & Qwen2/2.5-VL by @yechank-nvidia in #5522
  • Revert "chore: [Breaking Change] Rename cuda_graph_config padding_enabled fie…" by @nv-guomingz in #5818
  • [fix] https://nvbugs/5333654 Unwaive to check ci status and improve torch compile multi-gpu coverage by @liji-nv in #5700
  • [fix] improve fp4_block_scale_moe_runner type check by @Alcanderian in #5681
  • feat(scaffolding): add streaming scaffolding_llm.generate_async support by @dc3671 in #5345
  • [None][infra] Set the label community action to only run on upstream TRTLLM by @poweiw in #5806
  • Waive some test_llama_eagle3 unittests by @venkywonka in #5811
  • [NvBug 5362426] fix: Fix prompt adapter TP2 case by @syuoni in #5782
  • chore: bump version to 1.0.0rc3 by @yiqingy0 in #5819
  • doc: update cuda_graph_config usage part in DS R1 docs by @nv-guomingz in #5796
  • fix: Disaggregate serving with attention DP by @VALLIS-NERIA in #4993
  • Fix: ignore nvshmem_src_*.txz from confidentiality-scan by @yuantailing in #5831
  • tests: waive failed cases on main by @xinhe-nv in #5781
  • [Infra] - Waive L0 test by @yiqingy0 in #5837
  • update namelist in blossom-ci by @niukuo in #5838
  • Fix a quote error introduced in #5534 by @yuantailing in #5816
  • [feat]: Detokenize option in /v1/completions request by @Wokzy in #5382
  • [5305318] fix: Fix the accuracy issue when reduce_fusion is enabled for GEMMA model. by @hyukn in #5801
  • [TRTLLM-5847][feat] Support n-gram speculative decoding with disagg by @raayandhar in #5732
  • [TRTLLM-5878] update nspect version by @niukuo in #5832
  • feat: Return context response immediately when stream_interval > 1 by @kaiyux in https://github.c...