Releases: NVIDIA/TensorRT-LLM
v1.3.0rc1
Highlights
-
Model Support
-
API
-
Feature
- Update disagg slurm scripts (#10712)
- Re-implement MicroBatchScheduler and CapacityScheduler in Python (#10273)
- Fix sharding dashboard errors (#10786)
- Async Transfer Manager (#9891)
- Speculative One Model: FlashInfer sampling (#10284)
- Refactor speculative decoding workers (#10768)
- Use globally unique id as disagg request id (#10187)
- Enable guided decoding with reasoning parsers (#10890)
- Support partial weight update for fp8 (#10456)
- Multi-LoRA serving with CUDA Graph (#8279)
- Support logprobs for Completions API (#10809); a usage sketch follows this list
- Eagle3 Specdec UX improvements (#10124)
- Python transceiver components (step 2) (#10494)
- Upgrade NIXL to v0.9.0 (#10896)
- KV Connector Support for MTP (#10932)
- Support overlap scheduler for disagg ctx instances (#10755)
- Adding implementation of KVCacheManagerV2 (#10736)
- Switch to ConfigurableMoE as the default path (#10792)
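The Completions API logprobs entry above (#10809) is user-facing, so a short usage sketch may help. This is a minimal, hedged example assuming a local trtllm-serve instance exposing the OpenAI-compatible `/v1/completions` route and following the OpenAI Completions convention for `logprobs`; the server URL and model name are placeholders, not defaults:

```python
# Hedged sketch: request per-token logprobs from an OpenAI-compatible
# /v1/completions endpoint served by trtllm-serve (#10809).
# The base_url and model name below are placeholders.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

resp = client.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",  # placeholder model id
    prompt="The capital of France is",
    max_tokens=8,
    logprobs=2,  # top-2 alternatives per generated token, per the OpenAI spec
)

choice = resp.choices[0]
print(choice.text)
if choice.logprobs is not None:  # populated when the server supports logprobs
    for token, logprob in zip(choice.logprobs.tokens,
                              choice.logprobs.token_logprobs):
        print(f"{token!r}: {logprob:.3f}")
```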
Fix
- Enable system memory to transfer active message in NIXL ucx (#10602)
- Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A (#10539)
- Default disable gemm+allreduce fusion (#10656)
- Fix urllib3 and nbconvert vulnerabilities (#10551)
- Fix overlap scheduler race condition (#10610)
- Replace pickle.load with restricted Unpickler (#10622)
- Fix copy start_logs in disagg slurm scripts (#10840)
- Cherry-pick: Disable short profile for tunable ops with MERGE strategy (#10844, #10715)
- Lock resource to fix potential access to released data (#10827)
- Cherry-pick: Fix accuracy issue of TWO-SHOT AllReduce kernel (#10841, #10654)
- Remove weight tensor holder to release memory earlier (#10876)
- Add missing dist strategy param and fix typo for ad_logger (#10892)
- Update RMSNorm custom op plumbing (#10843)
- Fix hmac launch (#10434)
- Avoid Double update for previous batch (#9888)
- Re-init TRTLLM sampler to use sample stream in multi-stream cases (#10918)
- Fix MTP with async scheduler (#10941)
- Fix buffer reuse (#10716)
- Cherry-pick: Fix hanging issue for MNNVL Allreduce under PP (#10750, #10633)
- Workaround for flashinfer.sampling.sampling_from_logits (#10713)
- Fix port 8000 already-in-use issue in stress test (#10756)
Documentation
-
Test & Infra
- Upload regression info to artifactory (#10599)
- Add sonarqube scanning in lockfile generation pipeline (#10700)
- Add Nemotron Nano v3 FP8 autodeploy perf test (#10603)
- Remove trt flow tests in NIM (#10731)
- Update config.yaml of slurm scripts to align with submit.py change (#10802)
- Add a timeout in MNNVL throughput to prevent hangs if one rank crashes (#9532)
- Trigger multi-gpu tests when install_nixl/ucx.sh is modified (#10624)
- Add DGX-Spark VLM accuracy and perf spec dec cases (#10804)
- Fix test list llm_spark_func.txt (#10921)
- Add multi-GPU test for configurable MoE module (#10699)
- NVFP4 MoE - Move weights transformation to fusion phase (#10803)
- Update flashinfer-python to 0.6.1 (#10872)
- Improve disagg acc tests (#10833)
- Refine placement group in ray executor (#10235)
- Regenerate outdated lock file (#10940)
- Remove long-running sanity check tests on GH200 (#10924, #10969)
- Add dgx-spark beta notes (#10766)
- Modify ctx config in 128k8k disagg cases (#10779)
- Balanced random MoE workload generator for CuteDSL kernel UT, autotuner and layerwise benchmark (#10279)
What's Changed
- [#10696][fix] AutoDeploy prevent torch.export from specializing batch dimension when max_batch_size=1 by @MrGeva in #10697
- [None][infra] Add sonarqube scanning in lockfile generation pipeline by @yuanjingx87 in #10700
- [https://nvbugs/5769712][fix] fix timeout in AutoDeploy llama accuracy test by @lucaslie in #10461
- [#10688][fix] AutoDeploy Fix CUDA graph batch sizes exceeding max_batch_size by @MrGeva in #10687
- [#10642][feat] AutoDeploy: optimized canonicalize_graph utilities [1/2] by @lucaslie in #10675
- [https://nvbugs/5769890][fix] enable system memory to transfer active message in NIXL ucx by @chuangz0 in #10602
- [https://nvbugs/5814247][fix] unwaive AutoDeploy multi-gpu unit tests by @lucaslie in #10769
- [TRTLLM-10300][feat] Upload regression info to artifactory by @chenfeiz0326 in #10599
- [None][chore] Add release/1.2 branch into lockfile generation schedule by @yiqingy0 in #10790
- [TRTLLM-9581][infra] Use /home/scratch.trt_llm_data_ci in computelab by @ZhanruiSunCh in #10616
- [None][infra] Waive failed cases for main on 01/19 by @EmmaQiaoCh in #10794
- [#10607][chore] Add Nemotron Nano v3 FP8 autodeploy perf test by @MrGeva in #10603
- [None][feat] Update disagg slurm scripts by @qiaoxj07 in #10712
- [None][test] adjust the dis-agg test timeout threshold by @Shixiaowei02 in #10800
- [None][chore] docs: clarify LoRA is not supported with --use_fp8_rowwise in Fp8RowwiseAttention (see #2603) by @ssam18 in #10320
- [None][chore] Remove trt flow tests in NIM by @jieli-matrix in #10731
- [None][chore] update config.yaml of slurm scripts to align with submit.py change by @dc3671 in #10802
- [https://nvbugs/5776445][chore] unwaive test by @reasonsolo in #10667
- [TRTLLM-10029][scheduler] Re-implement MicroBatchScheduler and CapacityScheduler in Python by @lancelly in #10273
- [TRTLLM-10296][fix] Fix the potential misaligned access due to vectorized ld/st instructions in NVLinkOneSided A2A. by @bobboli in #10539
- [None][chore] Add failed cases into waives.txt by @xinhe-nv in #10776
- [None][fix] default disable gemm+allreduce fusion by @benzh-2025 in #10656
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10787
- [None][fix] Fix vulnerability urllib3 and nbconvert by @yiqingy0 in #10551
- [None][test] Update sanity test list by @xinhe-nv in #10825
- [None][fix] Remove unused params in attn by @yizhang-nv in #10652
- [TRTLLM-10785][feat] Fix sharding dashboard errors by @greg-kwasniewski1 in #10786
- [https://nvbugs/5701445][chore] unwaive test. by @yuxianq in #10806
- [None][infra] trigger multi-gpu tests when install_nixl/ucx.sh is mod… by @bo-nv in #10624
- [None][infra] Waive failed cases for main branch on 01/20 by @EmmaQiaoCh in #10829
- [None][chore] Reduce tedious logs by @chzblych in #10847
- [#10707][fix] AutoDeploy: Super accuracy test fixes by @galagam in #10717
- [None][chore] Async Transfer Manager by @jthomson04 in #9891
- [None][fix] fix duplicate entry in waives.txt by @lucaslie in #10853
- [None][feat] Speculative One Model: FlashInfer sampling by @IzzyPutterman in #10284
- [https://nvbugs/5670108][fix] Fix overlap scheduler race condition in… by @SimengLiu-nv in #10610
- [https://nvbugs/5760737][test] only skip mooncake+indexerkcache test by @zhengd-nv in #10266
- [https://nvbugs/5759698][fix] unwaive test_base_worker by @Superjomn in #10669
- [None][fix] Add a timeout in MNNVL throughput to prevent hangs if one rank crashes by @djns99 in #9532
- [https://nvbugs/5670458][chore] Unwaive reward model test by @shuyixiong in #10831
- [None][chore] Revert #10847 by @chzblych in #10869
- [https://nvbugs/5775021] [fix] Replace pickle.load with restricted Unpickler by @yibinl-nvidia in #10622
- [None][fix] Fix copy start_logs in disagg slurm scripts by @qiaoxj07 in #10840
- [None][fix] Cherry-pick #10715: Disable short profile for tunable ops with MERGE strategy by @hyukn in #10844
- [https://nvbugs/5740377][fix] Lock resource to fix potential access to released data by @HuiGao-NV in #10827
- [https://nvbugs/5814253][fix] unwaive test_autotuner_di...
v1.2.0rc6.post2
What's Changed
- [None][fix] enable EPLB for DEEPGEMM by @xxi-nv in #10618
- [https://nvbugs/5811697][fix] Fix buffer reuse for release/1.2.0rc6.post1 by @yuxianq in #10734
- [None][fix] impl fused triton kernel for e8m0 resmooth (target release/1.2.0rc6.post1, cherry-pick from #10327 and #10770) by @yuxianq in #10771
- [None][chore] Bump version to 1.2.0rc6.post2 by @yiqingy0 in #10907
Full Changelog: v1.2.0rc6.post1...v1.2.0rc6.post2
v1.3.0rc0
Highlights
-
Model Support
-
API Improvements
- Added processed logprobs functionality to TorchSampler (#9675)
- Added support for image_embeds in OpenAI API (#9715)
- Covered LLM API `multi_modal_embeddings` (#9963)
- Implemented GET/DELETE v1/responses/{response_id} endpoints (#9937); see the sketch after this list
- Use RequestError for validation errors to prevent engine shutdown (#9761)
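A minimal sketch of the new response-management endpoints (#9937), assuming a trtllm-serve instance at localhost:8000 that implements the OpenAI Responses API surface; the base URL and response id are placeholders:

```python
# Hedged sketch: exercise GET/DELETE v1/responses/{response_id} (#9937).
# The base URL and response id below are placeholders.
import requests

BASE = "http://localhost:8000/v1"
response_id = "resp_abc123"  # placeholder id returned by POST /v1/responses

# Retrieve a previously created response.
r = requests.get(f"{BASE}/responses/{response_id}")
print(r.status_code, r.json() if r.ok else r.text)

# Delete it once it is no longer needed.
r = requests.delete(f"{BASE}/responses/{response_id}")
print(r.status_code)
```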
Performance Optimizations
- Added Hopper XQA decode support for skip softmax attention (#10264)
- Enabled attention data parallelism for Nemotron Super v3 (#10347)
- Added fp4 GEMM with AllReduce support (#9729)
- Use XQA JIT implementation by default with sliding window perf optimization (#10335)
- Reduced host overhead for unified nvfp4 GEMM tuning path (#10503)
- Implemented fused Triton kernel for e8m0 resmooth to reduce memory footprint (#10327)
MoE (Mixture of Experts) Enhancements
- Added ExpertStatistic and DUMMY_ALLREDUCE for configurable MoE (#10401)
- Added test configurable MoE module (#10575)
- Implemented padding empty chunk for configurable MoE (#10451)
- Enabled EPLB for DEEPGEMM (#10617)
- Extended MoE quantization test utilities with comprehensive quant algorithm support (#10691)
Disaggregation Features
-
Auto Deploy
-
Fixes
- Fixed PP loop hang caused by isend-ing new requests (#10665)
- Avoided write-write race for async PP send (#10488)
- Fixed hang issue when enabling skip softmax on Blackwell (#10490)
- Fixed hanging issue for MNNVL Allreduce under PP (#10633)
- Implemented PP skip forward for all spec workers (#10578)
- Added warning for gen-only paused state (#10664)
- Used uint64_t as dtype of lamport_buffer_size to avoid overflow (#10499)
- Fixed HelixCpMnnvlMemory initialization with PP (#10533)
- Fixed regression in KV cache resize memory estimation (#10726)
- Prevented out-of-bounds read (#9879)
- Solved pillow version conflict (#10537)
- Support parsing the modules_to_not_convert keyword of the HF model config (#10527)
- Used correct model names for config database regression tests (#10192)
- Support GuidedDecoder with sharded logits (#10698)
- Fixed Piecewise CUDA Graph for GPTOSS (#10631)
- Fixed AutoDeploy EP sharding test (#10460)
- Fixed the nvfp4 fused_moe in AutoDeploy (#10727)
- Added quantization check for DeepEP LL low precision combine in new MoE comm API (#10072)
- Fixed AIPerf issue (#10666)
- Disabled TinyGEMM PDL due to accuracy issues (WAR) (#10619)
- Keep only a limited amount of performance statistics data (#10569)
- Convert to CUDA tensor before calling _resmooth_kernel (#10770)
Test & Infra
- Added hang detection for executor loop and worker (#10480)
- Implemented bot to send performance regression messages to Slack channel (#10489)
- Made model initialization more general and support weights loading in layer-wise benchmarks (#10562)
- Updated trtllm-gen to support groupsTokensHeadsQ (#10261)
- Added support to export data in trtllm-eval (#10075)
- Added Torch extension API for FusedAddRMSNormQuant kernel (#9905)
- Enabled ray tests (#10272)
- Prevented flaky failures in C++ test_e2e.py by using local cached datasets (#10638)
- Enabled partial reuse in Gemma and GPT OSS test (#10559)
What's Changed
- [TRTLLM-10195][feat] K-EXAONE support by @yechank-nvidia in #10355
- [None][test] update core test list by @crazydemo in #10538
- [#8391][chore] removed llama and added deepseek to AutoDeploy's L0 perf test by @MrGeva in #10585
- [TRTLLM-10022][feat] Add hopper xqa decode support for skip softmax attention by @pengbowang-nv in #10264
- [None][chore] update waive list by @jieli-matrix in #10577
- [None][feat] Add ExpertStatistic and DUMMY_ALLREDUCE for configurable_moe by @qiaoxj07 in #10401
- [TRTLLM-10248][feat] Support Bot to Send Perf Regression Msg to Slack Channel by @chenfeiz0326 in #10489
- [None][chore] update deepseekv3.2 test parameter by @yingguo-trt in #10595
- [None][test] Remove most TRT-backend test cases in llm_perf_nim.yml by @yufeiwu-nv in #10572
- [https://nvbugs/5794796][chore] waive test blocking premerge by @dc3671 in #10593
- [None][fix] Solve pillow version conflict by @Wanli-Jiang in #10537
- [TRTLLM-9522][test] cover LLM API `multi_modal_embeddings` by @ixlmar in #9963
- [None][infra] Waive failed tests for main 01/12 by @EmmaQiaoCh in #10604
- [#10580][fix] re-enable NemotronH MOE MMLU test by @suyoggupta in #10594
- [https://nvbugs/5761391][fix] Use correct model names for config database regression tests by @anish-shanbhag in #10192
- [None][chore] Print correct backend name in benchmark report by @galagam in #10597
- [https://nvbugs/5689235][fix] Fix cancellation+chunked prefill+disagg by @Tabrizian in #10111
- [https://nvbugs/5762336][fix] support to parse the keyword modules_to_not_convert of the HF model config by @xxi-nv in #10527
- [None][chore] Fix disagg assert by @fredricz-20070104 in #10596
- [TRTLLM-10271][test] Add Spark QA functional and performance cases by @JennyLiu-nv in #10564
- [None][infra] try removing shared cache dir mount by @tburt-nv in #10609
- [None][infra] Update allowlist 2026.01.08 by @niukuo in #10535
- [None][feat] Hang detection for executor loop and worker. by @yuxianq in #10480
- [TRTLLM-8462][feat] Support GET/DELETE v1/responses/{response_id} by @JunyiXu-nv in #9937
- [TRTLLM-10060][feat] Enable attention dp for Nemotron Super v3. by @nv-guomingz in #10347
- [https://nvbugs/5788127][fix] Use uint64_t as the dtype of lamport_buffer_size to avoid overflow by @yilin-void in #10499
- [NVBUG-5670458][chore] Unwaive lp tests by @hchings in #10524
- [TRTLLM-8425][doc] document Torch Sampler details by @ixlmar in #10606
- [None][feat] Layer-wise benchmarks: make model init more general and support weights loading by @yuantailing in #10562
- [None][test] Unwaive qwen3 next test case. by @nv-guomingz in #9877
- [None][feat] add fp4 gemm + allreduce by @benzh-2025 in #9729
- [None][infra] support overriding nspect version by @niukuo in #10402
- [https://nvbugs/5772396][fix] WAR: Disable TinyGEMM PDL due to accuracy issues by @dongfengy in #10619
- [None][feat] AutoDeploy: refactor memory usage logging by @nzmora-nvidia in #8505
- [#9283][feat] AutoDeploy: separate rms pattern detection from fusion by @Fridah-nv in #9969
- [https://nvbugs/5791900][fix] Fix HelixCpMnnvlMemory init with PP by @brb-nv in #10533
- [None][chore] Add test configurable moe module by @leslie-fang25 in #10575
- [https://nvbugs/5781589][fix] Implement pp skip forward for all spec workers. by @yuxianq in #10578
- [None][fix] Avoid write-write race for async pp send. by @yuxianq in #10488
- [https://nvbugs/5753788][chore] Padding empty chunk for configurable moe by @leslie-fang25 in #10451
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10589
- [None][chore] update allowlist 2026-01-13 by @tburt-nv in #10645
- [None][test] add test into qa test list by @xinhe-nv in #10627
- [None][test] Spark - Change testlist name and perf yml format by @JennyLiu-nv in #10626
- [None][chore] waive the CI failure by @xxi-nv in #10655
- [None][refactor] Unify the usage of MPIDist and TorchDist. by @yuxianq in #10380
- [None][fix] Reduce host over...
v1.2.0rc8
Highlights
-
Model Support
- Add export patch for GraniteMoe MoE models to enable torch.export compatibility (#10169)
- Eagle: qwen2 capture hidden states (#10091)
- Add pp support for DeepSeek-v3.2 (#10449)
- Pass lora_params through Qwen2/3 model forward (#10174)
- Fix export for microsoft/Phi-3-medium-128k-instruct (#10455)
- Minor code refinements for Mistral Large 3 (#10405)
- EPD for Qwen3 VL (#10470)
- Remove some model support; add device constraint (#10563)
- Enable AttentionDP on Qwen3-VL and fix test (#10435)
API
- Add stability tags for serve subcommand (#10012)
Feature
- Better align MLA chunking with indexer chunking when chunked prefill is enabled for DSV3.2 (#10552)
- Sm100 weight-only kernel (#10190)
- AutoTuner Cache: Support cache file lock and merge all ranks into one (#10336)
- Apply AutoTuner to AllReduce Op for strategy tuning (#8531)
- Add transferAgent binding (step 1) (#10113)
- Add the eos tokens in generation config to stop words in the sampler (#10389)
- Apply fusion for W4AFP8_AWQ MoE (#9838)
- Further reduce tuning time for cuteDSL nvFP4 dense gemm (#10339)
- Run sample_async on extra stream (#10215)
- Optimize qk rope/nope concat for DSA (#10571)
Fix
- Fix bug of Mistral-Small-3.1-24B-Instruct-2503 (#10394)
- Use 0 port as arbitrary port when disagg service discovery is enabled (#10383)
- Fix buffer reuse for CUDA graph attention metadata (#10393)
- Force release torch memory when LLM is destroyed (#10314)
- Swap TP-CP grouping order (#10350)
- TRTLLM MoE maps to lower tuning buckets when ep>1 (#9998)
- Fix draft token tree chain crash and depth=1 corner case (#10386, #10385)
- Fixed recursive node traversals (#10379)
- Fix undefined tokens_per_block (#10438)
- Skip spec dec for non-last rank (#10445)
- Setup dist before using autotuner (#10491)
- Fix broken cast (#9975)
- Fix sm120 speculation (#10049)
- Fix mamba_cache_manager when enabling cuda_graph_padding and let test cover this case (#9873)
- Choose register model config over root config for VLM (#10553)
Documentation
- Update SWA + spec dec support matrix (#10421)
- Add --config preference over --extra_llm_api_options in CODING_GUIDELINES.md (#10426)
- Adding parallelism types in feature combination matrix (#9849)
- Update GPTOSS Doc (#10536)
- Blog: Optimizing DeepSeek-V3.2 on NVIDIA Blackwell GPUs (#10565)
- Update Qwen3-Next doc by adding known issues section (#10582)
Test & Infra
- Add tests for DeepSeek v3.2 (#10561)
- Add accuracy tests for super-v3 with multiple-gpus (#10234)
- Layer-wise benchmarks: support TEP balance, polish slurm scripts (#10237)
- Add disag-serving kimi k2 thinking tests (#10357)
- Partition test_llm_pytorch.py for parallel execution (#10400)
- Only Use Throughput Metrics to Check Regression (#10404)
- Add vswa test cases coverage (#10146)
- Use random port in container port section (#10432)
- Remove redundant retries while binding to arbitrary port (#10452)
- Add qwen3-4b accuracy test case (#10382)
- Update kimi-k2-1k1k dataset (#10473)
- Fix concurrency list in Wide-EP perf tests (#10529)
- Restrict max_num_tokens in disagg mtp config (#10442)
- Add kimi_k2 single node perf test (#10436)
- Add MMMU test for mistral small (#10530)
- Workaround OCI-NRT slowdown issue (#10587)
What's Changed
- [#8391][chore] added deepseek_r1_distill_qwen_32b AutoDeploy perf test to L0 by @MrGeva in #10377
- [https://nvbugs/5670469][fix] Filter 0s and choose min of kv_head for Nemotron model by @farazkh80 in #10206
- [https://nvbugs/5772363][fix] fix bug of Mistral-Small-3.1-24B-Instruct-2503 by @byshiue in #10394
- [https://nvbugs/5649010][fix] use 0 port as arbitrary port when disagg service discovery is enabled by @reasonsolo in #10383
- [TRTLLM-10065][feat] Add accuracy tests for super-v3 with multiple-gpus by @Wanli-Jiang in #10234
- [https://nvbugs/5779534][fix] fix buffer reuse for CUDA graph attention metadata by @lfr-0531 in #10393
- [None][feat] sm100 weight-only kernel by @Njuapp in #10190
- [https://nvbugs/5701425][chore] Unwaive tests. by @yuxianq in #10269
- [None][feat] Layer-wise benchmarks: support TEP balance, polish slurm scripts by @yuantailing in #10237
- [None][infra] Waive failed cases in post-merge on 1/5 by @EmmaQiaoCh in #10399
- [TRTLLM-10185][feat] AutoTuner Cache: Support cache file lock and merge all ranks into one by @hyukn in #10336
- [TRTLLM-8242][feat] Add stability tags for serve subcommand by @LinPoly in #10012
- [https://nvbugs/5752521][fix] Unwaive test_trtllm_flashinfer_symbol_collision.py by @yihwang-nv in #10227
- [None][infra] Waive failed cases again on 1/5 by @EmmaQiaoCh in #10403
- [https://nvbugs/5715568][fix] Force to release torch memory when LLM is destroyed by @HuiGao-NV in #10314
- [TRTLLM-8821][feat] Apply AutoTuner to AllReduce Op for strategy tuning. by @hyukn in #8531
- [None][feat] update deepgemm to the DeepGEMM/nv_dev branch by @lfr-0531 in #9898
- [TRTLLM-9381][test] add disag-serving kimi k2 thinking tests by @xinhe-nv in #10357
- [#10374][fix] fixed race condition in AutoDeploy's mp tests port acquisition by @MrGeva in #10366
- [TRTLLM-9465][fix] Swap TP-CP grouping order by @brb-nv in #10350
- [None][perf] TRTLLM MoE maps to lower tuning buckets when ep>1 by @rosenrodt in #9998
- [TRTLLM-10053][feat] AutoDeploy: Add Super v3 config file, improve test runtime by @galagam in #10397
- [https://nvbugs/5772521][fix] Fix draft token tree chain crash by @mikeiovine in #10386
- [https://nvbugs/5772414][fix] Fix draft token tree depth=1 corner case by @mikeiovine in #10385
- [TRTLLM-9767][feat] Fixed recursive node traversals by @greg-kwasniewski1 in #10379
- [TRTLLM-9551][infra] Partition test_llm_pytorch.py for parallel execution by @Superjomn in #10400
- [https://nvbugs/5695984][fix] Unwaive llama3 eagle test by @mikeiovine in #10092
- [https://nvbugs/5745152][fix] Unwaive gpt oss spec decode test by @mikeiovine in #10370
- [#10170][fix] Add export patch for GraniteMoe MoE models to enable torch.export compatibility by @karthikvetrivel in #10169
- [https://nvbugs/5777044][chore] Remove solved bugs from waives.txt by @SimengLiu-nv in #10422
- [None][feat] precompiled installation from local src dir by @lucaslie in #10419
- [TRTLLM-9527][feat] Add transferAgent binding (step 1) by @chuangz0 in #10113
- [None][fix] Only Use Throughput Metrics to Check Regression by @chenfeiz0326 in #10404
- [None][feat] add the eos tokens in generation config to stop words in the sampler by @JadoTu in #10389
- [None][chore] Update SWA + spec dec support matrix by @mikeiovine in #10421
- [None][feat] CuteDSL MOE FC1 Enhancement by @liyuhannnnn in #10088
- [https://nvbugs/5726962][feat] Apply fusion for W4AFP8_AWQ MoE by @yumin066 in #9838
- [#2511][fix] eagle: qwen2 capture hidden states by @XiaoXuan42 in #10091
- [None][docs] Add `--config` preference over `--extra_llm_api_options` in CODING_GUIDELINES.md by @venkywonka in #10426
- [#8460][feat] Revive and simplify Model Explorer visualization integration by @karthikvetrivel in #10150
- [None][chore] unwaive qwen3 30b test by @kris1025 in #10115
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10384
- [None][test] update test case constraint by @crazydemo in #10381
- [https://nvbugs/5769926] [fix] Add no container mount home WAR by @kaiyux in #10431
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #10367
- [TRTLLM-9622][infra] Enable DGX_B300 multi-gpu testing in pre-merge pipeline by @yiqingy0 in #9699
- [TRTLLM-9896][test] add vswa test cases coverage by @crazydemo in #10146
- [None] [fix] Fix undefined tokens_per_block by @kaiyux in #10438
- [https://nvbugs/5772361][ci] Unwaive tests that have been fixed by @2ez4bz...
v1.2.0rc6.post1
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
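As a stopgap, the installed GnuPG version can be checked before deciding whether a manual upgrade is needed. This is a minimal sketch, assuming `gpg` is on PATH and reports its version in the usual `gpg (GnuPG) X.Y.Z` form:

```python
# Hedged sketch: flag a GnuPG installation older than the fixed 2.4.9
# release. Parsing of `gpg --version` output is a best-effort assumption.
import re
import subprocess

out = subprocess.run(["gpg", "--version"], capture_output=True, text=True).stdout
match = re.search(r"gpg \(GnuPG\) (\d+)\.(\d+)\.(\d+)", out)
if match:
    version = tuple(int(part) for part in match.groups())
    if version < (2, 4, 9):
        print(f"GnuPG {'.'.join(map(str, version))} is affected; upgrade to >= 2.4.9")
    else:
        print("GnuPG is at or above 2.4.9")
```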
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
What's Changed
- [https://nvbugs/5708810][fix] Fix TRTLLMSampler by @moraxu in #9710
- [TRTLLM-9641][infra] Use public triton 3.5.0 in SBSA by @ZhanruiSunCh in #9652
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9979
- [TRTLLM-9794][ci] move more test cases to gb200 by @QiJune in #9994
- [None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by @ChristinaZ in #9792
- [TRTLLM-8310][feat] Add Qwen3-VL-MoE by @yechank-nvidia in #9689
- [https://nvbugs/5731717][fix] fixed flashinfer build race condition during test by @MrGeva in #9983
- [FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass by @Wanli-Jiang in #9358
- [None][chore] Update internal_cutlass_kernels artifacts by @yihwang-nv in #9992
- [None][docs] Add README for Nemotron Nano v3 by @2ez4bz in #10017
- [None][infra] Fixing credential loading in lockfile generation pipeline by @yuanjingx87 in #10020
- [https://nvbugs/5727952][fix] a pdl bug in trtllm-gen fmha kernels by @PerkzZheng in #9913
- [None][infra] Waive failed test for main branch on 12/16 by @EmmaQiaoCh in #10029
- [None][doc] Update CONTRIBUTING.md by @syuoni in #10023
- [None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM by @syuoni in #10008
- [TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic by @reasonsolo in #9726
- [None][chore] Final mass integration of release/1.1 by @mikeiovine in #9960
- [None][fix] Fix iteration stats for spec-dec by @achartier in #9855
- [https://nvbugs/5741060][fix] Fix pg op test by @shuyixiong in #9989
- [https://nvbugs/5635153][chore] Remove responses tests from waive list by @JunyiXu-nv in #10026
- [None] [feat] Enhancements to slurm scripts by @kaiyux in #10031
- [None][infra] Waive failed tests due to llm model files by @EmmaQiaoCh in #10068
- [None][fix] Enabled simultaneous support for low-precision combine and MTP. by @yilin-void in #9091
- [https://nvbugs/5698434][test] Add Qwen3-4B-Eagle3 One-model perf test by @yufeiwu-nv in #10041
- [TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT by @hyukn in #10036
- [TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. by @hyukn in #10040
- [None][chore] Remove unnecessary warning log for tuning. by @hyukn in #10077
- [TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353) by @tongyuantongyu in #9655
- [None][chore] Bump version to 1.2.0rc6.post1 by @yiqingy0 in #10484
Full Changelog: v1.2.0rc6...v1.2.0rc6.post1
v1.2.0rc2.post1
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
What's Changed
- [None][chore] Bump version to 1.2.0rc2.post1 by @yiqingy0 in #10286
- [TRTLLM-9752][fix] disable PDL for quant kernels by @bo-nv in #10288
Full Changelog: v1.2.0rc2...v1.2.0rc2.post1
v1.2.0rc7
Security Vulnerabilities
GnuPG Vulnerability
A security vulnerability has been identified in GnuPG versions prior to 2.4.9, which is present in the Ubuntu 24.04 LTS utilized by the TensorRT LLM base image. For details regarding this vulnerability, please refer to the official Ubuntu advisory: CVE-2025-68973. An official patched package for the Ubuntu system is currently pending. The fix will be included in the next release once the updated package is published and incorporated. To mitigate potential risks immediately, users are advised to manually upgrade GnuPG to version 2.4.9 or later.
Hugging Face Transformers Vulnerabilities
Several security vulnerabilities have been disclosed regarding the Hugging Face Transformers library used in TensorRT LLM. As these issues originate from an upstream dependency, remediation is dependent on the release of a patch by the Hugging Face team. We are actively monitoring the situation and will update TensorRT LLM to include the necessary fixes once a stable release of the Transformers library addressing these vulnerabilities becomes available. Affected CVEs: CVE-2025-14920, CVE-2025-14921, CVE-2025-14924, CVE-2025-14927, CVE-2025-14928, CVE-2025-14929, CVE-2025-14930
Highlights
-
Model Support
- Add Qwen3-VL-MoE (#9689)
- Support DeepSeek-V32 chat template (#9814)
- Support DeepSeek-V3.2, R1 and V3.1 tool parser (#10126, #10010)
- Support Eagle3 on Mistral Large3 (#9971)
- Support VLM part for Mistral Large 3 (#10188)
- Support multi-gpu running for nemotron-v3-nano and super (#10118)
- Support Qwen3-VL dense model in pytorch backend (#9060)
- Support NVFP4 for gptoss (#8956)
- Add MLA Based Eagle (#9677)
API
-
Feature
- Support NVFP4 weight and weight_scale padding for MoE cutlass (#9358)
- Add routing support for the new model for cutlass and TRTLLM MoE backend (#9792)
- Improve disagg-server prometheus metrics and synchronize dynamic workers’ clocks (#9726)
- Update TRT-LLM Gen MoE for NvFp4 + bias with tileN=256 (#9734)
- Add optimization options for MOE CuteDSL finalized kernel (#10042)
- Add fp8 bmm on sm120 (#9687)
- Reuse alltoall workspace for CuteDSL MoE output (#9840)
- Support Mooncake transfer engine as cache transceiver backend (#8309)
- Enable KV cache reuse for config database (#10094)
- Enable PDL for CuteDSL kernels and overlap MoeOutputMemset (#10043)
- Cudagraph updates for helix parallelism (#10141)
- Custom AllToAll for helix parallelism (#9986)
- Pass MRoPE tensors for EPD disagg (#9758)
- Reuse previous draft requests if possible (#10263)
- Make PDL enabled by default (#9695)
- Enable 2CTA with autotune for CuteDSL MoE and Grouped GEMM optimizations (#10201)
- Provide attention NVFP4 out support for torch compile (#9740)
- Increase topk upper limit to 22 for NVLinkOneSided AlltoAll (#10229)
- Deliver optimizations for two-model speculative decoding (#10208)
Fix
- Fix PDL bug in trtllm-gen FMHA kernels (#9913)
- Fix Illegal Memory Access for CuteDSL Grouped GEMM (#10008)
- Disable tvm_ffi for CuteDSL nvFP4 dense GEMM (#10040)
- Fix ready signal in NIXL backend (#10000)
- Fix top_k=10 in NVLinkOneSided AlltoAll (#10197)
- Fix race conditions in KV cache communication during unexpected termination (#10076)
- Fix deepseek sharding (#9984)
- Fix contiguous view usage in load_expert weights (#10136)
- Fix detokenizer issue for DeepSeek-v3.2 (#10106)
- Fix index offset overflow in custom Top-K kernel and UT (#10027)
- Fix draft_lengths for CUDA graph capture (#10004)
- Fix port conflict handling for CI (#10392, #10175, #10035)
- Fix NVFP4 linear method weight and weight_scale padding (#10148)
- Fix VSWA block store/load scheme in KV cache manager (#10183)
- Fix ready signal and execution_stream synchronization across components (#10060)
- Fix PP+CP combination with helix parallelism (#10312)
- Fix Gemma3 RoPE for local attention (#9961)
- Make NCCL resource manager destructor exception-safe (#10166)
- Fix detokenizer / tokenizer issues (use local tokenizer, cache vocab) (#10230, #10219)
- Disable PDL for quant kernels to address accuracy (#10285)
- Fix hilo: Avoid property with setter in nn modules (#10212)
Documentation
- Add README for Nemotron Nano v3 (#10017)
- Update CONTRIBUTING.md (#10023)
- Update online benchmarking docs (#9611)
- Update Dynamo Example document (#9619, #10368)
- Update Perf_Overview.md with benchmarking results (#9723)
- Add NIXL-Libfabric usage documentation (#10205)
- Add Sparse Attention feature doc (#9648)
- Update IFB performance guide & GPTOSS deployment guide (#10283)
- Promote perfect MoE router feature documentation (#10303)
Test & Infra
- Fix credential loading in lockfile generation pipeline (#10020)
- Add Qwen3-4B-Eagle3 one-model perf test (#10041)
- Add regression testing for config database (#9832)
- Update tests for nemotron_h (#9993)
- Use ucx as default backend (#10101)
- Fix OpenSearch URL in slurm_launch.sh for multinode perf sanity (#9990)
- Remove helix test from RTX test list (#10224)
- Add ray test robustness and RL perf reproduce script (#9939)
- Support multi-node disagg perf test in CI (#9138)
- Enable single-gpu CI on spark (#9304)
- Add disaggregated stress test (#9354)
- Include LongBenchV1 in trtllm-eval (#10265)
- Fix port conflict avoidance in CI via get_free_port_in_ci (#10392)
What's Changed
- [https://nvbugs/5708810][fix] Fix TRTLLMSampler by @moraxu in #9710
- [TRTLLM-9641][infra] Use public triton 3.5.0 in SBSA by @ZhanruiSunCh in #9652
- [TRTLLM-8638][fix] Add failed cases into waives.txt by @xinhe-nv in #9979
- [TRTLLM-9794][ci] move more test cases to gb200 by @QiJune in #9994
- [None][feat] Add routing support for the new model for both cutlass and trtllm moe backend by @ChristinaZ in #9792
- [TRTLLM-8310][feat] Add Qwen3-VL-MoE by @yechank-nvidia in #9689
- [https://nvbugs/5731717][fix] fixed flashinfer build race condition during test by @MrGeva in #9983
- [FMDL-1222][feat] Support weight and weight_scale padding for NVFP4 MoE cutlass by @Wanli-Jiang in #9358
- [None][chore] Update internal_cutlass_kernels artifacts by @yihwang-nv in #9992
- [None][docs] Add README for Nemotron Nano v3 by @2ez4bz in #10017
- [None][infra] Fixing credential loading in lockfile generation pipeline by @yuanjingx87 in #10020
- [https://nvbugs/5727952][fix] a pdl bug in trtllm-gen fmha kernels by @PerkzZheng in #9913
- [None][infra] Waive failed test for main branch on 12/16 by @EmmaQiaoCh in #10029
- [None][doc] Update CONTRIBUTING.md by @syuoni in #10023
- [None][fix] Fix Illegal Memory Access for CuteDSL Grouped GEMM by @syuoni in #10008
- [TRTLLM-9181][feat] improve disagg-server prometheus metrics; synchronize workers' clocks when workers are dynamic by @reasonsolo in #9726
- [None][chore] Final mass integration of release/1.1 by @mikeiovine in #9960
- [None][fix] Fix iteration stats for spec-dec by @achartier in #9855
- [https://nvbugs/5741060][fix] Fix pg op test by @shuyixiong in #9989
- [https://nvbugs/5635153][chore] Remove responses tests from waive list by @JunyiXu-nv in #10026
- [None] [feat] Enhancements to slurm scripts by @kaiyux in #10031
- [None][infra] Waive failed tests due to llm model files by @EmmaQiaoCh in #10068
- [None][fix] Enabled simultaneous support for low-precision combine and MTP. by @yilin-void in #9091
- [https://nvbugs/5698434][test] Add Qwen3-4B-Eagle3 One-model perf test by @yufeiwu-nv in #10041
- [TRTLLM-9998][fix] Change trtllm-gen MoE distributed tuning strategy back to INDEPENDENT by @hyukn in #10036
- [TRTLLM-9989][fix] Disable tvm_ffi for CuteDSL nvFP4 dense GEMM. by @hyukn in #10040
- [None][chore] Remove unnecessary warning log for tuning. by @hyukn in #10077
- [TRTLLM-9680][perf] Optimize TRTLLMSampler log_probs performance (Core fix has been merged via #9353) by @tongyuantongyu in https://github.com/NVIDIA/Ten...
v1.2.0rc6
Highlights
-
Model Support
-
API
-
Feature
- 2D parallel EP TP support (#9459)
- Fused kernels (qknormrope + moe routing) and two-model MTP support for glm4moe (#9852)
- Add gather fc1 kernel by cuteDSL (#9618)
- Add GB300 support since it does not support segment (#9731)
- Add helixPostProcessNative kernel for cp_dim=2 (#9924)
- Added symmetric memory AllReduce strategy (#8919)
- ConfigurableMoE support (#9772, #9858)
- Enable multistream for Linear Attention in Qwen3 (#9696)
- Enable PDL for indexer topK (#9843)
- Implement distributed tuning system (#9621)
- Implement sampling on 1-model EAGLE3 (#9885)
- Move D->H copies to a worker thread (#8463)
- Optimize the host overhead of _sample_async (#9935)
- Port fp4 quantization kernel optimization from FlashInfer (#9854)
- Support larger topK for NVLinkOneSided AlltoAll. (#9816)
Fix
- Fix CUDA stream sync issue in ModelRunnerCPP (#6426)
- Fix accuracy issue in TRTLLM MoE (#9999)
- Fix PDL in TRTLLM MOE for dsv3 (#9799)
- Fix unterminated process issue for RemoteOpenAIServer (#9490)
- Fix PDL bugs with trtllm-gen fmha kernels (#9863)
- Use first PP rank's schedule result in other PP ranks to fix PP hang (#9659)
Documentation
-
Test & Infra
What's Changed
- [https://nvbugs/5703953][fix] Preserving ip:port for trtllm-serve before initializing llm by @JunyiXu-nv in #9646
- [None][infra] Waive failed cases for main branch on 12/07 by @EmmaQiaoCh in #9769
- [None][fix] Several minor fixes to CI setting by @chzblych in #9765
- [OMNIML-3036][doc] Re-branding TensorRT-Model-Optimizer as Nvidia Model-Optimizer by @cjluo-nv in #9679
- [None][feat] Enable NCCL_SYMMETRIC as default fallback for AllReduce by @nv-lschneider in #9314
- [TRTLLM-9000][feat] Add multi-node Perf Tests into CI by @chenfeiz0326 in #8800
- [None][test] add ntp tolerance in time metrics verification by @zhengd-nv in #9741
- [TRTLLM-9603][feat] Enable ConfigurableMoE test in the CI by @xxi-nv in #9645
- [https://nvbugs/5422621][test] Add GB 200 WIDEEP test case for RCCA 5422621 by @fredricz-20070104 in #9506
- [None][fix] Fix two tuning cache miss issues. by @hyukn in #9743
- [TRTLLM-9706] [doc] Update wide EP documents by @kaiyux in #9724
- [https://nvbugs/5666804][test] only adding sampler config for limited models by @ruodil in #9512
- [None][infra] Waive failed cases for main on 12/08 by @EmmaQiaoCh in #9773
- [None][chore] Move the rocketkv e2e test to post-merge by @lfr-0531 in #9768
- [None][chore] Enable tvm_ffi for cute dsl nvfp4_gemm to reduce host overhead. by @limin2021 in #9690
- [TRTLLM-9431][perf] Enable multistream for Linear Attention in Qwen3-… by @nv-guomingz in #9696
- [None][chore] Remove closed bugs by @xinhe-nv in #9770
- [None][infra] update mooncake in docker images by @zhengd-nv in #9584
- [None][test] Add Kimi k2 WIDEEP perf and accuracy cases by @fredricz-20070104 in #9686
- [https://nvbugs/5527655][test] Add test case for RCCA 5527655 by @fredricz-20070104 in #9511
- [http://nvbugs/5649010][fix] fix test_auto_scaling.py::test_worker_restart timeout by @reasonsolo in #9775
- [None][fix] Switch AutoDeploy's default allreduce strategy to NCCL by @MrGeva in #9666
- [TRTLLM-9506][fix] Fix AR for DeepSeek-R1 2 model path by @sunnyqgg in #9661
- [TRTLLM-9089][chore] Port prepare_dataset into trtllm-bench by @FrankD412 in #9250
- [https://nvbugs/5567586][feat] Ampere xqa swa specdec for GPT-OSS Eagle3-one-model by @jhaotingc in #8383
- [TRTLLM-7967][chore] Add more tests by @yibinl-nvidia in #9415
- [https://nvbugs/5508267][fix] Proper handling of inactive canceled requests by @thorjohnsen in #9280
- [#8921][feat] Added symetric memory AllReduce strategy by @MrGeva in #8919
- [None][fix] Fix #8383 introduced TRTLLM backend python error by @jhaotingc in #9804
- [#9753][feat] AutoDeploy: Implement add rms_norm fusion by @nvchenghaoz in #9754
- [None][infra] Correct the waived test names due to a merge conflict by @yuanjingx87 in #9803
- [None][fix] Fix PDL in TRTLLM MOE for dsv3 by @dmtri35 in #9799
- [None][feat] Add llama4 scaling by @byshiue in #9771
- [https://nvbugs/5677746][fix] Use first PP rank's schedule result in other PP ranks to fix PP hang by @jiaganc in #9659
- [None][fix] Fix unterminated process issue for RemoteOpenAIServer by @JunyiXu-nv in #9490
- [https://nvbugs/5726066][infra] Waive timeout disaggregated/test_auto_scaling tests. by @bobboli in #9815
- [None][chore] Fix tests failing on pre-merge 12/08 by @brb-nv in #9819
- [https://nvbugs/5722653][fix] Fix config file used by disagg_client by @JunyiXu-nv in #9783
- [TRTLLM-6537][chore] Shorten the time limit for dis-agg accuracy testing by @Shixiaowei02 in #9614
- [None][infra] Use artifactory pypi mirror for Cython install by @ZhanruiSunCh in #9774
- [TRTLLM-9794][ci] remove duplicated test cases in DGX B200 by @QiJune in #9817
- [None][test] Refactor qa/llm_perf_nim.yml test list by @yufeiwu-nv in #9700
- [None][chore] Generate lock file for release/1.2.0rc4.post1 branch automatically by @yiqingy0 in #9829
- [None][fix] Additional model outputs for pipeline parallelism by @Funatiq in #9794
- [TRTLLM-6756][feat] Update BeamSearch for TorchSampler by @stnie in #9660
- [TRTLLM-9794][ci] move qwen3-next test cases to gb200 by @QiJune in #9827
- [None][infra] Waive failed cases for main branch on 12/09 by @EmmaQiaoCh in #9839
- [https://nvbugs/5575841] [fix] Nvbug 5575841: Remove additional test waivers for TestMoEFP4 by @DomBrown in #9788
- [None][feat] Make 2-model spec dec use the 1-model kernels (Hopper) by @mikeiovine in #8810
- [None][chore] Adding flaky auto scaling test to waives by @pcastonguay in #9851
- [#8921][chore] AutoDeploy NanoV3 to use SYMM_MEM allreduce strategy by @MrGeva in #9797
- [TRTINFRA-7328][infra] Consume SlurmCluster scratchPath and cleanup mounts by @mlefeb01 in #9600
- [https://nvbugs/5688388][chore] Unwaiving fixed disagg test by @pcastonguay in #9800
- [https://nvbugs/5719561][chore] Unwaive tests for nvbug 5719561 by @pcastonguay in #9801
- [https://nvbugs/5508301][feat] Move D->H copies to a worker thread whe… by @dhansen-nvidia in #8463
- [None][chore] Add unittest for otlp tracing by @zhanghaotong in #8716
- [None][chore] Support larger topK for NVLinkOneSided AlltoAll. by @bobboli in #9816
- [TRTLLM-9794][ci] move some deepseek test cases to gb200 by @QiJune in #9841
- [TRTLLM-9661][fix] Fix nvfp4 gemm allowed backends arg passing by @hyukn in #9837
- [https://nvbugs/5702791][fix] Unwaive fixed test by @dominicshanshan in #9844
- [TRTLLM-...
v1.1.0
Known Issues
- If users create a project with `tensorrt-llm==1.1.0` as a dependency in the `pyproject.toml` file, as below:

  ```toml
  dependencies = [
      "tensorrt-llm==1.1.0",
  ]
  ```

  then installing the project's dependencies with `uv sync` fails with the following message:

  ```text
  No solution found when resolving dependencies for split (markers: python_full_version >= '3.13' and sys_platform == 'darwin'):
  ╰─▶ Because patchelf==0.18.0.0 was yanked (reason: https://github.com/mayeut/patchelf-pypi/issues/87) and tensorrt-llm==1.1.0 depends on patchelf==0.18.0, we can conclude that tensorrt-llm==1.1.0 cannot be used.
      And because your project depends on tensorrt-llm==1.1.0, we can conclude that your project's requirements are unsatisfiable.
  ```

  This happens because patchelf 0.18.0 was yanked by its author. A valid workaround is to add the following block to `pyproject.toml`:

  ```toml
  [tool.uv]
  override-dependencies = [
      "patchelf==0.17.2.4",
  ]
  ```
What's Changed
- [None][chore] Bump version to 1.1.0rc0 by @yiqingy0 in #6651
- [TRTLLM-6683][feat] Support LoRA reload CPU cache evicted adapter by @amitz-nv in #6510
- [None][test] correct test-db context for perf yaml file by @ruodil in #6686
- [None] [feat] Add model gpt-oss by @hlu1 in #6645
- [https://nvbugs/5409414][fix] fix Not registered specs by @xinhe-nv in #6660
- [None][feat] : Add FP8 context MLA support for SM120 by @peaceh-nv in #6059
- [TRTLLM-6092][doc] Add LoRA feature usage doc by @shaharmor98 in #6603
- [TRTLLM-6409][feat] Enable guided decoding with speculative decoding (part 1: two-model engine) by @syuoni in #6300
- [TRTLLM-6881][feat] Include attention dp rank info with KV cache events by @pcastonguay in #6563
- [None][infra] Fix guardwords by @EmmaQiaoCh in #6711
- [None][package] Pin cuda-python version to >=12,<13 by @yiqingy0 in #6702
- [None][doc] Add deployment guide section to the official doc website by @nv-guomingz in #6669
- [None][fix] disagg ctx pp4 + gen pp4 integ test by @raayandhar in #6489
- [None][feat] Clean up ngram auto mode, add max_concurrency to configs by @mikeiovine in #6676
- [None][chore] Remove py_executor from disagg gh team by @pcastonguay in #6716
- [https://nvbugs/5423962][fix] Address broken links by @chenopis in #6531
- [None][fix] Migrate to new cuda binding package name by @tongyuantongyu in #6700
- [https://nvbugs/5410687][fix] Hopper w4a8 groupwise MoE interleave by @symphonylyh in #6708
- [None][feat] Add NCCL Symmetric Integration for All Reduce by @Tabrizian in #4500
- [TRTLLM-6785][feat] BREAKING CHANGE Enable TRTLLM sampler by default by @dcampora in #6216
- [TRTQA-2920][fix] Add failed cases into waives.txt by @xinhe-nv in #6719
- [TRTLLM-5252][test] add for mistral_small_3.1_24b perf test by @ruodil in #6685
- [TRTLLM-6744][feat] Remove input_sf swizzle for module WideEPMoE by @StudyingShao in #6231
- [None][fix] Fix unnecessary GPU synchronization in torch sampler caused by incorrect tensor reference by @zhanghaotong in #6626
- [TRTLLM-6854][feat] Enable guided decoding with disagg serving by @syuoni in #6704
- [TRTLLM-5252][fix] Propagate mapping to intermediate layers by @2ez4bz in #6611
- [None][test] fix yml condition error under qa folder by @ruodil in #6734
- [None][doc] Add doc for multimodal feature support matrix by @chang-l in #6619
- [TRTLLM-6898][feat] make fused_moe_cute_dsl work on blackwell by @limin2021 in #6616
- [https://nvbugs/5436461][infra] Adjust free_gpu_memory_fraction of test_eagle3 to prevent OOM on CI by @leslie-fang25 in #6631
- [None][refactor] Combine resmooth_to_fp8_e8m0 and transform_sf_into_required_layout by @yuxianq in #6654
- [https://nvbugs/5437106][fix] Fix llama4 scout TRTLLM attn_backend by @JunyiXu-nv in #6690
- [None][fix] Remove lock related typo in py_executor by @lancelly in #6653
- [None][feat] move kv cache measure into transfer session by @zhengd-nv in #6633
- [None][fix]revert kvcache transfer by @chuangz0 in #6709
- [TRTLLM-6650][fix] Enhance CUDA graph + Beam search to correctly handle padding by @stnie in #6665
- [TRTLLM-6308][feat] Support Aggregate mode for phi4-mm by @Wanli-Jiang in #6184
- [None][feat] Optimize CUDA graph memory usage for spec decode cases by @mikeiovine in #6718
- [TRTLLM-7025] [infra] Reorganize CODEOWNERS to rectify `examples` mapping by @venkywonka in #6762
- [None][doc] Move AutoDeploy README.md to torch docs by @Fridah-nv in #6528
- [None][fix] WAR GPT OSS on H20 with Triton MOE by @dongfengy in #6721
- [TRTLLM-6420][feat] add support for Eclairv2 model - cherry-pick changes and minor fix by @yibinl-nvidia in #6493
- [None][feat] Core Metrics Implementation by @hcyezhang in #5785
- [https://nvbugs/5398180][feat] Improve Llama4 performance for small max_seqlen cases by @nv-yilinf in #6306
- [TRTLLM-6637][feat] Resolve KV cache divergence issue by @ziyixiong-nv in #6628
- [None][infra] Waive test main 0808 by @EmmaQiaoCh in #6751
- [#5048][enhance] AutoDeploy: Optimize prepare_inputs by @galagam in #6634
- [None][chore] Dead code elimination, we no longer record/fetch through WindowBlockManager:: mContextBlocksByHash by @eopXD in #6249
- [TRTLLM-6174][feat] Enable FP32 mamba ssm cache by @shaharmor98 in #6574
- [https://nvbugs/5444937][fix] Fixing kv_cache_event unit test by @pcastonguay in #6753
- [TRTLLM-6823][doc] Add checkpoint refactor docs by @shaharmor98 in #6592
- [None][feat] Support SharedTensor on MultimodalParams by @yechank-nvidia in #6254
- [None][feat] improve dataloading for benchmark_dataset by using batch… by @zerollzeng in #6548
- [https://nvbugs/5431127][fix] Run test_disaggregated_deepseek_v3_lite_fp8_nixl[DeepSeek-V3-Lite-fp8] only on hopper by @bo-nv in #6736
- [None][fix] fix same pp disagg by @chuangz0 in #6730
- [None][feat] Add gpt-oss GSM8K test. by @Tracin in #6732
- [None][test] Test trtllm-bench AD vs, PT BEs on H100 single gpu by @MrGeva in #6487
- [TRTLLM-5633][infra] Force set changed file diff to empty string for post-merge CI by @yiqingy0 in #6777
- [None][chore] remove closed bugs by @xinhe-nv in #6772
- [None][infra] Waive failed tests on main 0811 by @EmmaQiaoCh in #6778
- fix: Ensure that Python stub generation works against libnvidia-ml stubs by @MartinMarciniszyn in #6188
- [TRTLLM-5532][feat] store the block of context request into kv cache by @byshiue in #6683
- [None][doc] Add K2 tool calling examples by @lancelly in #6667
- [None][infra] Unwaive an updated case to test by @EmmaQiaoCh in #6791
- [None][chore] always try-catch when clear build folder in build_wheel.py by @zhenhuaw-me in #6748
- [TRTLLM-6812][feat] Add standardized GitHub issue templates and disable blank issues by @venkywonka in #6494
- [None][fix] Refactoring to avoid circular import when importing torch models by @rakib-hasan in #6720
- [None][chore] Find LLM_ROOT and LLM_BACKEND_ROOT dynamically by @achartier in #6763
- [https://nvbugs/5385987][fix] Fix Qwen2 quantization issue by pinning transformers version by @ch...
v1.2.0rc5
Announcement Highlights
Vulnerability
- Two security vulnerabilities have been identified in the urllib3 package versions >= 1.24 and < 2.6.0. These issues will be addressed in the next release. For detailed information on the vulnerabilities, refer to the following advisories:
  - GHSA-gm62-xv2j-4w53
  - GHSA-2xpw-w6gg-jr37

  To mitigate the issues immediately, users are advised to upgrade urllib3 to version 2.6.0 or later; a version-check sketch follows.
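A minimal sketch of such a check, assuming the `packaging` library is available for version comparison (it usually ships alongside pip and setuptools):

```python
# Hedged sketch: verify the installed urllib3 is outside the affected
# >= 1.24, < 2.6.0 range described in the advisories above.
import urllib3
from packaging.version import Version

installed = Version(urllib3.__version__)
if Version("1.24") <= installed < Version("2.6.0"):
    print(f"urllib3 {installed} is in the affected range; upgrade to >= 2.6.0")
else:
    print(f"urllib3 {installed} is not affected")
```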
Model Support
- Slimmed down implementation of Nemotron H (#9235)
- Add support for Starcoder2 in the PyTorch backend (#8923)
- Add support for MLA chunked prefill for the DeepSeek V3.2 model (#9376)
- Add AutoDeploy support for Nemotron-Flash (#9504)
- AutoDeploy: Add Llama4 MoE handling (#9556)
- Add support for nano-v3 and super-v3 with PyTorch backend (#9261)
- AutoDeploy: Add support for nano v3 to custom implementation (#9465)
API
-
Feature
- Add support for KVCache reuse for DeepSeek V3.2 (#9383)
- Support Yarn on QwQ-32B model (#9059)
- Update DeepGEMM to include optimizations for DeepSeek-v3.2 (#9380)
- Cold L2 cache when doing autotune benchmarking (#8779)
- Improve TRTLLM MoE throughput for small hidden size (#9377)
- Add parser to layer-wise benchmarks (#9440)
- Support custom chat template for tool calling (#9297)
- Add draft token tree runtime on CDL (#8586)
- Top-p optimization by removing redundant softmax (#9411)
- Use FlashInfer's top_k_sampling_from_probs (#9457)
- Overlap context chunks in pipeline parallel mode (#9308)
- Improve all-to-all perf for large CP size in Helix (#9494)
- Support more accurate AR calculation (#9323)
- Support custom config of sharding (#9143)
- Integrate helix parallelism (#9342)
- Optimize RocketKV algorithm (#9333)
- Extend cute_dsl_nvfp4_gemm to sm103 (#9543)
- Add chat template kwargs support to longbench-v2 (#9544)
- Add Beam Search to TorchSampler (#8509)
- Unify nvfp4 gemm backend (#8963)
- Use FlashInfer.sampling by default (#9545)
- Add RocketKV usage doc and e2e accuracy test on LongBenchV2 (#9572)
- Alias to comply with LlmArgs (#9586)
- Update trtllm-gen nvfp4 kernels with better performance (#9510)
- Enable CuteDSL MoE with Large EP (#9592)
- Convert cuteDSL GEMM to opt-in feature (#9682)
- Optimize the load_weights method to include mapping parameter (#9583)
- Support torch compile for pipeline parallel Llama and DeepSeekV3 (#7838)
- Check if the executor is shut down in the /health entrypoint (#9057); a polling sketch follows this list
- Add NIXL-LIBFABRIC support (#9225)
- Decouple disagg service from FastAPI (#8714)
- AutoDeploy: Add NVFP4 Cutlass MoE kernels (#9551)
- AutoDeploy: Draft Target Speculative Decoding (#9275)
- AutoDeploy: Support TRTLLM Sampler (#9641)
- AutoDeploy: Perf optimization for Attention and rmsnorm (#9719)
- AutoDeploy: Use router gemm op for Nemotron MOE (#9500)
- AutoDeploy: Remove redundant copies in mamba layers (#9461)
- AutoDeploy: Add A_log fusion for Mamba layers (#9422)
- AutoDeploy: Update dist ops (#9301)
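A minimal polling sketch against the /health entrypoint (#9057), assuming a trtllm-serve instance at a placeholder URL; the exact status codes returned are an assumption:

```python
# Hedged sketch: wait for trtllm-serve's /health entrypoint to report
# healthy (#9057). The base URL is a placeholder.
import time

import requests

def wait_until_healthy(base_url: str = "http://localhost:8000",
                       timeout_s: float = 300.0) -> bool:
    """Return True once /health responds with 200, False on timeout."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            if requests.get(f"{base_url}/health", timeout=5).status_code == 200:
                return True
        except requests.ConnectionError:
            pass  # server not up yet
        time.sleep(2)
    return False

if __name__ == "__main__":
    print("healthy" if wait_until_healthy() else "not healthy")
```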
Fix
- Modify qwen3-next sampling stop_tokens (#9331)
- Fix mismatched nvfp4 gemm sf shape (#9336)
- Enhance warning in cacheTransBuffer (#9390)
- Fix top-k outIndices with vectorized_process (#9404)
- Let KV cache manager block initialization respect dry run (#9093)
- Avoid cudaFree overlap with cuda graph (#9438)
- Fix TP support for DeepSeek-V3.2 on Hopper (#9484)
- Fix Qwen3-235B ATP accuracy issue with PDL (#9530)
- Correct virtual memory allocation alignment (#9491)
- Fix view operation on uncontiguous tensor (#9576)
- Extract GPU count from single-node stage names (#9599)
- Refine Piecewise Cuda Graph condition for DP (#9393)
- Enhance RPC robustness (#8711)
- Fix synchronization bugs in KvCacheTransferManager preventing corrupted blocks (#9056)
- Fix dist-serving performance by clearing CPU affinity (#9549)
- Fix wide ep MoE error (#9642)
- Fix LoRA enablement for GPT OSS Torch (#8253)
- Recover TRTLLM MoE performance for DEP (#9562)
- Fix error when processing batches containing both text and multimodal data (#8381)
- Fix deepseek_fp8_block_scales using 2D x_sf in TRTLLMGEN-MoE (#9658)
- Enable hmac in RPC (#9745)
- Start disagg workers and servers on free ports (#9694)
- AutoDeploy: fix nano sharding config (#9668)
- AutoDeploy: Remove auto-tuner from nvfp4_gemm forward (#9497)
Documentation
- Fix math formula rendering issues (#9481)
- Qwen3 deployment guide (#9488)
- KV Connector Docs (#9325)
- Deployment Guide for Kimi K2 Thinking on TensorRT LLM - Blackwell (#9711)
- Add feature docs for helix parallelism (#9684)
- Add examples showcasing OpenAI compatible APIs (#9520)
- Update Linux installation guide (#9485)
- Refine the slurm examples (#9548)
- Link to modelopt checkpoints in quick start guide (#9571)
Test & Infra
- Rename AlltoAll backend names (#9329)
- Move build config from BaseLlmArgs to TrtLlmArgs (#9249)
- Reduce nested nvtx ranges (#9347)
- Add disagg and wideep multi-node multi-gpu test cases (#9356)
- Upgrade CuteDSL to 4.3.0 (#9444)
- Use flexcache for gh200 nodes (#9405)
- Evaluate helix parallelism with DSV3 Lite (#9597)
- AutoDeploy update cuda stream manager for multi-device (#9575)
- Add container notices and documentation (#9185)
- Increase warmup times in multi-gpu testing (#9578)
What's Changed
- [#9316][feat] AutoDeploy: Add the accuracy test for Nemotron MOE models by @nvchenghaoz in #9317
- [#9096][feature] Auto Deploy: configurable fused MoE backend by @nzmora-nvidia in #9194
- [None][fix] Use fp32 for indexer weight_proj GEMM by @chang-l in #9243
- [None][fix] Multimodal InputProcessor dummy builder fix by @yechank-nvidia in #8916
- [None][ci] waive test_disagg_server_restart by @QiJune in #9326
- [None][chore] Revise the description of enable_autotuner. by @hyukn in #9320
- [TRTLLM-9295][fix] use greedy decoding in test_openai_compatible_json_schema by @ixlmar in #9305
- [TRTLLM-9164][infra] Enable checking duplicate items in waives.txt in pre-commit by @EmmaQiaoCh in #9265
- [#9236][feature] Make sharing of activation_type across SW layers more robust by @nzmora-nvidia in #9238
- [https://nvbugs/5667687][fix] Set correct lm_head_tp_size_upper_bound by @lancelly in #9300
- [https://nvbugs/5667454][test] Fix Test Case as Chunked Attention not Supported on sm_120 by @yufeiwu-nv in #9260
- [None][chore] Weekly mass integration of release/1.1 by @mikeiovine in #8918
- [None][chore] Upgrade starlette and FastAPI by @tburt-nv in #9319
- [None][infra] Update goggles_action repository by @karljang in #9240
- [TRTLLM-9197][infra] Move thirdparty stuff to it's own listfile by @cheshirekow in #8986
- [TRI-332] [fix] Fix L0_backend_trtllm by @yinggeh in #9282
- [None][ci] waive test_llm_context_only_timed_out_kv_cache_exhausted by @QiJune in #9351
- [None][infra] Add fallback when get wheel from build stage is fail by @ZhanruiSunCh in #9290
- [TRTLLM-9183][infra] Add --waives-file in rerun pytest command by @yiqingy0 in #8971
- [TRTLLM-8957][feat] create communication related classes by @xxi-nv in #8968
- [None][chore] Add periodic junit xml path in conftest by @crazydemo in #9337
- [None][ci] waive a test case of test_ad_build_small_multi.py by @QiJune in #9355
- [None][infra] Waive failed cases in main post-merge on 11/21 by @EmmaQiaoCh in #9360
- [None][chore] Bump version to 1.2.0rc4 by @yiqingy0 in #9363
- [TRTLLM-8650][fix] beam search request validation (#8433) by @ixlmar in #9228
- [TRTLLM-9191][feat] support out-of-tree models in trtllm-serve by @ixlmar in #9269
- [https://nvbugs/5629833][fix] Don't fill tensors by @HuiGao-NV in #9296
- [None][feat] TRT-LLM Gen MoE optimize DeepSeek Fp8 activation kernel by @nekorobov in #9175
- [https://nvbugs/5590408][fix] Fallback to greedy sampling in two-model overlap scheduler by @ziyixiong-nv in #9321
- [TRTLLM-9208][infra] Document the process for C++ deps by @cheshirekow in #9016
- [TRTLLM-9370][feat] Integration of CuteDSL NVFP4 grouped GEMM (Part 2: SwiGLU Fusion and Finalize Fusion) by @syuoni in #9288
- [None][feat] Eagle: PostNorm and multilayer options by @IzzyPutterman in https:...