[None][feat] Integrate cuda.tile RMS norm kernels #9725
base: main
Conversation
📝 Walkthrough

This change introduces CUDA tile-based RMSNorm kernels with optional residual fusion and integrates them as an optional path into the existing RMSNorm module. It adds utility functions for CUDA tile availability detection, three kernel variants (standard, gather, and static-persistent) for both regular and fused-residual normalization, a custom Torch operation layer, and conditional re-exports.
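For orientation, the math these kernels implement can be written as a short plain-PyTorch reference. This is a minimal sketch of standard RMSNorm with an optional pre-added residual; the function name and the eps default are illustrative, not taken from the PR, and this is not the tile kernel code itself.

```python
from typing import Optional

import torch


def rms_norm_reference(x: torch.Tensor, weight: torch.Tensor,
                       residual: Optional[torch.Tensor] = None,
                       eps: float = 1e-6) -> torch.Tensor:
    """Plain-PyTorch reference for RMSNorm with optional residual fusion.

    x:        [num_tokens, hidden_size] activations
    weight:   [hidden_size] learned scale
    residual: optional tensor of the same shape as x, added before normalization
    """
    if residual is not None:
        # The fused kernels fold this add into their first pass and also
        # write the sum back into the residual tensor in place.
        x = x + residual
    variance = x.float().pow(2).mean(dim=-1, keepdim=True)
    x_normed = x.float() * torch.rsqrt(variance + eps)
    return (x_normed * weight.float()).to(x.dtype)
```

The kernel variants listed in the walkthrough (standard, gather, static-persistent) are alternative execution strategies for this same computation.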
Sequence Diagram(s)

sequenceDiagram
participant User as User Code
participant RMSNorm as RMSNorm Module
participant CustomOp as Custom Op<br/>(cuda_tile_rms_norm)
participant Kernel as CUDA Tile Kernel
User->>RMSNorm: forward(x, weight,<br/>residual=None)
alt use_cuda_tile=True
RMSNorm->>RMSNorm: Check residual present
alt residual provided
RMSNorm->>CustomOp: cuda_tile_rms_norm_fuse_residual_<br/>(x, residual, weight, ...)
else residual=None
RMSNorm->>CustomOp: cuda_tile_rms_norm<br/>(x, weight, ...)
end
CustomOp->>CustomOp: Determine kernel path<br/>(static_persistent,<br/>gather flags)
CustomOp->>Kernel: Launch selected kernel<br/>with tile config
Kernel-->>CustomOp: return output/mutate x,residual
CustomOp-->>RMSNorm: return output
else use_cuda_tile=False
RMSNorm->>RMSNorm: Use existing FlashInfer path
end
RMSNorm-->>User: return normalized output
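The diagram above can be paraphrased as a small Python sketch. The branch order and op names follow the diagram; the stub functions stand in for the real custom ops and the FlashInfer fallback, whose exact signatures are assumptions here.

```python
from typing import Optional

import torch


# Stand-ins for the real custom ops / fallback; hypothetical signatures.
def cuda_tile_rms_norm(x, weight, eps):
    raise NotImplementedError("replace with the registered custom op")


def cuda_tile_rms_norm_fuse_residual_(x, residual, weight, eps):
    raise NotImplementedError("replace with the registered in-place custom op")


def flashinfer_rms_norm(x, weight, residual, eps):
    raise NotImplementedError("replace with the existing FlashInfer path")


def rms_norm_forward_sketch(x: torch.Tensor, weight: torch.Tensor,
                            residual: Optional[torch.Tensor],
                            use_cuda_tile: bool, eps: float = 1e-6):
    """Mirrors the branch order in the sequence diagram; not the module's actual code."""
    if use_cuda_tile:
        if residual is not None:
            # Fused path: the kernel adds residual into x and normalizes,
            # mutating both tensors in place, so both must be contiguous.
            x = x.contiguous()
            residual = residual.contiguous()
            cuda_tile_rms_norm_fuse_residual_(x, residual, weight, eps)
            return x, residual
        return cuda_tile_rms_norm(x, weight, eps)
    return flashinfer_rms_norm(x, weight, residual, eps)
```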
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 6
🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/rms_norm.py (1)

93-93: Clarify or track the `gather=False` device assertion failure issue.
The comment indicates `gather=False` causes subsequent device assertion failures. Consider opening an issue to track this limitation, or add a more detailed explanation of the root cause and conditions under which it occurs. Would you like me to open an issue to track this `gather=False` limitation?

tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1)
108-109: Consider using `ValueError` instead of `assert` for input validation.
Assertions can be disabled with the `-O` flag. For public API boundaries, explicit exceptions provide clearer error messages and aren't stripped in optimized mode.

    - assert x.is_contiguous(), "x must be contiguous for in-place operation"
    - assert residual.is_contiguous(), "residual must be contiguous for in-place operation"
    + if not x.is_contiguous():
    +     raise ValueError("x must be contiguous for in-place operation")
    + if not residual.is_contiguous():
    +     raise ValueError("residual must be contiguous for in-place operation")

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1)
167-171: Remove dead code: `b_broadcasted` is always zero.
The bias tensor `b_broadcasted` is hardcoded to 0.0 and the subsequent `ct.add(y, b_broadcasted)` is a no-op that wastes computation.

      # Step 6: Apply linear transformation
      # Broadcast weight to match input shape
      w_broadcasted = ct.reshape(w, (1, TILE_SIZE_N))
    - b_broadcasted = ct.full((1, TILE_SIZE_N), 0.0, dtype=ct.float32)
    - # Apply linear transformation: y = x_normalized * w + b
    + # Apply weight scaling: y = x_normalized * w
      y = ct.mul(x_normalized, w_broadcasted)
    - y = ct.add(y, b_broadcasted)

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (2)
155-155: Consider using `ct.cdiv` for consistency.
The manual ceiling division here differs from the approach in the other two kernels (lines 22, 95), which use `ct.cdiv`. For consistency and clarity, consider using the same pattern.

    - upper_bound = (M + TILE_SIZE_M - 1) // TILE_SIZE_M
    + upper_bound = ct.cdiv(M, TILE_SIZE_M)
214-218: Remove unnecessary zero bias.
Lines 214 and 218 create and add a zero bias tensor, which has no functional effect. This appears to be dead code or a placeholder for future bias support.

      # Step 6: Apply linear transformation
      # Broadcast weight to match input shape
      w_broadcasted = ct.reshape(w, (1, TILE_SIZE_N))
    - b_broadcasted = ct.full((1, TILE_SIZE_N), 0.0, dtype=ct.float32)
      # Apply linear transformation: y = x_normalized * w + b
      y = ct.mul(x_normalized, w_broadcasted)
    - y = ct.add(y, b_broadcasted)

If bias support is planned for the future, consider adding a comment explaining this is a placeholder.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (6)
- tensorrt_llm/_torch/cuda_tile_kernels/__init__.py (1 hunks)
- tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1 hunks)
- tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (1 hunks)
- tensorrt_llm/_torch/cuda_tile_utils.py (1 hunks)
- tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1 hunks)
- tensorrt_llm/_torch/modules/rms_norm.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use `from package.subpackage import foo` and then `foo.SomeClass()` instead of `from package.subpackage.foo import SomeClass`)
Python filenames should use snake_case (e.g., `some_file.py`)
Python class names should use PascalCase (e.g., `class SomeClass`)
Python function and method names should use snake_case (e.g., `def my_awesome_function():`)
Python local variable names should use snake_case, with prefix `k` for variable names that start with a number (e.g., `k_99th_percentile = ...`)
Python global variables should use upper snake_case with prefix `G` (e.g., `G_MY_GLOBAL = ...`)
Python constants should use upper snake_case (e.g., `MY_CONSTANT = ...`)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., `self.x = 5` followed by `"""<type>: Description of 'x'"""`)
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic
Files:
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/cuda_tile_utils.py
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
**/*.{cpp,h,cu,py}
📄 CodeRabbit inference engine (CODING_GUIDELINES.md)
All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top
Files:
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/cuda_tile_utils.py
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-08-27T14:41:56.665Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.665Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Applied to files:
tensorrt_llm/_torch/modules/rms_norm.py
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.
Applied to files:
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.
Applied to files:
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
Applied to files:
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.
Applied to files:
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py (2)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (3)
rms_norm_kernel (9-59)
rms_norm_kernel_gather (63-100)
rms_norm_kernel_static_persistent (104-181)

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (3)
rms_norm_fuse_residual_kernel (9-78)
rms_norm_fuse_residual_kernel_gather (82-130)
rms_norm_fuse_residual_kernel_static_persistent (134-228)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (2)
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1)
cdiv (44-46)

tensorrt_llm/functional.py (3)
sum (3253-3275)
arange (1498-1569)
scatter (2279-2310)
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/modules/rms_norm.py
46-46: Avoid specifying long messages outside the exception class
(TRY003)
48-48: Avoid specifying long messages outside the exception class
(TRY003)
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
11-18: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
86-86: Unused function argument: weight
(ARG001)
87-87: Unused function argument: eps
(ARG001)
88-88: Unused function argument: static_persistent
(ARG001)
89-89: Unused function argument: gather
(ARG001)
90-90: Unused function argument: use_gemma
(ARG001)
🔇 Additional comments (8)
tensorrt_llm/_torch/modules/rms_norm.py (1)
79-104: CUDA Tile path integration looks correct.
The branching logic properly prioritizes the CUDA Tile path when enabled, with appropriate fallbacks. The contiguous() calls for the fused residual path align with the in-place operation requirements.
tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1)
83-92: Fake registration signature is correct.
The unused arguments flagged by static analysis (ARG001) are false positives. These parameters are required to match the real op's signature for `torch.compile` compatibility.

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1)
8-59: Standard RMSNorm kernel implementation looks correct.
The kernel properly accumulates squared values, computes the reciprocal square root of the mean, and applies the normalization with optional Gemma-style bias. The two-pass approach (sum then normalize) is appropriate for numerical stability.
tensorrt_llm/_torch/cuda_tile_utils.py (1)
9-20: Add guard or document precondition for `next_power_of_2`.
The function returns 0 for input 0 (not a power of 2) and has undefined behavior for negative inputs. Add either input validation or a docstring clearly stating the precondition that `n` must be a positive integer. (A guarded sketch follows this comment list.)

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (4)
1-6: LGTM on conditional import structure.
The conditional import pattern correctly guards CUDA tile functionality and ensures the file can be imported even when CUDA tile is unavailable.

8-79: LGTM on kernel implementation.
The two-pass fused residual RMSNorm logic is correct:
- First pass computes x + residual, stores to residual, accumulates squared values
- Computes RMS normalization factor using rsqrt
- Second pass applies normalization and weight scaling
The in-place mutation of the residual tensor is intentional and correctly synchronized since each row is processed independently.
81-131: LGTM on gather-based kernel variant.
The gather/scatter implementation is functionally equivalent to the tiled load/store version, with appropriate adjustments for 1D tile shapes and axis indices.

133-229: LGTM on static persistent kernel structure.
The static persistent implementation correctly processes multiple row blocks per program with proper residual fusion, RMS computation, and normalization. The latency hints and TMA settings are well-documented with performance impact notes.
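Related to the `next_power_of_2` comment above, here is one possible guarded shape of that helper — a minimal sketch assuming the bit-length idiom; the actual cuda_tile_utils.py implementation may differ.

```python
def next_power_of_2(n: int) -> int:
    """Return the smallest power of two greater than or equal to ``n``.

    Args:
        n: A positive integer; zero and negative values are rejected.
    """
    if n <= 0:
        raise ValueError(f"next_power_of_2 expects a positive integer, got {n}")
    return 1 << (n - 1).bit_length()
```

With the guard in place, next_power_of_2(1) returns 1 and next_power_of_2(0) raises instead of silently returning 0.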
    @ct.kernel
    def rms_norm_kernel(
Could you please add unit tests for these cuda tile kernels?
Ack. Will do.
Added test script tests/unittest/_torch/thop/parallel/test_cuda_tile_custom_ops.py, which covers all kernels.
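For readers who cannot open the test file, a comparison test in that spirit could look like the sketch below. The shapes, tolerances, and stand-in op are assumptions; the stub must be swapped for the actual registered cuda.tile custom op to test the real kernels.

```python
import pytest
import torch


def rms_norm_reference(x, weight, eps=1e-6):
    """fp32 reference used as ground truth."""
    xf = x.float()
    y = xf * torch.rsqrt(xf.pow(2).mean(-1, keepdim=True) + eps)
    return (y * weight.float()).to(x.dtype)


# Stand-in for the cuda.tile custom op under test; replace with the op
# registered by cuda_tile_custom_ops.py when running against the real kernels.
cuda_tile_rms_norm_op = rms_norm_reference


@pytest.mark.skipif(not torch.cuda.is_available(), reason="needs a CUDA device")
@pytest.mark.parametrize("rows,hidden", [(32, 4096), (7, 8192)])
def test_rms_norm_matches_reference(rows, hidden):
    torch.manual_seed(0)
    x = torch.randn(rows, hidden, dtype=torch.bfloat16, device="cuda")
    w = torch.randn(hidden, dtype=torch.bfloat16, device="cuda")
    torch.testing.assert_close(cuda_tile_rms_norm_op(x, w),
                               rms_norm_reference(x, w),
                               atol=2e-2, rtol=2e-2)
```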
Confirmed that these tests pass without installing a secondary CUDA toolkit on Blackwell GPUs. Could you take a look and mark this thread as resolved? @QiJune
840ab44 to 7b2ce4e (force-pushed)
b6451f6 to 56b2dbc (force-pushed)
    install_auxiliary_cuda_toolkit() {
        local AUXI_CUDA_VER_SHORT=${AUXI_CUDA_VER%_*}
        curl -L -o auxi_cuda_linux.sh https://developer.download.nvidia.com/compute/cuda/${AUXI_CUDA_VER_SHORT}/local_installers/cuda_${AUXI_CUDA_VER}_linux.run
cc @chzblych @ZhanruiSunCh to review this part
I don't see how AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH are exactly defined in the PR. It looks like the new package can only be installed when starting a docker container locally?
How does it affect the CI pipeline and CI images?
> how AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH are exactly defined in the PR

No code directly refers to these environment variables; they just specify a secondary CUDA toolkit install directory, and the cuda.tile compiler then finds the corresponding locations.

> It looks like the new package can only be installed when starting a docker container locally?

Yes, only when the user explicitly specifies AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH at the make build step locally.

> How does it affect CI pipeline and CI images?

I'm not sure about this, but it should not affect other build/test/runtime logic as long as the installed toolkit is not symlinked to `/usr/local/bin`.
Verified locally that after upgrading the TRT-LLM base image to 25.12, the TileIR kernels are functional and pass our newly added tests. The only modification beyond the _torch/ directory is the two new dependencies added to requirements.txt. Could you take a look again and possibly grant CI running approvals? @chzblych
56b2dbc to b4458ba (force-pushed)
/bot run
e21bf18 to 64824b7 (force-pushed)
2095a41 to 41f28e2 (force-pushed)
Co-authored-by: Jinman Xie <jinmanx@nvidia.com>
Co-authored-by: Alexey Bylinkin <abylinkin@nvidia.com>
Co-authored-by: Qiqi Xiao <qiqix@nvidia.com>
Co-authored-by: Biao Wang <biaow@nvidia.com>
Co-authored-by: Thomas Schmid <thschmid@nvidia.com>
Signed-off-by: Rundong (David) Li <davidli@nvidia.com>
41f28e2 to 20b62b4 (force-pushed)
/bot run
1 similar comment
/bot run
PR_Github #33598 [ run ] triggered by Bot. Commit:
PR_Github #33598 [ run ] completed with state
Summary by CodeRabbit
New Features: optional CUDA tile-based RMSNorm kernels, controlled by the `use_cuda_tile` parameter in the RMSNorm module.
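A rough usage sketch of the new flag follows. Every constructor and forward argument here except `use_cuda_tile` is a hypothetical placeholder, since the exact RMSNorm signature is not shown on this page; only the module path, class name, and flag are taken from the PR.

```python
import torch
from tensorrt_llm._torch.modules.rms_norm import RMSNorm

# Hypothetical constructor/forward arguments; only `use_cuda_tile` is introduced by this PR.
norm = RMSNorm(hidden_size=4096, eps=1e-6, dtype=torch.bfloat16, use_cuda_tile=True)
x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
residual = torch.randn_like(x)
out = norm(x, residual=residual)  # residual present -> fused cuda.tile path per the walkthrough
```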
Description
Why?
What?
- RMS norm kernels implemented in the cuda.tile DSL
- Integrated as an optional path into the RMSNorm torch module

Build process temporary change

Because the PyTorch DGX container shipped neither the tileiras compiler nor a CUDA toolkit >= 13.1, at that point we had to install a secondary CUDA toolkit during the TRT-LLM docker build step to ensure cuda.tile finds a proper tileiras compiler at runtime. The above workaround is no longer necessary since the TRT-LLM base image has been upgraded to DLFW 25.12. The only modification to packaging is adding the cuda-tile and nvidia-cuda-tileiras PyPI packages to requirements.txt.
Performance
We measured the end-to-end throughput with these new cuda.tile RMS norm kernels integrated into DeepSeek-V3-Lite. The experiment setup is:
- baseline models use the flashinfer RMS norm kernels with possible residual fusion;
- cuda.tile models use persistent cuTile kernels with possible residual fusion;
- throughput measured with trtllm-bench.

Output Throughput DeepSeek-V3-Lite/bf16 (B200, token/s, ↑)
Test Coverage
Unit tests for the new cuda.tile kernels (tests/unittest/_torch/thop/parallel/test_cuda_tile_custom_ops.py).
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message. See details below for each supported subcommand.
Details
`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug (experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.
--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.
--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.
--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.
--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.
--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.
--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.
--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.
--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.
--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.
--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.
--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".
--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.
--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md and the scripts/test_to_stage_mapping.py helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.