
Conversation


@lirundong lirundong commented Dec 5, 2025

Summary by CodeRabbit

  • New Features
    • Added a CUDA Tile backend for RMSNorm operations with optional fused residual computation, enabling an alternative normalization path for potential performance improvements. Users can enable this feature via the use_cuda_tile parameter in the RMSNorm module (a minimal usage sketch follows this list).
    • Supports multiple kernel variants including static-persistent execution models for optimized compute patterns.
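A minimal usage sketch of the new flag (not taken from this PR's diff; the constructor arguments other than use_cuda_tile and the tuple return on the residual path are assumptions based on the existing RMSNorm module):

import torch

from tensorrt_llm._torch.modules import rms_norm

# Assumed constructor signature; only `use_cuda_tile` is new in this PR.
norm = rms_norm.RMSNorm(hidden_size=4096, eps=1e-6, dtype=torch.bfloat16,
                        use_cuda_tile=True).cuda()

x = torch.randn(8, 4096, dtype=torch.bfloat16, device="cuda")
residual = torch.randn_like(x)

y = norm(x)                      # standard cuda.tile RMSNorm path
y, residual = norm(x, residual)  # fused path: adds x into residual in place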


Description

Why?

  • Verify TileIR's functional completeness in real-world inference workloads

What?

  • Add 6 RMSNorm kernels written in the cuda.tile DSL
  • Wrap them with 2 PyTorch custom ops
  • Integrate them into TRT-LLM's RMSNorm torch module (a minimal sketch of this wrapping pattern follows)
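Below is a minimal sketch of that wrapping pattern, under stated assumptions: the op namespace trtllm_sketch is made up, and the op body is a plain-PyTorch stand-in rather than the actual cuda.tile kernel launch used in this PR.

import torch

# Illustrative wrapping pattern only; the real op dispatches to cuda.tile
# kernels (standard / gather / static-persistent) instead of this reference body.
@torch.library.custom_op("trtllm_sketch::cuda_tile_rms_norm", mutates_args=())
def cuda_tile_rms_norm(x: torch.Tensor, weight: torch.Tensor,
                       eps: float) -> torch.Tensor:
    x_f32 = x.float()
    variance = x_f32.pow(2).mean(dim=-1, keepdim=True)
    return (x_f32 * torch.rsqrt(variance + eps) * weight.float()).to(x.dtype)


@cuda_tile_rms_norm.register_fake
def _(x: torch.Tensor, weight: torch.Tensor, eps: float) -> torch.Tensor:
    # Shape/dtype-only fake implementation so torch.compile can trace the op.
    return torch.empty_like(x)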

Temporary build-process change

Because the PyTorch DGX container shipped neither the tileiras compiler nor a CUDA toolkit >= 13.1, we initially had to install a secondary CUDA toolkit during the TRT-LLM docker build step so that cuda.tile could find a proper tileiras compiler at runtime:

# Previously
# make -C docker build
# Replace with
$ make -C docker build AUXI_CUDA_VER=13.1.0_590.44.01 AUXI_CUDA_INSTALL_PATH=/opt/cuda-13.1

The above workaround is no longer necessary since the TRT-LLM base image has been upgraded to DLFW 25.12. The only packaging change is adding the cuda-tile and nvidia-cuda-tileiras PyPI packages to requirements.txt.

Performance

We measured end-to-end throughput with the new cuda.tile RMS norm kernels integrated into DeepSeek-V3-Lite. The experiment setup is:

  • 1x B200 GPU, TP1
  • Synthetic data generated from benchmarks/cpp/prepare_dataset.py
  • Baseline models use FlashInfer RMS norm kernels with residual fusion where applicable; cuda.tile models use persistent cuTile kernels with residual fusion where applicable
  • We report Total Output Throughput (tokens/s) as measured by trtllm-bench

Output Throughput, DeepSeek-V3-Lite/bf16 (B200, tokens/s, ↑)

ISL, OSL     FlashInfer    cuda.tile
128, 128     15748.73      19974.92
128, 2048    12586.61      12008.21
128, 4096     7350.47       6935.28

Test Coverage

  • Newly added test tests/unittest/_torch/thop/parallel/test_cuda_tile_custom_ops.py covers both custom operations and all six cuda.tile kernels

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...

Provide a user-friendly way for developers to interact with a Jenkins server.

Run /bot [-h|--help] to print this help message.

See details below for each supported subcommand.

Details

run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]

Launch build/test pipelines. All previously running jobs will be killed.

--reuse-test (optional)pipeline-id (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will always be ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

--disable-reuse-test (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

--disable-fail-fast (OPTIONAL) : Disable fail fast on build/tests/infra failures.

--skip-test (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

--stage-list "A10-PyTorch-1, xxx" (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

--gpu-type "A30, H100_PCIe" (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

--test-backend "pytorch, cpp" (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

--only-multi-gpu-test (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

--disable-multi-gpu-test (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

--add-multi-gpu-test (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

--post-merge (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

--detailed-log (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

--debug (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the stage-list parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see docs/source/reference/ci-overview.md
and the scripts/test_to_stage_mapping.py helper.

kill

kill

Kill all running builds associated with the pull request.

skip

skip --comment COMMENT

Skip testing for the latest commit on the pull request. --comment "Reason for skipping build/test" is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

reuse-pipeline

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

@lirundong lirundong requested review from a team as code owners December 5, 2025 03:58
Contributor

coderabbitai bot commented Dec 5, 2025

📝 Walkthrough


This change introduces CUDA tile-based RMSNorm kernels with optional residual fusion and integrates them as an optional path into the existing RMSNorm module. It adds utility functions for CUDA tile availability detection, three kernel variants (standard, gather, and static-persistent) for both regular and fused-residual normalization, a custom Torch operation layer, and conditional re-exports.

Changes

  • CUDA Tile Utilities (tensorrt_llm/_torch/cuda_tile_utils.py): New module providing CUDA tile availability detection (IS_CUDA_TILE_AVAILABLE), a power-of-two calculation (next_power_of_2), and a ceiling-division helper (ceil_div). Conditionally sets the flag based on platform and import availability; a sketch of this pattern follows the list.
  • RMSNorm Kernel Implementations (tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py, tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py): Introduces six CUDA tile kernels: three RMSNorm variants (rms_norm_kernel, rms_norm_kernel_gather, rms_norm_kernel_static_persistent) and three fused-residual variants (rms_norm_fuse_residual_kernel, rms_norm_fuse_residual_kernel_gather, rms_norm_fuse_residual_kernel_static_persistent). Each supports tiled operations, optional Gemma bias, and multiple execution models.
  • CUDA Tile Module Integration (tensorrt_llm/_torch/cuda_tile_kernels/__init__.py): Conditionally re-exports the six CUDA tile kernel symbols under __all__ when IS_CUDA_TILE_AVAILABLE is true.
  • Custom PyTorch Operations (tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py): Implements two custom PyTorch ops: cuda_tile_rms_norm (standard normalization with tile-size selection and kernel dispatch) and cuda_tile_rms_norm_fuse_residual_ (in-place fused-residual variant with device/kernel selection logic).
  • RMSNorm Module Integration (tensorrt_llm/_torch/modules/rms_norm.py): Adds a use_cuda_tile parameter to RMSNorm.__init__ with a runtime availability check. Routes forward() through CUDA tile kernels when enabled, selecting between the fused-residual and standard kernel based on residual presence, while preserving the existing FlashInfer path for backward compatibility.
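To make the availability-detection pattern concrete, here is a sketch of what cuda_tile_utils.py plausibly looks like (the flag and helper names come from this walkthrough; the probed import name and the Linux-only guard are assumptions):

import platform

IS_CUDA_TILE_AVAILABLE = False
if platform.system() == "Linux":
    try:
        import cuda.tile  # noqa: F401  (assumed import name for the cuda.tile DSL)
    except ImportError:
        IS_CUDA_TILE_AVAILABLE = False
    else:
        IS_CUDA_TILE_AVAILABLE = True


def ceil_div(a: int, b: int) -> int:
    """Ceiling division helper, e.g. ceil_div(10, 4) == 3."""
    return (a + b - 1) // b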

Sequence Diagram(s)

sequenceDiagram
    participant User as User Code
    participant RMSNorm as RMSNorm Module
    participant CustomOp as Custom Op<br/>(cuda_tile_rms_norm)
    participant Kernel as CUDA Tile Kernel

    User->>RMSNorm: forward(x, weight,<br/>residual=None)
    alt use_cuda_tile=True
        RMSNorm->>RMSNorm: Check residual present
        alt residual provided
            RMSNorm->>CustomOp: cuda_tile_rms_norm_fuse_residual_<br/>(x, residual, weight, ...)
        else residual=None
            RMSNorm->>CustomOp: cuda_tile_rms_norm<br/>(x, weight, ...)
        end
        
        CustomOp->>CustomOp: Determine kernel path<br/>(static_persistent,<br/>gather flags)
        CustomOp->>Kernel: Launch selected kernel<br/>with tile config
        Kernel-->>CustomOp: return output/mutate x,residual
        CustomOp-->>RMSNorm: return output
    else use_cuda_tile=False
        RMSNorm->>RMSNorm: Use existing FlashInfer path
    end
    
    RMSNorm-->>User: return normalized output

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

  • Kernel logic complexity: Six new kernel implementations with intricate tiled operations, variance computation, and linear transformations requiring careful correctness review
  • Multiple execution models: Standard, gather-based, and static-persistent variants demand separate reasoning for each
  • Custom op dispatch: Careful review of kernel selection logic, parameter passing, tile sizing calculations, and device handling
  • Integration correctness: Forward path routing in RMSNorm module with residual handling and API compatibility
  • Dtype/device handling: Explicit dtype casting in kernels and consistency across module integration
  • Areas requiring extra attention:
    • Kernel correctness of tiled RMS computation and squared-sum reductions
    • Parameter passing between custom ops and kernels (tile sizes, eps, flags)
    • Static-persistent multi-block-per-program execution model in both regular and fused-residual kernels
    • Backward compatibility and mutual exclusivity with existing FlashInfer path
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 53.85%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title check ✅ Passed: The PR title clearly summarizes the main change: integrating cuda.tile RMS norm kernels into the codebase, which aligns with the changeset.
  • Description check ✅ Passed: The PR description includes most required template sections, with clear explanations of the changes and the test coverage.



Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 6

🧹 Nitpick comments (5)
tensorrt_llm/_torch/modules/rms_norm.py (1)

93-93: Clarify or track the gather=False device assertion failure issue.

The comment indicates gather=False causes subsequent device assertion failures. Consider opening an issue to track this limitation, or add a more detailed explanation of the root cause and conditions under which it occurs.

Would you like me to open an issue to track this gather=False limitation?

tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1)

108-109: Consider using ValueError instead of assert for input validation.

Assertions can be disabled with the -O flag. For public API boundaries, explicit exceptions provide clearer error messages and aren't stripped in optimized mode.

-        assert x.is_contiguous(), "x must be contiguous for in-place operation"
-        assert residual.is_contiguous(), "residual must be contiguous for in-place operation"
+        if not x.is_contiguous():
+            raise ValueError("x must be contiguous for in-place operation")
+        if not residual.is_contiguous():
+            raise ValueError("residual must be contiguous for in-place operation")
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1)

167-171: Remove dead code: b_broadcasted is always zero.

The bias tensor b_broadcasted is hardcoded to 0.0 and the subsequent ct.add(y, b_broadcasted) is a no-op that wastes computation.

             # Step 6: Apply linear transformation
             # Broadcast weight to match input shape
             w_broadcasted = ct.reshape(w, (1, TILE_SIZE_N))
-            b_broadcasted = ct.full((1, TILE_SIZE_N), 0.0, dtype=ct.float32)

-            # Apply linear transformation: y = x_normalized * w + b
+            # Apply weight scaling: y = x_normalized * w
             y = ct.mul(x_normalized, w_broadcasted)
-            y = ct.add(y, b_broadcasted)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (2)

155-155: Consider using ct.cdiv for consistency.

The manual ceiling division here differs from the approach in the other two kernels (lines 22, 95), which use ct.cdiv. For consistency and clarity, consider using the same pattern.

-        upper_bound = (M + TILE_SIZE_M - 1) // TILE_SIZE_M
+        upper_bound = ct.cdiv(M, TILE_SIZE_M)

214-218: Remove unnecessary zero bias.

Lines 214 and 218 create and add a zero bias tensor, which has no functional effect. This appears to be dead code or a placeholder for future bias support.

         # Step 6: Apply linear transformation
         # Broadcast weight to match input shape
         w_broadcasted = ct.reshape(w, (1, TILE_SIZE_N))
-        b_broadcasted = ct.full((1, TILE_SIZE_N), 0.0, dtype=ct.float32)
 
         # Apply linear transformation: y = x_normalized * w + b
         y = ct.mul(x_normalized, w_broadcasted)
-        y = ct.add(y, b_broadcasted)

If bias support is planned for the future, consider adding a comment explaining this is a placeholder.

📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between a736226 and fc5e8d6.

📒 Files selected for processing (6)
  • tensorrt_llm/_torch/cuda_tile_kernels/__init__.py (1 hunks)
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1 hunks)
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (1 hunks)
  • tensorrt_llm/_torch/cuda_tile_utils.py (1 hunks)
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1 hunks)
  • tensorrt_llm/_torch/modules/rms_norm.py (4 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
**/*.py

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

**/*.py: The code developed for TensorRT-LLM should conform to Python 3.8+
Indent Python code with 4 spaces; do not use tabs
Always maintain the namespace when importing in Python, even if only one class or function from a module is used (e.g., use from package.subpackage import foo and then foo.SomeClass() instead of from package.subpackage.foo import SomeClass)
Python filenames should use snake_case (e.g., some_file.py)
Python class names should use PascalCase (e.g., class SomeClass)
Python function and method names should use snake_case (e.g., def my_awesome_function():)
Python local variable names should use snake_case, with prefix k for variable names that start with a number (e.g., k_99th_percentile = ...)
Python global variables should use upper snake_case with prefix G (e.g., G_MY_GLOBAL = ...)
Python constants should use upper snake_case (e.g., MY_CONSTANT = ...)
Avoid shadowing variables declared in an outer scope in Python
Initialize all externally visible members of a Python class in the constructor
For Python interfaces that may be used outside a file, prefer docstrings over comments
Python comments should be reserved for code within a function, or interfaces that are local to a file
Use Google style docstrings for Python classes and functions, which can be parsed by Sphinx
Python attributes and variables can be documented inline with type and description (e.g., self.x = 5 followed by """<type>: Description of 'x'""" )
Avoid using reflection in Python when functionality can be easily achieved without reflection
When using try-except blocks in Python, limit the except clause to the smallest set of specific errors possible instead of catching all exceptions
When using try-except blocks in Python to handle multiple possible variable types (duck-typing), keep the body of the try as small as possible and use the else block to implement the logic

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/cuda_tile_utils.py
  • tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
**/*.{cpp,h,cu,py}

📄 CodeRabbit inference engine (CODING_GUIDELINES.md)

All TensorRT-LLM Open Source Software code files should contain an NVIDIA copyright header that includes the current year at the top

Files:

  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/cuda_tile_utils.py
  • tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
🧠 Learnings (8)
📓 Common learnings
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.
📚 Learning: 2025-10-20T16:54:09.824Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py:6-6
Timestamp: 2025-10-20T16:54:09.824Z
Learning: In tensorrt_llm/_torch/auto_deploy/custom_ops/rms_norm.py, the import `from ...modules.mamba.layernorm_gated import _layer_norm_fwd` is correct and should not be changed to modules.fla.layernorm_gated. The _layer_norm_fwd function exists in both modules/mamba/layernorm_gated.py and modules/fla/layernorm_gated.py, but the mamba version is the intended implementation for this use case.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-08-27T14:41:56.665Z
Learnt from: ixlmar
Repo: NVIDIA/TensorRT-LLM PR: 7294
File: tensorrt_llm/_torch/modules/rms_norm.py:96-99
Timestamp: 2025-08-27T14:41:56.665Z
Learning: In tensorrt_llm/_torch/modules/rms_norm.py, the RMSNorm class uses a custom sentinel (_ARGUMENT_NOT_SPECIFIED_SENTINEL) instead of Ellipsis (...) for detecting unspecified optional arguments. Other modules in the codebase may use Ellipsis as a sentinel but do not forward it to RMSNorm methods, so there's no need for backward compatibility with Ellipsis in RMSNorm.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-10-20T17:09:21.560Z
Learnt from: nvchenghaoz
Repo: NVIDIA/TensorRT-LLM PR: 8469
File: tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py:180-182
Timestamp: 2025-10-20T17:09:21.560Z
Learning: In tensorrt_llm/_torch/auto_deploy/transform/library/rms_norm.py, the _gated_rmsnorm_replacement function does not need to cast the output of torch.ops.auto_deploy.torch_rmsnorm_gated back to the input dtype, even though the custom op returns fp32. The dtype handling is managed elsewhere or the fp32 output is acceptable for downstream consumers.

Applied to files:

  • tensorrt_llm/_torch/modules/rms_norm.py
  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py
📚 Learning: 2025-08-22T01:54:35.850Z
Learnt from: djns99
Repo: NVIDIA/TensorRT-LLM PR: 7104
File: cpp/tensorrt_llm/kernels/cutlass_kernels/include/moe_kernels.h:999-1000
Timestamp: 2025-08-22T01:54:35.850Z
Learning: The `internal_cutlass_kernels` directory in TensorRT-LLM is a mirror of an internal NVIDIA repository and maintains its own implementation and API that may diverge from the public `cutlass_kernels` version. API inconsistencies between these two directories are intentional and by design, not bugs to be fixed.

Applied to files:

  • tensorrt_llm/_torch/cuda_tile_kernels/__init__.py
📚 Learning: 2025-11-14T11:22:03.729Z
Learnt from: nzmora-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 9163
File: tensorrt_llm/_torch/auto_deploy/custom_ops/quant.py:107-113
Timestamp: 2025-11-14T11:22:03.729Z
Learning: In TensorRT-LLM AutoDeploy custom ops, when adding hardware capability checks to select between kernel implementations (e.g., cuBLAS vs. CUDA kernel), use descriptive variable names that identify the specific GPU architectures or families being targeted (e.g., `is_blackwell_geforce_or_ada`) rather than generic names like `enable_cuda_core`. This makes it clear that the code is selecting an implementation path based on hardware capabilities, not enabling/disabling hardware features.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
📚 Learning: 2025-10-13T19:45:03.518Z
Learnt from: nv-lschneider
Repo: NVIDIA/TensorRT-LLM PR: 7910
File: tests/unittest/_torch/multi_gpu/test_nccl_device.py:138-149
Timestamp: 2025-10-13T19:45:03.518Z
Learning: In test_nccl_device.py, the NCCL device AllReduce implementation compares the entire residual tensor on each rank, unlike the UB implementation which compares per-rank chunks. The residual chunking calculations in the test are intentionally overridden to reflect this design difference.

Applied to files:

  • tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py
📚 Learning: 2025-08-08T22:03:40.707Z
Learnt from: sklevtsov-nvidia
Repo: NVIDIA/TensorRT-LLM PR: 3294
File: cpp/tensorrt_llm/kernels/cutlass_kernels/moe_gemm/moe_kernels.cu:1198-1209
Timestamp: 2025-08-08T22:03:40.707Z
Learning: In the CUTLASS MoE kernels (cpp/tensorrt_llm/cutlass_extensions), when `layout_info.fusion` is set to `TmaWarpSpecializedGroupedGemmInput::EpilogueFusion::FINALIZE`, the `router_scales` parameter must be non-null by design. The fused finalize kernel epilogue does not perform nullptr checks and requires valid router scales to function correctly. This is an implicit contract that callers must satisfy when enabling the FINALIZE fusion mode.

Applied to files:

  • tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py
🧬 Code graph analysis (2)
tensorrt_llm/_torch/cuda_tile_kernels/__init__.py (2)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (3)
  • rms_norm_kernel (9-59)
  • rms_norm_kernel_gather (63-100)
  • rms_norm_kernel_static_persistent (104-181)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (3)
  • rms_norm_fuse_residual_kernel (9-78)
  • rms_norm_fuse_residual_kernel_gather (82-130)
  • rms_norm_fuse_residual_kernel_static_persistent (134-228)
tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (2)
tests/unittest/_torch/attention/sparse/test_dsa_indexer.py (1)
  • cdiv (44-46)
tensorrt_llm/functional.py (3)
  • sum (3253-3275)
  • arange (1498-1569)
  • scatter (2279-2310)
🪛 Ruff (0.14.7)
tensorrt_llm/_torch/modules/rms_norm.py

46-46: Avoid specifying long messages outside the exception class

(TRY003)


48-48: Avoid specifying long messages outside the exception class

(TRY003)

tensorrt_llm/_torch/cuda_tile_kernels/__init__.py

11-18: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py

86-86: Unused function argument: weight

(ARG001)


87-87: Unused function argument: eps

(ARG001)


88-88: Unused function argument: static_persistent

(ARG001)


89-89: Unused function argument: gather

(ARG001)


90-90: Unused function argument: use_gemma

(ARG001)

🔇 Additional comments (8)
tensorrt_llm/_torch/modules/rms_norm.py (1)

79-104: CUDA Tile path integration looks correct.

The branching logic properly prioritizes the CUDA Tile path when enabled, with appropriate fallbacks. The contiguous() calls for the fused residual path align with the in-place operation requirements.

tensorrt_llm/_torch/custom_ops/cuda_tile_custom_ops.py (1)

83-92: Fake registration signature is correct.

The unused arguments flagged by static analysis (ARG001) are false positives. These parameters are required to match the real op's signature for torch.compile compatibility.

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm.py (1)

8-59: Standard RMSNorm kernel implementation looks correct.

The kernel properly accumulates squared values, computes the reciprocal square root of the mean, and applies the normalization with optional Gemma-style bias. The two-pass approach (sum then normalize) is appropriate for numerical stability.
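To make the two-pass structure concrete, here is a plain-PyTorch sketch of the same computation (the explicit tile loop and default tile size are illustrative; this is a reference for reasoning about the kernel, not the kernel itself):

import torch

def rms_norm_reference(x: torch.Tensor, w: torch.Tensor, eps: float,
                       tile_n: int = 256) -> torch.Tensor:
    """Two-pass RMSNorm over the last dim, mirroring the kernel structure."""
    n = x.shape[-1]
    # Pass 1: accumulate the squared sum tile by tile.
    sq_sum = torch.zeros(x.shape[:-1], dtype=torch.float32, device=x.device)
    for start in range(0, n, tile_n):
        tile = x[..., start:start + tile_n].float()
        sq_sum += tile.pow(2).sum(dim=-1)
    inv_rms = torch.rsqrt(sq_sum / n + eps).unsqueeze(-1)
    # Pass 2: normalize each tile and apply the weight.
    out = torch.empty_like(x)
    for start in range(0, n, tile_n):
        tile = x[..., start:start + tile_n].float()
        w_tile = w[start:start + tile_n].float()
        out[..., start:start + tile_n] = (tile * inv_rms * w_tile).to(x.dtype)
    return out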

tensorrt_llm/_torch/cuda_tile_utils.py (1)

9-20: Add guard or document precondition for next_power_of_2.

The function returns 0 for input 0 (not a power of 2) and has undefined behavior for negative inputs. Add either input validation or a docstring clearly stating the precondition that n must be a positive integer.
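One possible shape for such a guard, as a sketch rather than a patch against the actual file:

def next_power_of_2(n: int) -> int:
    """Return the smallest power of two that is >= n.

    Args:
        n: A positive integer; values <= 0 are rejected explicitly.
    """
    if n <= 0:
        raise ValueError(f"next_power_of_2 expects a positive integer, got {n}")
    return 1 << (n - 1).bit_length()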

tensorrt_llm/_torch/cuda_tile_kernels/rms_norm_fuse_residual.py (4)

1-6: LGTM on conditional import structure.

The conditional import pattern correctly guards CUDA tile functionality and ensures the file can be imported even when CUDA tile is unavailable.


8-79: LGTM on kernel implementation.

The two-pass fused residual RMSNorm logic is correct:

  1. First pass computes x + residual, stores to residual, accumulates squared values
  2. Computes RMS normalization factor using rsqrt
  3. Second pass applies normalization and weight scaling

The in-place mutation of the residual tensor is intentional and correctly synchronized since each row is processed independently.
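A compact plain-PyTorch reference for this contract (illustrative; useful mainly as a baseline when checking the kernel's outputs):

import torch

def rms_norm_fuse_residual_reference(x: torch.Tensor, residual: torch.Tensor,
                                     w: torch.Tensor, eps: float) -> torch.Tensor:
    """residual is updated in place to x + residual; the return value is
    RMSNorm(x + residual) scaled by w, cast back to the input dtype."""
    residual.add_(x)  # pass 1: write x + residual back into residual
    h = residual.float()
    inv_rms = torch.rsqrt(h.pow(2).mean(dim=-1, keepdim=True) + eps)
    return (h * inv_rms * w.float()).to(x.dtype)  # pass 2: normalize and scale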


81-131: LGTM on gather-based kernel variant.

The gather/scatter implementation is functionally equivalent to the tiled load/store version, with appropriate adjustments for 1D tile shapes and axis indices.


133-229: LGTM on static persistent kernel structure.

The static persistent implementation correctly processes multiple row blocks per program with proper residual fusion, RMS computation, and normalization. The latency hints and TMA settings are well-documented with performance impact notes.

@svc-trtllm-gh-bot svc-trtllm-gh-bot added the "Community want to contribute" label (PRs initiated from Community) on Dec 5, 2025


@ct.kernel
def rms_norm_kernel(
Collaborator


Could you please add unit tests for these cuda tile kernels?

Author


Ack. Will do.

Author


Added test script tests/unittest/_torch/thop/parallel/test_cuda_tile_custom_ops.py, which covers all kernels.
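For reference, a condensed sketch of the kind of check such a test performs (the RMSNorm constructor arguments, the weight attribute, and the tolerances here are assumptions; the actual test file is the source of truth):

import pytest
import torch

from tensorrt_llm._torch.modules import rms_norm


@pytest.mark.parametrize("dtype", [torch.float16, torch.bfloat16])
def test_cuda_tile_rms_norm_matches_reference_sketch(dtype):
    hidden, eps = 4096, 1e-6
    x = torch.randn(8, hidden, dtype=dtype, device="cuda")
    weight = torch.randn(hidden, dtype=dtype, device="cuda")

    norm = rms_norm.RMSNorm(hidden_size=hidden, eps=eps, dtype=dtype,
                            use_cuda_tile=True).cuda()
    norm.weight.data.copy_(weight)  # assumed weight attribute

    ref = (x.float() * torch.rsqrt(x.float().pow(2).mean(-1, keepdim=True) + eps)
           * weight.float()).to(dtype)
    torch.testing.assert_close(norm(x), ref, rtol=5e-2, atol=5e-2)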

Author


Confirmed that these tests pass without installing a secondary CUDA toolkit on Blackwell GPUs. Could you take a look and mark this thread as resolved? @QiJune

@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch 5 times, most recently from 840ab44 to 7b2ce4e on December 12, 2025 09:34
@lirundong lirundong requested review from a team as code owners December 12, 2025 09:34
@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch 2 times, most recently from b6451f6 to 56b2dbc on December 12, 2025 10:11

install_auxiliary_cuda_toolkit() {
local AUXI_CUDA_VER_SHORT=${AUXI_CUDA_VER%_*}
curl -L -o auxi_cuda_linux.sh https://developer.download.nvidia.com/compute/cuda/${AUXI_CUDA_VER_SHORT}/local_installers/cuda_${AUXI_CUDA_VER}_linux.run
Collaborator


cc @chzblych @ZhanruiSunCh to review this part

Collaborator


I don't see how AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH are exactly defined in the PR. It looks like the new package can only be installed when starting a docker container locally?

How does it affect CI pipeline and CI images?

Author


how AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH are exactly defined in the PR

No code directly refers to these env variables; they just specify a secondary CUDA toolkit install directory, and the cuda.tile compiler then finds the corresponding locations.

It looks like the new package can only be installed when starting a docker container locally?

Yes, only when the user explicitly specifies AUXI_CUDA_VER and AUXI_CUDA_INSTALL_PATH at the make build step locally.

How does it affect CI pipeline and CI images?

I'm not sure about this, but it should not affect other build/test/runtime logic as long as the installed toolkit is not symlinked to `/usr/local/bin`.

Author

@lirundong lirundong Jan 26, 2026


Verified locally that after upgrading the TRT-LLM base image to 25.12, the TileIR kernels are functional and pass our newly added tests. The only modification beyond the _torch/ directory is the two new dependencies added to requirements.txt. Could you take a look again and possibly grant CI running approvals? @chzblych

@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch from 56b2dbc to b4458ba on January 16, 2026 08:47
@lirundong
Author

/bot run

@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch 2 times, most recently from e21bf18 to 64824b7 on January 23, 2026 06:07
@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch 2 times, most recently from 2095a41 to 41f28e2 on January 26, 2026 07:14
Co-authored-by: Jinman Xie <jinmanx@nvidia.com>
Co-authored-by: Alexey Bylinkin <abylinkin@nvidia.com>
Co-authored-by: Qiqi Xiao <qiqix@nvidia.com>
Co-authored-by: Biao Wang <biaow@nvidia.com>
Co-authored-by: Thomas Schmid <thschmid@nvidia.com>
Signed-off-by: Rundong (David) Li <davidli@nvidia.com>
@lirundong lirundong force-pushed the davidli/feat/cuda-tile-rms-norm branch from 41f28e2 to 20b62b4 on January 26, 2026 08:49
@lirundong
Author

/bot run

1 similar comment
@nv-lschneider
Collaborator

/bot run

@tensorrt-cicd
Collaborator

PR_Github #33598 [ run ] triggered by Bot. Commit: 20b62b4

@tensorrt-cicd
Collaborator

PR_Github #33598 [ run ] completed with state SUCCESS. Commit: 20b62b4
/LLM/main/L0_MergeRequest_PR pipeline #25918 completed with status: 'FAILURE'

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again


Labels

Community want to contribute (PRs initiated from Community)
