[#10780][feat] AutoDeploy: Support per-expert scales in FP8 MoE #10814
Conversation
📝 Walkthrough
This PR introduces support for handling different per-expert input scales in FP8 MoE fusion.
Sequence Diagram(s)
sequenceDiagram
participant Config as Config (allow_different_input_scales flag)
participant Transform as FuseFP8Moe._apply
participant StackWeights as _stack_fp8_moe_weights
participant ComputeScales as Compute Max Scales
participant CustomOp as Custom Operator<br/>(trtllm_quant_fp8_moe_fused)
Config->>Transform: Load config with allow_different_input_scales
Transform->>StackWeights: Call with flag
StackWeights->>ComputeScales: Check expert input scales
alt allow_different_input_scales enabled
ComputeScales->>ComputeScales: Find max scale across experts
ComputeScales->>StackWeights: Return max scales
StackWeights->>StackWeights: Log warning (scales differ)
else allow_different_input_scales disabled
ComputeScales->>ComputeScales: Assert all scales identical
ComputeScales->>StackWeights: Return scales
end
StackWeights->>CustomOp: Pass fc1_act_scale_max<br/>(precomputed max)
CustomOp->>CustomOp: Quantize with max scale
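To make the branch above concrete, here is a minimal Python sketch of the scale-collapsing step; the helper name and signature are hypothetical, not the PR's actual function:

```python
import torch


def collapse_fp8_input_scales(
    per_expert_scales: torch.Tensor, allow_different_input_scales: bool
) -> torch.Tensor:
    """Collapse per-expert FP8 input scales (shape [num_experts]) into one scalar.

    Mirrors the branch in the diagram: either require identical scales,
    or warn and fall back to the max scale across experts.
    """
    max_scale = per_expert_scales.max()
    if torch.all(per_expert_scales == per_expert_scales[0]):
        return max_scale  # all experts already share one scale
    if not allow_different_input_scales:
        raise AssertionError("FP8 MoE experts have different input scales")
    # Scales differ: warn and use the max so no expert's activations overflow.
    print("warning: per-expert FP8 input scales differ; using max scale")
    return max_scale
```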
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
In `tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py`, around lines 1143-1144: fix the assertion message in the `torch.allclose(output, ref_output, rtol=0.05, atol=0.05)` check. Update the f-string so the message includes a space (or newline) before "Max diff:" and correctly reflects the rtol/atol values used; adjust the text to something like "rtol=0.05, atol=0.05 Max diff: {(output - ref_output).abs().max()}" to avoid concatenation and a value mismatch.
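For illustration, the corrected assertion might look like this (a sketch following the snippet quoted above, not necessarily the test's exact code):

```python
assert torch.allclose(output, ref_output, rtol=0.05, atol=0.05), (
    f"rtol=0.05, atol=0.05 Max diff: {(output - ref_output).abs().max()}"
)
```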
🧹 Nitpick comments (1)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py (1)

1127-1134: Consider using attribute access instead of getattr with constant strings. The static analysis hint is valid but minor. Using direct attribute access is cleaner when the attribute name is known at write-time.

♻️ Suggested improvement

```diff
 if backend == "trtllm":
-    actual_w1_input_max = getattr(gm, "quant_moe_fc1_act_scale_max_0")
+    actual_w1_input_max = gm.quant_moe_fc1_act_scale_max_0
 else:
-    actual_w1_input_max = getattr(gm, "quant_moe_w1_input_scale_max_0").squeeze()
+    actual_w1_input_max = gm.quant_moe_w1_input_scale_max_0.squeeze()
```
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py (outdated review comment, resolved)
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32652 [ run ] triggered by Bot. Commit:

PR_Github #32652 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32710 [ run ] triggered by Bot. Commit:

PR_Github #32710 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32749 [ run ] triggered by Bot. Commit:

PR_Github #32749 [ run ] completed with state
Force-pushed 1c7e940 to 86396a1
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32777 [ run ] triggered by Bot. Commit:
Force-pushed 86396a1 to c2f5e37
/bot kill

PR_Github #32791 [ kill ] triggered by Bot. Commit:

PR_Github #32791 [ kill ] completed with state
Force-pushed c2f5e37 to caa90fa
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32793 [ run ] triggered by Bot. Commit:
gpu:
- '*b200*'
linux_distribution_name: ubuntu*
cpu: x86_64
@lucaslie please take a look
Do we absolutely need 8x B200?
- If not, please add the test to the 4x B200 stage.
- If yes, please combine all B200 multi-GPU tests into a single 8x B200 stage.
This is because stages have overhead, and the CI team asked me to ensure we don't unnecessarily add too many stages; see here.
Also, if you do add a new stage:
- please follow the steps in my PR: [None][infra] separate AutoDeploy tests into own stages #10634
- update our dev guide and my pinned slack message to correctly reference the new AutoDeploy test stages and their names
Didn't know it's such a big overhead. I'll stick with the world_size=4 tests and remove this stage. Same for dgx_h100.
PR_Github #33556 [ run ] triggered by Bot. Commit:

PR_Github #33556 [ run ] completed with state
Force-pushed 96b3ee5 to 009e418
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_B200-8_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-8_GPUs-AutoDeploy-1"

PR_Github #33588 [ run ] triggered by Bot. Commit:

PR_Github #33588 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_B200-8_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-8_GPUs-AutoDeploy-1"

PR_Github #33599 [ run ] triggered by Bot. Commit:

PR_Github #33599 [ run ] completed with state

/bot run

PR_Github #33604 [ run ] triggered by Bot. Commit:
stage: post_load_fusion
enabled: true
backend: trtllm
allow_different_input_scales: false
why wouldn't we just allow this by default if this is the default behavior?
@lucaslie Just being cautious; we can enable it anytime.
We should be getting a checkpoint with all identical scales at some point. I don't know how common it is to have different scales (allegedly not at all).
https://nvidia.slack.com/archives/C09D47NAYNR/p1767902427747889
enabled: true
backend: trtllm
allow_different_input_scales: false
fuse_nvfp4_moe:
did you also implement this for nvfp4? If not, can you either add it here or add a ticket to our backlog?
@lucaslie nvfp4 is using per-block dynamic quantization on the inputs, so scale computation is part of the kernel. This is only relevant for fp8.
Correction. This seems to be an issue for nvfp4 too.
@tcherckez-nvidia to handle as part of nvfp4 super enablement.
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (outdated review comment, resolved)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (outdated review comment, resolved)
PR_Github #33604 [ run ] completed with state
…n FP8 MoE

The FP8 MoE kernel requires a scalar input scale, but models may have different input scales per expert. Previously, the AutoDeploy code used the first expert's scale (input_scale[0]), which could cause accuracy issues when scales differ significantly.

Changes:
- Use max(input_scale) for FC1 and FC2 input quantization, matching the TRT-LLM manual backend
- Precompute max input scales at transform time for both the trtllm and triton backends
- Add an allow_different_input_scales config option to FuseFP8MoeConfig:
  - False (default): assert all experts have identical scales, fail if not
  - True: allow different scales with a warning, use max() for quantization
- Update kernel signatures to take precomputed scalar scales instead of tensor scales
- Add a unit test for the new config option

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Force-pushed 2d40ccb to c566908
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #33619 [ run ] triggered by Bot. Commit:

PR_Github #33619 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

PR_Github #33628 [ run ] triggered by Bot. Commit:
Description
The FP8 MoE kernel requires a scalar input scale, but models may have different input scales per expert. Previously, the AutoDeploy code used the first expert's scale (input_scale[0]), which could cause accuracy issues when scales differ significantly.
Changes:
Test Coverage
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py::test_fp8_moe_different_input_scales
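To illustrate the description above, here is a minimal sketch (not the PR's actual code; the function name is hypothetical) of quantizing activations with the max per-expert input scale instead of input_scale[0]:

```python
import torch


def quantize_fp8_with_max_scale(x: torch.Tensor, input_scale: torch.Tensor) -> torch.Tensor:
    """Quantize activations to FP8 using the max of the per-expert input scales.

    input_scale has one entry per expert; taking its max (rather than
    input_scale[0]) avoids clipping experts whose activations need a wider range.
    """
    scale = input_scale.max()  # in the PR this max is precomputed at transform time
    x_q = (x / scale).clamp(-448.0, 448.0)  # 448 is the max finite value of FP8 E4M3
    return x_q.to(torch.float8_e4m3fn)
```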
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

`--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.

`--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

`--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Tests