[#10780][feat] AutoDeploy: Support per-expert scales in FP8 MoE #10814
Conversation
📝 Walkthrough
This PR introduces support for handling different per-expert input scales in FP8 MoE fusion.
Sequence Diagram(s)
sequenceDiagram
participant Config as Config (allow_different_input_scales flag)
participant Transform as FuseFP8Moe._apply
participant StackWeights as _stack_fp8_moe_weights
participant ComputeScales as Compute Max Scales
participant CustomOp as Custom Operator<br/>(trtllm_quant_fp8_moe_fused)
Config->>Transform: Load config with allow_different_input_scales
Transform->>StackWeights: Call with flag
StackWeights->>ComputeScales: Check expert input scales
alt allow_different_input_scales enabled
ComputeScales->>ComputeScales: Find max scale across experts
ComputeScales->>StackWeights: Return max scales
StackWeights->>StackWeights: Log warning (scales differ)
else allow_different_input_scales disabled
ComputeScales->>ComputeScales: Assert all scales identical
ComputeScales->>StackWeights: Return scales
end
StackWeights->>CustomOp: Pass fc1_act_scale_max<br/>(precomputed max)
CustomOp->>CustomOp: Quantize with max scale
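To make the branch above concrete, here is a minimal Python sketch of the scale-collapsing step; the helper name and signature are hypothetical, not the PR's actual function:

```python
import torch


def collapse_fp8_input_scales(
    per_expert_scales: torch.Tensor, allow_different_input_scales: bool
) -> torch.Tensor:
    """Collapse per-expert FP8 input scales (shape [num_experts]) into one scalar.

    Mirrors the branch in the diagram: either require identical scales,
    or warn and fall back to the max scale across experts.
    """
    max_scale = per_expert_scales.max()
    if torch.all(per_expert_scales == per_expert_scales[0]):
        return max_scale  # all experts already share one scale
    if not allow_different_input_scales:
        raise AssertionError("FP8 MoE experts have different input scales")
    # Scales differ: warn and use the max so no expert's activations overflow.
    print("warning: per-expert FP8 input scales differ; using max scale")
    return max_scale
```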
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~20 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 1
In `tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py`, around lines 1143-1144: fix the assertion message in the `torch.allclose(output, ref_output, rtol=0.05, atol=0.05)` check. Update the f-string so the message includes a space (or newline) before "Max diff:" and correctly reflects the rtol/atol values used; adjust the text to something like "rtol=0.05, atol=0.05 Max diff: {(output - ref_output).abs().max()}" to avoid concatenation and a value mismatch.
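For illustration, the corrected assertion might look like this (a sketch following the snippet quoted above, not necessarily the test's exact code):

```python
assert torch.allclose(output, ref_output, rtol=0.05, atol=0.05), (
    f"rtol=0.05, atol=0.05 Max diff: {(output - ref_output).abs().max()}"
)
```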
🧹 Nitpick comments (1)
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py (1)

1127-1134: Consider using attribute access instead of getattr with constant strings. The static analysis hint is valid but minor. Using direct attribute access is cleaner when the attribute name is known at write-time.

♻️ Suggested improvement

```diff
 if backend == "trtllm":
-    actual_w1_input_max = getattr(gm, "quant_moe_fc1_act_scale_max_0")
+    actual_w1_input_max = gm.quant_moe_fc1_act_scale_max_0
 else:
-    actual_w1_input_max = getattr(gm, "quant_moe_w1_input_scale_max_0").squeeze()
+    actual_w1_input_max = gm.quant_moe_w1_input_scale_max_0.squeeze()
```
tests/unittest/_torch/auto_deploy/unit/singlegpu/custom_ops/test_trtllm_moe.py (outdated review comment, resolved)
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32652 [ run ] triggered by Bot. Commit:

PR_Github #32652 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32710 [ run ] triggered by Bot. Commit:

PR_Github #32710 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32749 [ run ] triggered by Bot. Commit:

PR_Github #32749 [ run ] completed with state
Force-pushed 1c7e940 to 86396a1
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32777 [ run ] triggered by Bot. Commit:
Force-pushed 86396a1 to c2f5e37
/bot kill

PR_Github #32791 [ kill ] triggered by Bot. Commit:

PR_Github #32791 [ kill ] completed with state
Force-pushed c2f5e37 to caa90fa
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #32793 [ run ] triggered by Bot. Commit:
gpu:
- '*b200*'
linux_distribution_name: ubuntu*
cpu: x86_64
@lucaslie please take a look
Do we absolutely need 8x B200?
- If not, please add the test to the 4x B200 stage.
- If yes, please combine all B200 multi-GPU tests into a single 8x B200 stage.
This is because stages have overhead, and the CI team asked me to ensure we don't unnecessarily add too many stages; see here.
Also, if you do add a new stage:
- please follow the steps in my PR: [None][infra] separate AutoDeploy tests into own stages #10634
- update our dev guide and my pinned slack message to correctly reference the new AutoDeploy test stages and their names
Didn't know it's such a big overhead. I'll stick with the world_size=4 tests and remove this stage. Same for dgx_h100.
PR_Github #33556 [ run ] triggered by Bot. Commit:

PR_Github #33556 [ run ] completed with state
Force-pushed 96b3ee5 to 009e418
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_B200-8_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-8_GPUs-AutoDeploy-1"

PR_Github #33588 [ run ] triggered by Bot. Commit:

PR_Github #33588 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_B200-8_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1, DGX_H100-8_GPUs-AutoDeploy-1"

PR_Github #33599 [ run ] triggered by Bot. Commit:

PR_Github #33599 [ run ] completed with state

/bot run

PR_Github #33604 [ run ] triggered by Bot. Commit:
stage: post_load_fusion
enabled: true
backend: trtllm
allow_different_input_scales: false
why wouldn't we just allow this by default if this is the default behavior?
@lucaslie Just being cautious; we can enable it anytime.
We should be getting a checkpoint with all identical scales at some point. I don't know how common it is to have different scales (allegedly not at all).
https://nvidia.slack.com/archives/C09D47NAYNR/p1767902427747889
enabled: true
backend: trtllm
allow_different_input_scales: false
fuse_nvfp4_moe:
did you also implement this for nvfp4? If not, can you either add it here or add a ticket to our backlog?
@lucaslie nvfp4 is using per-block dynamic quantization on the inputs, so scale computation is part of the kernel. This is only relevant for fp8.
Correction. This seems to be an issue for nvfp4 too.
@tcherckez-nvidia to handle as part of nvfp4 super enablement.
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/triton_moe.py (outdated review comment, resolved)
tensorrt_llm/_torch/auto_deploy/custom_ops/fused_moe/trtllm_moe.py (outdated review comment, resolved)
PR_Github #33604 [ run ] completed with state
…n FP8 MoE

The FP8 MoE kernel requires a scalar input scale, but models may have different input scales per expert. Previously, the AutoDeploy code used the first expert's scale (input_scale[0]), which could cause accuracy issues when scales differ significantly.

Changes:
- Use max(input_scale) for FC1 and FC2 input quantization, matching the TRT-LLM manual backend
- Precompute max input scales at transform time for both the trtllm and triton backends
- Add an allow_different_input_scales config option to FuseFP8MoeConfig:
  - False (default): assert all experts have identical scales, fail if not
  - True: allow different scales with a warning, use max() for quantization
- Update kernel signatures to take precomputed scalar scales instead of tensor scales
- Add a unit test for the new config option

Signed-off-by: Gal Hubara Agam <96368689+galagam@users.noreply.github.com>
Force-pushed 2d40ccb to c566908
/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1"

PR_Github #33619 [ run ] triggered by Bot. Commit:

PR_Github #33619 [ run ] completed with state

/bot run --extra-stage "DGX_B200-4_GPUs-AutoDeploy-1, DGX_H100-4_GPUs-AutoDeploy-1" --disable-fail-fast

PR_Github #33628 [ run ] triggered by Bot. Commit:
Description
The FP8 MoE kernel requires a scalar input scale, but models may have different input scales per expert. Previously, the AutoDeploy code used the first expert's scale (input_scale[0]), which could cause accuracy issues when scales differ significantly.
Changes:
Test Coverage
tests/unittest/_torch/auto_deploy/unit/singlegpu/transformations/library/test_moe_fusion.py::test_fp8_moe_different_input_scales
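To illustrate the description above, here is a minimal sketch (not the PR's actual code; the function name is hypothetical) of quantizing activations with the max per-expert input scale instead of input_scale[0]:

```python
import torch


def quantize_fp8_with_max_scale(x: torch.Tensor, input_scale: torch.Tensor) -> torch.Tensor:
    """Quantize activations to FP8 using the max of the per-expert input scales.

    input_scale has one entry per expert; taking its max (rather than
    input_scale[0]) avoids clipping experts whose activations need a wider range.
    """
    scale = input_scale.max()  # in the PR this max is precomputed at transform time
    x_q = (x / scale).clamp(-448.0, 448.0)  # 448 is the max finite value of FP8 E4M3
    return x_q.to(torch.float8_e4m3fn)
```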
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in the PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
`/bot [-h] ['run', 'kill', 'skip', 'reuse-pipeline'] ...`

Provide a user friendly way for developers to interact with a Jenkins server.

Run `/bot [-h|--help]` to print this help message.

See details below for each supported subcommand.

Details

`run [--reuse-test (optional)pipeline-id --disable-fail-fast --skip-test --stage-list "A10-PyTorch-1, xxx" --gpu-type "A30, H100_PCIe" --test-backend "pytorch, cpp" --add-multi-gpu-test --only-multi-gpu-test --disable-multi-gpu-test --post-merge --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx" --detailed-log --debug(experimental)]`

Launch build/test pipelines. All previously running jobs will be killed.

`--reuse-test (optional)pipeline-id` (OPTIONAL) : Allow the new pipeline to reuse build artifacts and skip successful test stages from a specified pipeline or the last pipeline if no pipeline-id is indicated. If the Git commit ID has changed, this option will be always ignored. The DEFAULT behavior of the bot is to reuse build artifacts and successful test results from the last pipeline.

`--disable-reuse-test` (OPTIONAL) : Explicitly prevent the pipeline from reusing build artifacts and skipping successful test stages from a previous pipeline. Ensure that all builds and tests are run regardless of previous successes.

`--disable-fail-fast` (OPTIONAL) : Disable fail fast on build/tests/infra failures.

`--skip-test` (OPTIONAL) : Skip all test stages, but still run build stages, package stages and sanity check stages. Note: Does NOT update GitHub check status.

`--stage-list "A10-PyTorch-1, xxx"` (OPTIONAL) : Only run the specified test stages. Examples: "A10-PyTorch-1, xxx". Note: Does NOT update GitHub check status.

`--gpu-type "A30, H100_PCIe"` (OPTIONAL) : Only run the test stages on the specified GPU types. Examples: "A30, H100_PCIe". Note: Does NOT update GitHub check status.

`--test-backend "pytorch, cpp"` (OPTIONAL) : Skip test stages which don't match the specified backends. Only support [pytorch, cpp, tensorrt, triton]. Examples: "pytorch, cpp" (does not run test stages with tensorrt or triton backend). Note: Does NOT update GitHub pipeline status.

`--only-multi-gpu-test` (OPTIONAL) : Only run the multi-GPU tests. Note: Does NOT update GitHub check status.

`--disable-multi-gpu-test` (OPTIONAL) : Disable the multi-GPU tests. Note: Does NOT update GitHub check status.

`--add-multi-gpu-test` (OPTIONAL) : Force run the multi-GPU tests in addition to running L0 pre-merge pipeline.

`--post-merge` (OPTIONAL) : Run the L0 post-merge pipeline instead of the ordinary L0 pre-merge pipeline.

`--extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx"` (OPTIONAL) : Run the ordinary L0 pre-merge pipeline and specified test stages. Examples: --extra-stage "H100_PCIe-TensorRT-Post-Merge-1, xxx".

`--detailed-log` (OPTIONAL) : Enable flushing out all logs to the Jenkins console. This will significantly increase the log volume and may slow down the job.

`--debug` (OPTIONAL) : Experimental feature. Enable access to the CI container for debugging purpose. Note: Specify exactly one stage in the `stage-list` parameter to access the appropriate container environment. Note: Does NOT update GitHub check status.

For guidance on mapping tests to stage names, see `docs/source/reference/ci-overview.md` and the `scripts/test_to_stage_mapping.py` helper.

kill

`kill`

Kill all running builds associated with pull request.

skip

`skip --comment COMMENT`

Skip testing for latest commit on pull request. `--comment "Reason for skipping build/test"` is required. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.

reuse-pipeline

`reuse-pipeline`

Reuse a previous pipeline to validate current commit. This action will also kill all currently running builds associated with the pull request. IMPORTANT NOTE: This is dangerous since lack of user care and validation can cause top of tree to break.
Summary by CodeRabbit
New Features
Tests