Add Quantizers for Qwen3VLMoeTextDecoderLayer #666
base: main
Conversation
Force-pushed 2c18fd0 to 7a16307 (Compare)
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #666   +/-   ##
=======================================
  Coverage   74.22%   74.22%
=======================================
  Files         192      192
  Lines       19035    19035
=======================================
  Hits        14129    14129
  Misses       4906     4906

☔ View full report in Codecov by Sentry.
shengliangxu left a comment
LGTM
Force-pushed f5c78c9 to ff666d5 (Compare)
Signed-off-by: Qidong Su <[email protected]>
Force-pushed ff666d5 to 37c24f4 (Compare)
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.

📝 Walkthrough
A new quantization wrapper class is introduced for the Qwen3VL MoE text experts module so that its expert projections can be quantized.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 630-640: expert_idx is a 1-element tensor and is being used to
index nn.ModuleList (self.gate_proj/self.up_proj/self.down_proj) and
routing_weights, which raises a TypeError; convert expert_idx to a Python int
(e.g., expert_idx_int = int(expert_idx.item() or expert_idx[0].item()) ) before
using it to index the ModuleList entries and use that int for routing_weights
indexing (routing_weights[token_idx, expert_idx_int, None]) while keeping
token_idx as a tensor for per-token selection, then proceed with
gate_proj/up_proj/down_proj using the integer index and next_states.index_add_
as before.
- Around line 574-616: In _QuantQwen3VLMoeTextExperts._setup the code references
a non-existent attribute self.expert_dim when slicing self.gate_up_proj, which
will raise AttributeError; update the slices to use the correct per-expert size
attribute self.moe_intermediate_size (i.e. replace uses of self.expert_dim in
the gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight
calls for gate_proj and up_proj use the proper slice, leaving down_proj and
attribute reassignment (gate_up_proj → gate_proj, down_proj → down_proj)
unchanged.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
735-755: LGTM! The registration blocks follow the established pattern used throughout this file for conditional module registration. The try/except ImportError guards appropriately handle cases where the transformers version doesn't include the Qwen3VL MoE model.
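For readers unfamiliar with that pattern, a minimal sketch of a guarded registration is shown below. The mapping dict is hypothetical; the real registration goes through the register API in modelopt/torch/quantization/conversion.py rather than this simplified dict.

    # Sketch only: guard registration so older transformers versions that do not
    # ship the Qwen3VL MoE model don't break the import of this plugin module.
    try:
        from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import (
            Qwen3VLMoeTextExperts,
        )

        # Hypothetical mapping from the original HF module to its quantized wrapper.
        QUANT_MODULE_MAPPING = {Qwen3VLMoeTextExperts: _QuantQwen3VLMoeTextExperts}
    except ImportError:
        # transformers is too old to include Qwen3VL MoE; skip registration silently.
        pass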
class _QuantQwen3VLMoeTextExperts(QuantModule):
    def _setup(self):
        """Modify the Qwen3VLMoeTextExperts by using nn.Linear layers."""
        from accelerate import init_empty_weights

        dtype, device = self.gate_up_proj.dtype, self.gate_up_proj.device

        def _copy_weight(module, weight):
            module.to_empty(device=device)
            with torch.no_grad():
                module.weight.data = weight.detach().data.to(dtype=dtype, device=device)

        with init_empty_weights():
            gate_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            up_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            down_proj = nn.ModuleList(
                [
                    nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )

        for idx in range(self.num_experts):
            _copy_weight(gate_proj[idx], self.gate_up_proj[idx, :, : self.expert_dim].T)
            _copy_weight(up_proj[idx], self.gate_up_proj[idx, :, self.expert_dim :].T)
            _copy_weight(down_proj[idx], self.down_proj[idx, :].T)

        delattr(self, "gate_up_proj")
        delattr(self, "down_proj")
        self.gate_proj = gate_proj
        self.up_proj = up_proj
        self.down_proj = down_proj
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First check if transformers is available and inspect Qwen3VLMoeTextExperts
python3 << 'EOF'
import sys
try:
    # Try to import the class
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts
    import inspect
    print("=== Qwen3VLMoeTextExperts Class Found ===")
    # Get source code
    try:
        source = inspect.getsource(Qwen3VLMoeTextExperts)
        # Check for expert_dim
        if 'expert_dim' in source:
            print("✓ expert_dim FOUND in class source")
        else:
            print("✗ expert_dim NOT found in class source")
        # Check for intermediate_size
        if 'intermediate_size' in source:
            print("✓ intermediate_size found in class source")
        # Check __init__ signature
        init_sig = inspect.signature(Qwen3VLMoeTextExperts.__init__)
        print(f"__init__ parameters: {list(init_sig.parameters.keys())}")
    except Exception as e:
        print(f"Could not get source: {e}")
except ImportError as e:
    print(f"✗ Could not import Qwen3VLMoeTextExperts: {e}")
    print("\nTrying alternative import paths...")
    try:
        from transformers import AutoModel
        print("AutoModel available, but Qwen3VLMoeTextExperts not directly importable")
    except:
        print("Transformers not fully available")
EOF

Repository: NVIDIA/Model-Optimizer
Length of output: 152
🏁 Script executed:
#!/bin/bash
# Also search the reviewed file for any expert_dim references
echo "=== Searching reviewed file for expert_dim references ==="
if [ -f "modelopt/torch/quantization/plugins/huggingface.py" ]; then
    rg "expert_dim" modelopt/torch/quantization/plugins/huggingface.py -n -C 2
else
    echo "File not found at expected location"
fi

Repository: NVIDIA/Model-Optimizer
Length of output: 436
🌐 Web query:
Qwen3VLMoeTextExperts expert_dim attribute huggingface transformers
💡 Result:
Short answer: Hugging Face's Qwen3/VL-MoE configs do not expose an attribute named expert_dim. For MoE in transformers you should use:
- num_experts — number of experts. [1][2]
- num_experts_per_tok (top‑k) — number of experts routed per token. [1][2]
- moe_intermediate_size — the hidden (intermediate) size used inside each routed expert (what some codebases call an expert hidden-dimension). [1][3]
Evidence:
- Qwen3MoE / Qwen3VLMoeConfig docs list num_experts, num_experts_per_tok and moe_intermediate_size but do not define expert_dim. [1][2][3]
- Other libraries sometimes use the name expert_dim (e.g., MindSpore examples) to mean either the number of experts or the per‑expert dimension depending on their API — so naming differs between frameworks. [4][5]
If you expected an expert_dim field in transformers, use moe_intermediate_size for per-expert hidden size and num_experts for expert count. References: Hugging Face transformers Qwen3/Qwen3-VL-MoE docs and MindSpore MoE docs. [1][2][3][4]
Sources
[1] Hugging Face — Qwen3MoE config docs.
[2] Hugging Face — Qwen3-VL-Moe config docs.
[3] Hugging Face (alternate version) — Qwen3-VL-Moe parameters.
[4] MindSpore MoE implementation (expert_dim usage).
[5] MindSpore MoE (other release).
Remove undefined expert_dim attribute reference — this will cause AttributeError at runtime.
The expert_dim attribute does not exist on HuggingFace's Qwen3VLMoeTextExperts class. Lines 607–608 access self.expert_dim, which will fail at runtime. The correct attribute for the per-expert intermediate size is moe_intermediate_size. Replace:
    self.gate_up_proj[idx, :, : self.expert_dim].T
    self.gate_up_proj[idx, :, self.expert_dim :].T

with:

    self.gate_up_proj[idx, :, : self.moe_intermediate_size].T
    self.gate_up_proj[idx, :, self.moe_intermediate_size :].T

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 574 - 616,
In _QuantQwen3VLMoeTextExperts._setup the code references a non-existent
attribute self.expert_dim when slicing self.gate_up_proj, which will raise
AttributeError; update the slices to use the correct per-expert size attribute
self.moe_intermediate_size (i.e. replace uses of self.expert_dim in the
gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight calls
for gate_proj and up_proj use the proper slice, leaving down_proj and attribute
reassignment (gate_up_proj → gate_proj, down_proj → down_proj) unchanged.
for expert_idx in expert_hit[:]:
    assert expert_idx.numel() == 1, expert_idx
    with torch.no_grad():
        _, token_idx = torch.where(expert_mask[expert_idx[0]])
    current_state = hidden_states[token_idx]
    gate = self.gate_proj[expert_idx](current_state)
    up = self.up_proj[expert_idx](current_state)
    gated_output = up * self.act_fn(gate)
    out = self.down_proj[expert_idx](gated_output)
    weighted_output = out * routing_weights[token_idx, expert_idx, None]
    next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
Bug: Using tensor to index nn.ModuleList will fail.
When iterating over expert_hit (from nonzero()), each expert_idx is a 1D tensor of shape (1,), not a Python integer. nn.ModuleList.__getitem__ expects an integer index, so self.gate_proj[expert_idx] will raise a TypeError.
Convert expert_idx to an integer before indexing:
🐛 Proposed fix
-        for expert_idx in expert_hit[:]:
-            assert expert_idx.numel() == 1, expert_idx
+        for expert_idx in expert_hit:
+            expert_idx = expert_idx.item()
             with torch.no_grad():
-                _, token_idx = torch.where(expert_mask[expert_idx[0]])
+                _, token_idx = torch.where(expert_mask[expert_idx])
             current_state = hidden_states[token_idx]
             gate = self.gate_proj[expert_idx](current_state)
             up = self.up_proj[expert_idx](current_state)
             gated_output = up * self.act_fn(gate)
             out = self.down_proj[expert_idx](gated_output)
-            weighted_output = out * routing_weights[token_idx, expert_idx, None]
+            weighted_output = out * routing_weights[token_idx, expert_idx:expert_idx+1, None]
             next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

Note: The routing_weights[token_idx, expert_idx, None] indexing at line 639 may also need adjustment after converting to int, depending on the expected tensor shape.
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 630 - 640,
expert_idx is a 1-element tensor and is being used to index nn.ModuleList
(self.gate_proj/self.up_proj/self.down_proj) and routing_weights, which raises a
TypeError; convert expert_idx to a Python int (e.g., expert_idx_int =
int(expert_idx.item() or expert_idx[0].item()) ) before using it to index the
ModuleList entries and use that int for routing_weights indexing
(routing_weights[token_idx, expert_idx_int, None]) while keeping token_idx as a
tensor for per-token selection, then proceed with gate_proj/up_proj/down_proj
using the integer index and next_states.index_add_ as before.
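To see the explicit-int pattern in isolation, here is a toy, self-contained sketch. Shapes and values are placeholders, and whether a one-element tensor index is accepted by nn.ModuleList can vary across PyTorch versions, so converting via .item() keeps the intent unambiguous either way.

    import torch
    import torch.nn as nn

    # Toy sketch of indexing per-expert layers with a plain Python int.
    # All shapes and values below are placeholders, not project code.
    num_experts, hidden, tokens = 2, 4, 3
    experts = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)])
    routing_weights = torch.rand(tokens, num_experts)
    hidden_states = torch.randn(tokens, hidden)

    expert_idx = torch.tensor([1])    # what nonzero() yields per loop iteration
    expert_idx = expert_idx.item()    # convert the 1-element tensor to a Python int

    token_idx = torch.tensor([0, 2])  # tokens routed to this expert
    out = experts[expert_idx](hidden_states[token_idx])
    weighted = out * routing_weights[token_idx, expert_idx, None]  # broadcast [2, 1] over [2, 4]
    print(weighted.shape)  # torch.Size([2, 4])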
Edwardf0t1 left a comment
LGTM, left a few minor comments.
Co-authored-by: Zhiyu <[email protected]>
Signed-off-by: Qidong Su <[email protected]>
Victor49152 left a comment
The new attribute-naming change has been verified; should be good to go.
What does this PR do?
Type of change: New feature
Overview: The Hugging Face transformers library implements the Qwen3VL MoE layer as a monolithic module rather than assembling it from Linear layers, so ModelOpt's quantizer currently cannot recognize it. This PR introduces a conversion from HF's qwen3vl_moe MoE layers to qwen3_moe-style MoE layers built from a set of Linear layers (a rough usage sketch follows).
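For orientation, a rough sketch of the end-to-end flow this enables is shown below; the checkpoint name, calibration loop, and quantization config are placeholders and are not taken from this PR's test setup.

    # Rough sketch only: placeholder checkpoint, calibration loop, and config.
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForImageTextToText

    model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-MoE-placeholder")

    def forward_loop(model):
        # Run a small calibration dataset through the model here.
        ...

    # With this PR, the fused Qwen3VLMoeTextExperts module is rewritten into
    # per-expert nn.Linear layers during conversion, so the quantizer can insert
    # input/weight quantizers for each expert projection.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)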
Testing
Tested with
Before your PR is "Ready for review"
Additional Information