
Conversation


@soodoshll soodoshll commented Dec 8, 2025

What does this PR do?

Type of change: New feature

Overview: The Hugging Face transformers library implements the Qwen3VL MoE layer as a monolithic module rather than assembling it from Linear layers, so it currently cannot be recognized by modelopt's quantizer. This PR introduces a conversion from HF's qwen3vl_moe MoE layers to qwen3_moe-style MoE layers, which consist of a set of Linear layers.
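At a high level, the conversion unstacks the fused per-expert weight tensors into plain nn.Linear modules that the quantizer can hook. A minimal sketch of the idea (the function name and tensor layouts below are illustrative assumptions, not the PR's actual code, which lives in modelopt/torch/quantization/plugins/huggingface.py):

import torch
import torch.nn as nn

def split_fused_experts(gate_up_proj: torch.Tensor, down_proj: torch.Tensor):
    # Assumed layouts: gate_up_proj is [num_experts, hidden, 2 * intermediate],
    # down_proj is [num_experts, intermediate, hidden].
    num_experts, hidden_size, fused = gate_up_proj.shape
    intermediate_size = fused // 2
    gate, up, down = nn.ModuleList(), nn.ModuleList(), nn.ModuleList()
    for idx in range(num_experts):
        g = nn.Linear(hidden_size, intermediate_size, bias=False)
        u = nn.Linear(hidden_size, intermediate_size, bias=False)
        d = nn.Linear(intermediate_size, hidden_size, bias=False)
        # nn.Linear stores weights as [out_features, in_features], hence the transpose.
        g.weight.data = gate_up_proj[idx, :, :intermediate_size].T.contiguous()
        u.weight.data = gate_up_proj[idx, :, intermediate_size:].T.contiguous()
        d.weight.data = down_proj[idx].T.contiguous()
        gate.append(g)
        up.append(u)
        down.append(d)
    return gate, up, down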

Testing

Tested with

python hf_ptq.py --pyt_ckpt_path=Qwen/Qwen3-VL-30B-A3B-Instruct --qformat=nvfp4 --dataset wikipedia

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes/No
  • Did you write any new necessary tests?: Yes/No
  • Did you add or update any necessary documentation?: Yes/No
  • Did you update Changelog?: Yes/No

Additional Information

Summary by CodeRabbit

Release Notes

  • New Features
    • Added quantization support for Qwen3VL models with sparse mixture-of-experts (MoE) architecture, enabling efficient model compression for this model type.

✏️ Tip: You can customize this high-level summary in your review settings.

@soodoshll soodoshll requested a review from a team as a code owner December 8, 2025 19:32
@soodoshll soodoshll requested a review from ajrasane December 8, 2025 19:32

copy-pr-bot bot commented Dec 8, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@shengliangxu shengliangxu self-requested a review December 8, 2025 19:33
@soodoshll soodoshll requested review from a team as code owners December 8, 2025 19:58
@soodoshll soodoshll requested a review from ynankani December 8, 2025 19:58
@shengliangxu shengliangxu removed the request for review from Edwardf0t1 December 10, 2025 02:00

codecov bot commented Dec 10, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 74.22%. Comparing base (38fb120) to head (b4919e0).

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #666   +/-   ##
=======================================
  Coverage   74.22%   74.22%           
=======================================
  Files         192      192           
  Lines       19035    19035           
=======================================
  Hits        14129    14129           
  Misses       4906     4906           

☔ View full report in Codecov by Sentry.


@shengliangxu shengliangxu left a comment


LGTM

@soodoshll soodoshll force-pushed the qwen3-vl-moe branch 3 times, most recently from f5c78c9 to ff666d5 on December 18, 2025 20:22
Signed-off-by: Qidong Su <[email protected]>

coderabbitai bot commented Jan 16, 2026

Important

Review skipped

Auto incremental reviews are disabled on this repository.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

A new quantization wrapper class _QuantQwen3VLMoeTextExperts is introduced to handle Qwen3 VL MoE text experts by replacing weight attributes with nn.Linear-based modules. The class includes forward pass logic for expert selection and gating, with duplicated registration blocks mapping to the QuantModuleRegistry.
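As a rough illustration of that routing flow, here is a self-contained sketch (the tensor shapes, SiLU activation, and helper names are assumptions, not the PR's exact code):

import torch
import torch.nn as nn
import torch.nn.functional as F

def route_through_experts(hidden_states, routing_weights, selected_experts,
                          gate_proj, up_proj, down_proj):
    # hidden_states: [num_tokens, hidden]; routing_weights / selected_experts:
    # [num_tokens, top_k]; gate_proj / up_proj / down_proj: nn.ModuleList with
    # one nn.Linear per expert.
    num_experts = len(gate_proj)
    next_states = torch.zeros_like(hidden_states)
    # expert_mask[e, k, t] == 1 if token t routes its k-th slot to expert e.
    expert_mask = F.one_hot(selected_experts, num_classes=num_experts).permute(2, 1, 0)
    for expert_idx in range(num_experts):
        slot_idx, token_idx = torch.where(expert_mask[expert_idx])
        if token_idx.numel() == 0:
            continue
        current = hidden_states[token_idx]
        gated = up_proj[expert_idx](current) * F.silu(gate_proj[expert_idx](current))
        out = down_proj[expert_idx](gated)
        weighted = out * routing_weights[token_idx, slot_idx, None]
        next_states.index_add_(0, token_idx, weighted.to(hidden_states.dtype))
    return next_states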

Changes

Cohort: Qwen3 VL MoE Quantization Module
File(s): modelopt/torch/quantization/plugins/huggingface.py
Summary: Added the _QuantQwen3VLMoeTextExperts wrapper class, converting the fused gate_up_proj and down_proj weights into nn.Linear-based gate_proj, up_proj, and down_proj modules. Implements the forward pass for expert selection with routing-based gating and weighted output accumulation. Added import-and-register logic with duplicated registration blocks for Qwen3VLMoeTextSparseMoeBlock and Qwen3VLMoeTextExperts to ensure consistent module registration regardless of import order.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1
❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 20.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ Passed — The title 'Add Quantizers for Qwen3VLMoeTextDecoderLayer' is specific and directly related to the main change: adding quantization support for Qwen3VL MoE layers by introducing wrapper classes and registration logic.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.



Comment @coderabbitai help to get the list of available commands and usage tips.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 630-640: expert_idx is a 1-element tensor and is being used to
index nn.ModuleList (self.gate_proj/self.up_proj/self.down_proj) and
routing_weights, which raises a TypeError; convert expert_idx to a Python int
(e.g., expert_idx_int = int(expert_idx.item() or expert_idx[0].item()) ) before
using it to index the ModuleList entries and use that int for routing_weights
indexing (routing_weights[token_idx, expert_idx_int, None]) while keeping
token_idx as a tensor for per-token selection, then proceed with
gate_proj/up_proj/down_proj using the integer index and next_states.index_add_
as before.
- Around line 574-616: In _QuantQwen3VLMoeTextExperts._setup the code references
a non-existent attribute self.expert_dim when slicing self.gate_up_proj, which
will raise AttributeError; update the slices to use the correct per-expert size
attribute self.moe_intermediate_size (i.e. replace uses of self.expert_dim in
the gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight
calls for gate_proj and up_proj use the proper slice, leaving down_proj and
attribute reassignment (gate_up_proj → gate_proj, down_proj → down_proj)
unchanged.
📜 Review details

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e6e4efd and 41efac8.

📒 Files selected for processing (1)
  • modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
  • register (330-371)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)

735-755: LGTM!

The registration blocks follow the established pattern used throughout this file for conditional module registration. The try/except ImportError guards appropriately handle cases where the transformers version doesn't include the Qwen3VL MoE model.
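For reference, that guarded pattern looks roughly like the following sketch (the import path comes from this conversation; the exact QuantModuleRegistry.register call shape is assumed and may differ from the file's actual code):

try:
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts

    # Map the HF module to its quantized wrapper; skipped entirely when the
    # installed transformers version does not ship the Qwen3VL MoE model.
    QuantModuleRegistry.register({Qwen3VLMoeTextExperts: "hf.Qwen3VLMoeTextExperts"})(
        _QuantQwen3VLMoeTextExperts
    )
except ImportError:
    pass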

✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.

Comment on lines 574 to 616
class _QuantQwen3VLMoeTextExperts(QuantModule):
    def _setup(self):
        """Modify the Qwen3VLMoeTextExperts by using nn.Linear layers."""
        from accelerate import init_empty_weights

        dtype, device = self.gate_up_proj.dtype, self.gate_up_proj.device

        def _copy_weight(module, weight):
            module.to_empty(device=device)
            with torch.no_grad():
                module.weight.data = weight.detach().data.to(dtype=dtype, device=device)

        with init_empty_weights():
            gate_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            up_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            down_proj = nn.ModuleList(
                [
                    nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )

        for idx in range(self.num_experts):
            _copy_weight(gate_proj[idx], self.gate_up_proj[idx, :, : self.expert_dim].T)
            _copy_weight(up_proj[idx], self.gate_up_proj[idx, :, self.expert_dim :].T)
            _copy_weight(down_proj[idx], self.down_proj[idx, :].T)

        delattr(self, "gate_up_proj")
        delattr(self, "down_proj")
        self.gate_proj = gate_proj
        self.up_proj = up_proj
        self.down_proj = down_proj


⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# First check if transformers is available and inspect Qwen3VLMoeTextExperts
python3 << 'EOF'
import sys
try:
    # Try to import the class
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts
    import inspect
    
    print("=== Qwen3VLMoeTextExperts Class Found ===")
    
    # Get source code
    try:
        source = inspect.getsource(Qwen3VLMoeTextExperts)
        # Check for expert_dim
        if 'expert_dim' in source:
            print("✓ expert_dim FOUND in class source")
        else:
            print("✗ expert_dim NOT found in class source")
        
        # Check for intermediate_size
        if 'intermediate_size' in source:
            print("✓ intermediate_size found in class source")
        
        # Check __init__ signature
        init_sig = inspect.signature(Qwen3VLMoeTextExperts.__init__)
        print(f"__init__ parameters: {list(init_sig.parameters.keys())}")
        
    except Exception as e:
        print(f"Could not get source: {e}")
        
except ImportError as e:
    print(f"✗ Could not import Qwen3VLMoeTextExperts: {e}")
    print("\nTrying alternative import paths...")
    
    try:
        from transformers import AutoModel
        print("AutoModel available, but Qwen3VLMoeTextExperts not directly importable")
    except:
        print("Transformers not fully available")
EOF

Repository: NVIDIA/Model-Optimizer

Length of output: 152


🏁 Script executed:

#!/bin/bash
# Also search the reviewed file for any expert_dim references
echo "=== Searching reviewed file for expert_dim references ==="
if [ -f "modelopt/torch/quantization/plugins/huggingface.py" ]; then
    rg "expert_dim" modelopt/torch/quantization/plugins/huggingface.py -n -C 2
else
    echo "File not found at expected location"
fi

Repository: NVIDIA/Model-Optimizer

Length of output: 436


🌐 Web query:

Qwen3VLMoeTextExperts expert_dim attribute huggingface transformers

💡 Result:

Short answer: Hugging Face's Qwen3/VL-MoE configs do not expose an attribute named expert_dim. For MoE in transformers you should use:

  • num_experts — number of experts. [1][2]
  • num_experts_per_tok (top‑k) — number of experts routed per token. [1][2]
  • moe_intermediate_size — the hidden (intermediate) size used inside each routed expert (what some codebases call an expert hidden-dimension). [1][3]

Evidence:

  • Qwen3MoE / Qwen3VLMoeConfig docs list num_experts, num_experts_per_tok and moe_intermediate_size but do not define expert_dim. [1][2][3]
  • Other libraries sometimes use the name expert_dim (e.g., MindSpore examples) to mean either the number of experts or the per‑expert dimension depending on their API — so naming differs between frameworks. [4][5]

If you expected an expert_dim field in transformers, use moe_intermediate_size for per-expert hidden size and num_experts for expert count. References: Hugging Face transformers Qwen3/Qwen3-VL-MoE docs and MindSpore MoE docs. [1][2][3][4]

Sources
[1] Hugging Face — Qwen3MoE config docs.
[2] Hugging Face — Qwen3-VL-Moe config docs.
[3] Hugging Face (alternate version) — Qwen3-VL-Moe parameters.
[4] MindSpore MoE implementation (expert_dim usage).
[5] MindSpore MoE (other release).


Remove undefined expert_dim attribute reference — this will cause AttributeError at runtime.

The expert_dim attribute does not exist on HuggingFace's Qwen3VLMoeTextExperts class. Lines 607–608 access self.expert_dim, which will fail at runtime. The correct attribute for the per-expert intermediate size is moe_intermediate_size. Replace:

self.gate_up_proj[idx, :, : self.expert_dim].T
self.gate_up_proj[idx, :, self.expert_dim :].T

with:

self.gate_up_proj[idx, :, : self.moe_intermediate_size].T
self.gate_up_proj[idx, :, self.moe_intermediate_size :].T
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 574 - 616,
In _QuantQwen3VLMoeTextExperts._setup the code references a non-existent
attribute self.expert_dim when slicing self.gate_up_proj, which will raise
AttributeError; update the slices to use the correct per-expert size attribute
self.moe_intermediate_size (i.e. replace uses of self.expert_dim in the
gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight calls
for gate_proj and up_proj use the proper slice, leaving down_proj and attribute
reassignment (gate_up_proj → gate_proj, down_proj → down_proj) unchanged.

Comment on lines 630 to 640
for expert_idx in expert_hit[:]:
    assert expert_idx.numel() == 1, expert_idx
    with torch.no_grad():
        _, token_idx = torch.where(expert_mask[expert_idx[0]])
    current_state = hidden_states[token_idx]
    gate = self.gate_proj[expert_idx](current_state)
    up = self.up_proj[expert_idx](current_state)
    gated_output = up * self.act_fn(gate)
    out = self.down_proj[expert_idx](gated_output)
    weighted_output = out * routing_weights[token_idx, expert_idx, None]
    next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

⚠️ Potential issue | 🔴 Critical

Bug: Using tensor to index nn.ModuleList will fail.

When iterating over expert_hit (from nonzero()), each expert_idx is a 1D tensor of shape (1,), not a Python integer. nn.ModuleList.__getitem__ expects an integer index, so self.gate_proj[expert_idx] will raise a TypeError.

Convert expert_idx to an integer before indexing:

🐛 Proposed fix
-    for expert_idx in expert_hit[:]:
-        assert expert_idx.numel() == 1, expert_idx
+    for expert_idx in expert_hit:
+        expert_idx = expert_idx.item()
         with torch.no_grad():
-            _, token_idx = torch.where(expert_mask[expert_idx[0]])
+            _, token_idx = torch.where(expert_mask[expert_idx])
         current_state = hidden_states[token_idx]
         gate = self.gate_proj[expert_idx](current_state)
         up = self.up_proj[expert_idx](current_state)
         gated_output = up * self.act_fn(gate)
         out = self.down_proj[expert_idx](gated_output)
-        weighted_output = out * routing_weights[token_idx, expert_idx, None]
+        weighted_output = out * routing_weights[token_idx, expert_idx:expert_idx+1, None]
         next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

Note: The routing_weights[token_idx, expert_idx, None] indexing at line 639 may also need adjustment after converting to int, depending on the expected tensor shape.

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 630 - 640,
expert_idx is a 1-element tensor and is being used to index nn.ModuleList
(self.gate_proj/self.up_proj/self.down_proj) and routing_weights, which raises a
TypeError; convert expert_idx to a Python int (e.g., expert_idx_int =
int(expert_idx.item() or expert_idx[0].item()) ) before using it to index the
ModuleList entries and use that int for routing_weights indexing
(routing_weights[token_idx, expert_idx_int, None]) while keeping token_idx as a
tensor for per-token selection, then proceed with gate_proj/up_proj/down_proj
using the integer index and next_states.index_add_ as before.


@Edwardf0t1 Edwardf0t1 left a comment


LGTM, left a few minor comments.

@shengliangxu shengliangxu requested a review from kaix-nv January 16, 2026 19:43

@Victor49152 Victor49152 left a comment


The new change to the attribute naming has been verified; this should be clear to go.
