Add Quantizers for Qwen3VLMoeTextDecoderLayer #666
base: main
Conversation
Force-pushed 2c18fd0 to 7a16307 (Compare)
Codecov Report
✅ All modified and coverable lines are covered by tests.

@@           Coverage Diff           @@
##             main     #666   +/-   ##
=======================================
  Coverage   74.22%   74.22%
=======================================
  Files         192      192
  Lines       19035    19035
=======================================
  Hits        14129    14129
  Misses       4906     4906

☔ View full report in Codecov by Sentry.
shengliangxu left a comment
LGTM
Force-pushed f5c78c9 to ff666d5 (Compare)
Signed-off-by: Qidong Su <[email protected]>
Force-pushed ff666d5 to 37c24f4 (Compare)
Important: Review skipped. Auto incremental reviews are disabled on this repository; please check the settings in the CodeRabbit UI.

📝 Walkthrough
A new quantization wrapper class is introduced for the Qwen3VL MoE text experts module so that its expert projections can be quantized.

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 2
🤖 Fix all issues with AI agents
In `@modelopt/torch/quantization/plugins/huggingface.py`:
- Around line 630-640: expert_idx is a 1-element tensor and is being used to
index nn.ModuleList (self.gate_proj/self.up_proj/self.down_proj) and
routing_weights, which raises a TypeError; convert expert_idx to a Python int
(e.g., expert_idx_int = int(expert_idx.item() or expert_idx[0].item()) ) before
using it to index the ModuleList entries and use that int for routing_weights
indexing (routing_weights[token_idx, expert_idx_int, None]) while keeping
token_idx as a tensor for per-token selection, then proceed with
gate_proj/up_proj/down_proj using the integer index and next_states.index_add_
as before.
- Around line 574-616: In _QuantQwen3VLMoeTextExperts._setup the code references
a non-existent attribute self.expert_dim when slicing self.gate_up_proj, which
will raise AttributeError; update the slices to use the correct per-expert size
attribute self.moe_intermediate_size (i.e. replace uses of self.expert_dim in
the gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight
calls for gate_proj and up_proj use the proper slice, leaving down_proj and
attribute reassignment (gate_up_proj → gate_proj, down_proj → down_proj)
unchanged.
📜 Review details
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/quantization/plugins/huggingface.py
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
modelopt/torch/quantization/conversion.py (1)
register(330-371)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (1)
modelopt/torch/quantization/plugins/huggingface.py (1)
735-755: LGTM! The registration blocks follow the established pattern used throughout this file for conditional module registration. The try/except ImportError guards appropriately handle cases where the transformers version doesn't include the Qwen3VL MoE model.
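For readers unfamiliar with that pattern, a minimal sketch of a guarded registration is shown below. The mapping dict is hypothetical; the real registration goes through the register API in modelopt/torch/quantization/conversion.py rather than this simplified dict.

    # Sketch only: guard registration so older transformers versions that do not
    # ship the Qwen3VL MoE model don't break the import of this plugin module.
    try:
        from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import (
            Qwen3VLMoeTextExperts,
        )

        # Hypothetical mapping from the original HF module to its quantized wrapper.
        QUANT_MODULE_MAPPING = {Qwen3VLMoeTextExperts: _QuantQwen3VLMoeTextExperts}
    except ImportError:
        # transformers is too old to include Qwen3VL MoE; skip registration silently.
        pass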
class _QuantQwen3VLMoeTextExperts(QuantModule):
    def _setup(self):
        """Modify the Qwen3VLMoeTextExperts by using nn.Linear layers."""
        from accelerate import init_empty_weights

        dtype, device = self.gate_up_proj.dtype, self.gate_up_proj.device

        def _copy_weight(module, weight):
            module.to_empty(device=device)
            with torch.no_grad():
                module.weight.data = weight.detach().data.to(dtype=dtype, device=device)

        with init_empty_weights():
            gate_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            up_proj = nn.ModuleList(
                [
                    nn.Linear(self.hidden_size, self.intermediate_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )
            down_proj = nn.ModuleList(
                [
                    nn.Linear(self.intermediate_size, self.hidden_size, bias=False)
                    for _ in range(self.num_experts)
                ]
            )

        for idx in range(self.num_experts):
            _copy_weight(gate_proj[idx], self.gate_up_proj[idx, :, : self.expert_dim].T)
            _copy_weight(up_proj[idx], self.gate_up_proj[idx, :, self.expert_dim :].T)
            _copy_weight(down_proj[idx], self.down_proj[idx, :].T)

        delattr(self, "gate_up_proj")
        delattr(self, "down_proj")
        self.gate_proj = gate_proj
        self.up_proj = up_proj
        self.down_proj = down_proj
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
# First check if transformers is available and inspect Qwen3VLMoeTextExperts
python3 << 'EOF'
import sys
try:
    # Try to import the class
    from transformers.models.qwen3_vl_moe.modeling_qwen3_vl_moe import Qwen3VLMoeTextExperts
    import inspect
    print("=== Qwen3VLMoeTextExperts Class Found ===")
    # Get source code
    try:
        source = inspect.getsource(Qwen3VLMoeTextExperts)
        # Check for expert_dim
        if 'expert_dim' in source:
            print("✓ expert_dim FOUND in class source")
        else:
            print("✗ expert_dim NOT found in class source")
        # Check for intermediate_size
        if 'intermediate_size' in source:
            print("✓ intermediate_size found in class source")
        # Check __init__ signature
        init_sig = inspect.signature(Qwen3VLMoeTextExperts.__init__)
        print(f"__init__ parameters: {list(init_sig.parameters.keys())}")
    except Exception as e:
        print(f"Could not get source: {e}")
except ImportError as e:
    print(f"✗ Could not import Qwen3VLMoeTextExperts: {e}")
    print("\nTrying alternative import paths...")
    try:
        from transformers import AutoModel
        print("AutoModel available, but Qwen3VLMoeTextExperts not directly importable")
    except:
        print("Transformers not fully available")
EOF

Repository: NVIDIA/Model-Optimizer
Length of output: 152
🏁 Script executed:
#!/bin/bash
# Also search the reviewed file for any expert_dim references
echo "=== Searching reviewed file for expert_dim references ==="
if [ -f "modelopt/torch/quantization/plugins/huggingface.py" ]; then
    rg "expert_dim" modelopt/torch/quantization/plugins/huggingface.py -n -C 2
else
    echo "File not found at expected location"
fi

Repository: NVIDIA/Model-Optimizer
Length of output: 436
🌐 Web query:
Qwen3VLMoeTextExperts expert_dim attribute huggingface transformers
💡 Result:
Short answer: Hugging Face's Qwen3/VL-MoE configs do not expose an attribute named expert_dim. For MoE in transformers you should use:
- num_experts — number of experts. [1][2]
- num_experts_per_tok (top‑k) — number of experts routed per token. [1][2]
- moe_intermediate_size — the hidden (intermediate) size used inside each routed expert (what some codebases call an expert hidden-dimension). [1][3]
Evidence:
- Qwen3MoE / Qwen3VLMoeConfig docs list num_experts, num_experts_per_tok and moe_intermediate_size but do not define expert_dim. [1][2][3]
- Other libraries sometimes use the name expert_dim (e.g., MindSpore examples) to mean either the number of experts or the per‑expert dimension depending on their API — so naming differs between frameworks. [4][5]
If you expected an expert_dim field in transformers, use moe_intermediate_size for per-expert hidden size and num_experts for expert count. References: Hugging Face transformers Qwen3/Qwen3-VL-MoE docs and MindSpore MoE docs. [1][2][3][4]
Sources
[1] Hugging Face — Qwen3MoE config docs.
[2] Hugging Face — Qwen3-VL-Moe config docs.
[3] Hugging Face (alternate version) — Qwen3-VL-Moe parameters.
[4] MindSpore MoE implementation (expert_dim usage).
[5] MindSpore MoE (other release).
Remove undefined expert_dim attribute reference — this will cause AttributeError at runtime.
The expert_dim attribute does not exist on HuggingFace's Qwen3VLMoeTextExperts class. Lines 607–608 access self.expert_dim, which will fail at runtime. The correct attribute for the per-expert intermediate size is moe_intermediate_size. Replace:
    self.gate_up_proj[idx, :, : self.expert_dim].T
    self.gate_up_proj[idx, :, self.expert_dim :].T

with:

    self.gate_up_proj[idx, :, : self.moe_intermediate_size].T
    self.gate_up_proj[idx, :, self.moe_intermediate_size :].T

🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 574 - 616,
In _QuantQwen3VLMoeTextExperts._setup the code references a non-existent
attribute self.expert_dim when slicing self.gate_up_proj, which will raise
AttributeError; update the slices to use the correct per-expert size attribute
self.moe_intermediate_size (i.e. replace uses of self.expert_dim in the
gate_up_proj slicing with self.moe_intermediate_size) so the _copy_weight calls
for gate_proj and up_proj use the proper slice, leaving down_proj and attribute
reassignment (gate_up_proj → gate_proj, down_proj → down_proj) unchanged.
for expert_idx in expert_hit[:]:
    assert expert_idx.numel() == 1, expert_idx
    with torch.no_grad():
        _, token_idx = torch.where(expert_mask[expert_idx[0]])
    current_state = hidden_states[token_idx]
    gate = self.gate_proj[expert_idx](current_state)
    up = self.up_proj[expert_idx](current_state)
    gated_output = up * self.act_fn(gate)
    out = self.down_proj[expert_idx](gated_output)
    weighted_output = out * routing_weights[token_idx, expert_idx, None]
    next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))
Bug: Using tensor to index nn.ModuleList will fail.
When iterating over expert_hit (from nonzero()), each expert_idx is a 1D tensor of shape (1,), not a Python integer. nn.ModuleList.__getitem__ expects an integer index, so self.gate_proj[expert_idx] will raise a TypeError.
Convert expert_idx to an integer before indexing:
🐛 Proposed fix
-        for expert_idx in expert_hit[:]:
-            assert expert_idx.numel() == 1, expert_idx
+        for expert_idx in expert_hit:
+            expert_idx = expert_idx.item()
             with torch.no_grad():
-                _, token_idx = torch.where(expert_mask[expert_idx[0]])
+                _, token_idx = torch.where(expert_mask[expert_idx])
             current_state = hidden_states[token_idx]
             gate = self.gate_proj[expert_idx](current_state)
             up = self.up_proj[expert_idx](current_state)
             gated_output = up * self.act_fn(gate)
             out = self.down_proj[expert_idx](gated_output)
-            weighted_output = out * routing_weights[token_idx, expert_idx, None]
+            weighted_output = out * routing_weights[token_idx, expert_idx:expert_idx+1, None]
             next_states.index_add_(0, token_idx, weighted_output.to(hidden_states.dtype))

Note: The routing_weights[token_idx, expert_idx, None] indexing at line 639 may also need adjustment after converting to int, depending on the expected tensor shape.
🤖 Prompt for AI Agents
In `@modelopt/torch/quantization/plugins/huggingface.py` around lines 630 - 640,
expert_idx is a 1-element tensor and is being used to index nn.ModuleList
(self.gate_proj/self.up_proj/self.down_proj) and routing_weights, which raises a
TypeError; convert expert_idx to a Python int (e.g., expert_idx_int =
int(expert_idx.item() or expert_idx[0].item()) ) before using it to index the
ModuleList entries and use that int for routing_weights indexing
(routing_weights[token_idx, expert_idx_int, None]) while keeping token_idx as a
tensor for per-token selection, then proceed with gate_proj/up_proj/down_proj
using the integer index and next_states.index_add_ as before.
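To see the explicit-int pattern in isolation, here is a toy, self-contained sketch. Shapes and values are placeholders, and whether a one-element tensor index is accepted by nn.ModuleList can vary across PyTorch versions, so converting via .item() keeps the intent unambiguous either way.

    import torch
    import torch.nn as nn

    # Toy sketch of indexing per-expert layers with a plain Python int.
    # All shapes and values below are placeholders, not project code.
    num_experts, hidden, tokens = 2, 4, 3
    experts = nn.ModuleList([nn.Linear(hidden, hidden, bias=False) for _ in range(num_experts)])
    routing_weights = torch.rand(tokens, num_experts)
    hidden_states = torch.randn(tokens, hidden)

    expert_idx = torch.tensor([1])    # what nonzero() yields per loop iteration
    expert_idx = expert_idx.item()    # convert the 1-element tensor to a Python int

    token_idx = torch.tensor([0, 2])  # tokens routed to this expert
    out = experts[expert_idx](hidden_states[token_idx])
    weighted = out * routing_weights[token_idx, expert_idx, None]  # broadcast [2, 1] over [2, 4]
    print(weighted.shape)  # torch.Size([2, 4])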
Edwardf0t1 left a comment
LGTM, left a few minor comments.
Co-authored-by: Zhiyu <[email protected]>
Signed-off-by: Qidong Su <[email protected]>
Victor49152 left a comment
The new attribute-naming change has been verified; should be good to go.
What does this PR do?
Type of change: New feature
Overview: The Hugging Face transformers library implements the Qwen3VL MoE layer as a monolithic module rather than assembling it from Linear layers, so ModelOpt's quantizer currently cannot recognize it. This PR introduces a conversion from HF's qwen3vl_moe MoE layers to qwen3_moe-style MoE layers built from a set of Linear layers (a rough usage sketch follows).
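For orientation, a rough sketch of the end-to-end flow this enables is shown below; the checkpoint name, calibration loop, and quantization config are placeholders and are not taken from this PR's test setup.

    # Rough sketch only: placeholder checkpoint, calibration loop, and config.
    import modelopt.torch.quantization as mtq
    from transformers import AutoModelForImageTextToText

    model = AutoModelForImageTextToText.from_pretrained("Qwen/Qwen3-VL-MoE-placeholder")

    def forward_loop(model):
        # Run a small calibration dataset through the model here.
        ...

    # With this PR, the fused Qwen3VLMoeTextExperts module is rewritten into
    # per-expert nn.Linear layers during conversion, so the quantizer can insert
    # input/weight quantizers for each expert projection.
    model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)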
Testing
Tested with
Before your PR is "Ready for review"
Additional Information